{"title": "On Efficient Heuristic Ranking of Hypotheses", "book": "Advances in Neural Information Processing Systems", "page_first": 444, "page_last": 450, "abstract": null, "full_text": "On Efficient Heuristic Ranking of \n\nHypotheses \n\nSteve Chien, Andre Stechert, and Darren Mutz \n\nJet Propulsion Laboratory, California Institute of Technology \n\n4800 Oak Grove Drive, MIS 525-3660, Pasadena, CA 91109-8099 \n\nsteve.chien@jpl.nasa.gov, Voice: (818) 306-6144 FAX: (818) 306-6912 \n\nContent Areas: Applications (Stochastic Optimization),Model Selection Algorithms \n\nAbstract \n\nThis paper considers the problem of learning the ranking of a set \nof alternatives based upon incomplete information (e.g., a limited \nnumber of observations). We describe two algorithms for hypoth(cid:173)\nesis ranking and their application for probably approximately cor(cid:173)\nrect (PAC) and expected loss (EL) learning criteria. Empirical \nresults are provided to demonstrate the effectiveness of these rank(cid:173)\ning procedures on both synthetic datasets and real-world data from \na spacecraft design optimization problem. \n\n1 \n\nINTRODUCTION \n\nIn many learning applications, the cost of information can be quite high, imposing \na requirement that the learning algorithms glean as much usable information as \npossible with a minimum of data. For example: \n\n\u2022 In speedup learning, the expense of processing each training example can \n\nbe significant [Tadepalli921. \n\n\u2022 In decision tree learning, ihe cost of using all available training examples \n\nwhen evaluating potential attributes for partitioning can be computation(cid:173)\nally ex.pensive [Musick93]. \n\u2022 In evaruating medical treatment policies, additional training examples im-\nEly suboptimal treatment of human subjects. \n\u2022 n data-poor applications, training data may be very scarce and learning \nas well as possible from limited data may be key. 
\n\nThis paper provides a statistical decision-theoretic framework for the ranking of \nparametric distributions. This framework will provide the answers to a wide range \nof questions about algorithms such as: how much information is enough? At what \npoint do we have adequate information to rank the alternatives with some requested \nconfidence? \n\n\fOn Efficient Heuristic Ranking of Hypotheses \n\n445 \n\nThe remainder of this paper is structured as follows. First, we describe the hypoth(cid:173)\nesis ranking problem more formally, including definitions for the probably approxi(cid:173)\nmately correct (PAC) and expected loss (EL) decision criteria. We then define two \nalgorithms for establishing these criteria for the hypothesis ranking problem - a re(cid:173)\ncursive hypothesis selection algorithm and an adjacency based algorithm. Next, we \ndescribe empirical tests demonstrating the effectiveness of these algorithms as well \nas documenting their improved performance over a standard algorithm from the sta(cid:173)\ntistical ranking literature. Finally, we describe related work and future extensions \nto the algorithms. \n\n2 HYPOTHESIS RANKING PROBLEMS \nHypothesis ranking problems, an extension of hypothesis selection problems, are an \nabstract class of learning problems where an algorithm is given a set of hypotheses \nto rank according to expected utility over some unknown distribution, where the \nexpected utility must be estimated from training data. \nIn many of these applications, a system chooses a single alternative and never re(cid:173)\nvisits the decision. However, some systems require the ability to investigate several \noptions (either serially or in parallel), such as in beam search or iterative broad(cid:173)\nening, where the ranking formulation is most appropriate. 
Also, as is the case \nwith evolutionary approaches, a system may need to populate future alternative \nhypotheses on the basis of the ranking of the current population[Goldberg89] . \nIn any hypothesis evaluation problem, always achieving a correct ranking is im(cid:173)\npossible in practice, because the actual underlying probability distributions are \nunavailable and there is always a (perhaps vanishingly) small chance that the al(cid:173)\ngorithms will be unlucky because only a finite number of samples can be taken. \nConsequently, rather than always requiring an algorithm to output a correct rank(cid:173)\ning, we impose probabilistic criteria on the rankings to be produced. While several \nfamilies of such requirements exist, in this paper we examine two, the probably \napproximately correct (PAC) requirement from the computational learning theory \ncommunity [Valiant84] and the expected loss (EL) requirement frequently used in \ndecision theory and gaming problems [Russe1l92] . \nThe expected utility of a hypothesis can be estimated by observing its values over a \nfinite set of training examples. However, to satisfy the PAC and EL requirements, \nan algorithm must also be able to reason about the potential difference between \nthe estimated and true utilities of each hypotheses. Let Ui be the true expected \nutility of hypothesis i and let Ui be the estimated expected utility of hypothesis i. \nWithout loss of generality, let us presume that the proposed ranking of hypotheses \nis U1 > U2 >, ... , > Uk-I> Uk. The PAC requirement states that for some user(cid:173)\nspecified \u00a3. with probability 1 - 8: \n\nk-l \n\n/\\ [(Ui + f) > MAX(Ui+I, ... ,UIe)] \n\n;=1 \n\n(1) \n\nCorrespondingly, let the loss L of selecting a hypothesis HI to be the best from a \nset of k hypotheses HI, ... , Hk be as follows . \n\nL(HI' {HI, ... ,HIe}) = MAX(O, MAX(U2 , ... ,UIe) - UI) \n\nand let the loss RL of a ranking H 1, ... , H k be as follows. \n\nRL(Hl, ... 
, Hie) = L L(Hi, {Hi+l, ... , Hie}) \n\nIe-I \n\ni=1 \n\nA hypothesis ranking algorithm which obeys the expected loss requirement must \nproduce rankings that on average have less than the requested expected loss bound. \n\n(2) \n\n(3) \n\n\f446 \n\nS. Chien, A. Stechert and D. Mutz \n\nConsider ranking the hypotheses with expected utilities: U1 = 1.0, U2 = 0.95, U3 = \n0.86. The ranking U2 > U1 > U3 is a valid PAC ranking for { = 0.06 but not for \n{ = 0.01 and has an observed loss of 0.05 + 0 = 0.05. \nHowever, while the confidence in a pairwise comparison between two hypotheses is \nwell understood, it is less clear how to ensure that desired confidence is met in the \nset of comparisons required for a selection or the more complex set of comparisons \nrequired for a ranking. Equation 4 defines the confidence that Ui + { > Uj, when \nthe distribution underlying the utilities is normally distributed with unknown and \nunequal variances. \n\n(4) \n\nwhere \u00a2 represents the cumulative standard normal distribution function, and n, \nUi-j, and Si-j are the size, sample mean, and sample standard deviation of the \nblocked differential distribution, respectively 1 . \nLikewise, computation of the expected loss for asserting an ordering between a pair \nof hypotheses is well understood, but the estimation of expected loss for an entire \nranking is less clear. Equation 5 defines the expected loss for drawing the conclusion \nUi > Uj, again under the assumption ~fnormality (see [Chien95] for further details). \n\n' e \nEL(Ui > Ujl = .-] \n\n~\n\nU'-i :l \n-O .6n ( ) \nSi_j \n\n';21rn \n\nfJ -j \n\noc \n\n+ ~ e- O \u2022 6 \u2022 dz \n\n:l \n\n.,j2; \n\n_ \\~irn \n\n(5) \n\n'-J \n\nIn the next two subsections, we describe two interpretations for estimating the like(cid:173)\nlihood that an overall ranking satisfies the PAC or EL requirements by estimating \nand combining pairwise PAC errors or EL estimates. 
Each of these interpretations \nlends itself directly to an algorithmic implementation as described below. \n\n2.1 RANKING AS RECURSIVE SELECTION \nOne way to determine a ranking HI, ... , Hk is to view ranking as recursive selection \nfrom the set of remaining candidate hypotheses. In this view, the overall ranking \nerror, as specified by the desired confidence in PAC algorithms and the loss thresh(cid:173)\nhold in EL algorithms, is first distributed among k - 1 selection errors which are \nthen further subdivided into pairwise comparison errors. Data is then sampled un(cid:173)\ntil the estimates of the pairwise comparison error (as dictated by equation 4 or 5) \nsatisfy the bounds set by the algorithm. \n\nThus, another degree of freedom in the design of recursive ranking algorithms is \nthe method by which the overall ranking error is ultimately distributed among \nindividual pairwise comparisons between hypotheses. Two factors influence the \nway in which we compute error distribution. First, our model of error combination \ndetermines how the error allocated for individual comparisons or selections combines \ninto overall ranking error and thus how many candidates are available as targets \nfor the distribution. Using Bonferroni's inequality, one combine errors additively, \nbut a more conservative approach might be to assert that because the predicted \n\"best\" hypothesis may change during sampling in the worst case the conclusion \nmight depend on all possible pairwise comparisons and thus the error should be \ndistributed among all (~) pairs of hypotheses2 ). \n\nINote that in our approach we block examples to further reduce sampling complexity. \nBlocking forms estimates by using the difference in utility between competing hypotheses \non each observed example. Blocking can significantly reduce the variance in the data when \nthe hypotheses are not independent. 
It is trivial to modify the formulas to address the \ncases in which it is not possible to block data (see [Moore94, Chien95] for further details). \n\n2For a discussion of this issue, see pp. 18-20 of [Gratch93]. \n\n\fOn Efficient Heuristic Ranking of Hypotheses \n\n447 \n\nSecond, our policy with respect to allocation of error among the candidate com(cid:173)\nparisons or selections determines how samples will be distributed. For example, in \nsome contexts, the consequences of early selections far outweigh those of later se(cid:173)\nlections. For these scenarios, we have implemented ranking algorithms that divide \noverall ranking error unequally in favor of earlier selections3 . Also, it is possible to \ndivide selection error into pairwise error unequally based on estimates of hypothesis \nparameters in order to reduce sampling cost (for example, [Gratch94] allocates error \nrationally) . \n\nWithin the scope of this paper, we only consider algorithms that: (1) combine \npairwise error into selection error additively, (2) combine selection error into overall \nranking error additively and (3) allocate error equally at each level. \n\nOne disadvantage of recursive selection is that once a hypothesis has been selected, \nit is removed from the pool of candidate hypotheses. This causes problems in rare \ninstances when, while sampling to increase the confidence of some later selection, \nthe estimate for a hypothesis' mean changes enough that some previously selected \nhypothesis no longer dominates it. In this case, the algorithm is restarted taking \ninto account the data sampled so far. \nThese assumptions result in the following formulations (where d(U11>\u00a3 {U2' ... , Uk}) \nis used to denote the error due to the action of selecting hypothesis 1 under Equation \n1 from the set {HI, ... , Hk} and d(UII>{U2, ... , Uk}) denotes the error due to selection \nloss in situations where Equation 2 applies): \n\nt5rec(UI > U2 > ... 
> Uk) = \n\nt5rec (U2 > U3 > ... > Uk) \n+t5(UI t>. {U2 , \u2022\u2022\u2022 ,Uk}) \n\n(6) \n\nwhere drec(Uk) = 0 (the base case for the recursion) and the selection error is as \ndefined in [Chien95]: \n\nt5(Ul t>. {U2 , \u2022\u2022\u2022 ,Uk}) = L 15 1 ,. \n\nk \n\n.=2 \n\n(7) \n\nusing Equation 4 to compute pairwise confidence. \n\nAlgorithmically, we implement this by: \n\n1. sampling a default number of times to seed the estimates for each hypothesis \n\n2. allocating the error to selection and pairwise comparisons as indicated \n\nmean and variance, \n\nabove, \n\n3. sampling until the desired confidences for successive selections is met, and \n4. restarting the algorithm if any of the hypotheses means changed signifi(cid:173)\n\ncantly enough to change the overall ranking. \n\nAn analogous recursive selection algorithm based on expected loss is defined as \nfollows. \n\nEL rec(U2 > U3 > ... > Uk) \n+EL(U1 t> {U2 , \u2022\u2022\u2022 ,Uk}) \n\nwhere ELrec(Uk) = 0 and the selection EL is as defined in [Chien95]: \n\nEL(U1 I> {U2, ... , Uk}) = L EL(Ut, Ud \n\nk \n\ni=2 \n\n3Space constraints preclude their description here. \n\n(8) \n\n(9) \n\n\f448 \n\nS. Chien, A. StecheT1 and D. Mutz \n\n2.2 RANKING BY COMPARISON OF ADJACENT ELEMENTS \nAnother interpretation of ranking confidence (or loss) is that only adjacent elements \nin the ranking need be compared. In this case, the overall ranking error is divided \ndirectly into k -1 pairwise comparison errors. This leads to the following confidence \nequation for the PAC criteria: \n\n(10) \n\n(11) \n\ndac(i(Ul > U2 > ... > Uk) = Ldi,i+1 \n\nk-l \n\nAnd the following equation for the EL criteria.k_l \n\nELac(i(Ul > U2 > ... 
> U_k) = Σ_{i=1}^{k-1} EL(U_i, U_{i+1})    (11) \n\nBecause ranking by comparison of adjacent hypotheses does not establish the dominance between non-adjacent hypotheses (where the hypotheses are ordered by observed mean utility), it has the advantage of requiring fewer comparisons than recursive selection (and thus may require fewer samples than recursive selection). However, for the same reason, adjacency algorithms may be less likely to correctly bound the probability of correct selection (or average loss) than the recursive selection algorithms. In the case of the PAC algorithms, this is because ε-dominance is not necessarily transitive. In the case of the EL algorithms, it is because expected loss is not additive when considering two hypothesis relations sharing a common hypothesis. For instance, the size of the blocked differential distribution may be different for each of the pairs of hypotheses being compared. \n\n2.3 OTHER RELEVANT APPROACHES \n\nMost standard statistical ranking/selection approaches make strong assumptions about the form of the problem (e.g., the variances associated with the underlying utility distribution of the hypotheses might be assumed known and equal). Among these, Turnbull and Weiss [Turnbull84] is most comparable to our PAC-based approach^4. Turnbull and Weiss treat hypotheses as normal random variables with unknown mean and unknown and unequal variance. However, they make the additional stipulation that hypotheses are independent. So, while it is still reasonable to use this approach when the candidate hypotheses are not independent, excessive statistical error or unnecessarily large training set sizes may result. \n\n3 EMPIRICAL PERFORMANCE EVALUATION \n\nWe now turn to empirical evaluation of the hypothesis ranking techniques on real-world datasets. This evaluation serves three purposes. 
First, it demonstrates that \nthe techniques perform as predicted (in terms of bounding the probability of incor(cid:173)\nrect selection or expected loss). Second, it validates the performance of the tech(cid:173)\nniques as compared to standard algorithms from the statistical literature. Third, \nthe evaluation demonstrates the robustness of the new approaches to real-world \nhypothesis ranking problems. \n\nAn experimental trial consists of solving a hypothesis ranking problem with a given \ntechnique and a given set of problem and control parameters. We measure perfor(cid:173)\nmance by (1) how well the algorithms satisfy their respective criteria; and (2) the \nnumber of samples taken. Since the performance of these statistical algorithms on \nany single trial provides little information about their overall behavior, each trial \nis repeated multiple times and the results are averaged across 100 trials. Because \n\n4 PAC-based approaches have been investigated extensively in the statistical ranking and \nselection literature under the topic of confidence interval based algorithms (see [Haseeb85] \nfor a review of the recent literature). \n\n\fOn Efficient Heuristic Ranking of Hypotheses \n\n449 \n\nTable 1: Estimated expected total number of observations to rank DS-2 spacecraft \ndesigns. Achieved \n\n. h ' \n\nthesis. \n\nk' \n\nf \n\nt \n\n'Y \n\npro a I I yo correc ran mg IS sown m paren \nk \n10 \n10 \n10 \n\nb bTt \n!!.. \n2 \n2 \n2 \n\nPACrec \n144 1.00 \n160 1.00 \n177 1.00 \n\nPACod ' \n92 0.98 \n98 1.00 \n103 0.99 \n\n534 {0.96 \n667 (0 .98 \n793 (0.99 \n\n0.75 \n0.90 \n0.95 \n\nTURNtlULL \n\nTable 2: Estimated expected total number of observations and expected loss of an \nincorrect ranking of DS-2 penetrator designs. 
\n\nrec \n\nEL \nParameters \nk ~ Samples \n10 \n152 \n200 \n10 \n10 \n378 \n\n0.10 \n0.05 \n0.02 \n\nEL d ' \n\na \n\nLoss \n0.005 \n0.003 \n0 .003 \n\nl:)amples \n77 \n90 \n139 \n\nl..oss \n0.014 \n0.006 \n0.003 \n\nthe PAC and expected loss criteria are not directly comparable, the approaches are \nanalyzed separately. \n\nExperimental results from synthetic datasets are reported in [Chien97]. The eval(cid:173)\nuation of our approach on artificially generated data is used to show that: (1) the \ntechniques correctly bound probability of incorrect ranking and expected loss as pre(cid:173)\ndicted when the underlying assumptions are valid even when the underlying utility \ndistributions are inherently hard to rank, and (2) that the PAC techniques com(cid:173)\npare favorably to the algorithm of Thrnbull and Weiss in a wide variety of problem \nconfigurations. \nThe test of real-world applicability is based on data drawn from an actual NASA \nspacecraft design optimization application. This data provides a strong test of the \napplicability of the techniques in that all of the statistical techniques make some \nform of normality assumption - yet the data in this application is highly non-normal. \n\nTables 1 and 2 show the results of ranking 10 penetrator designs using the PAC(cid:173)\nbased, Thrnbull, and expected loss algorithms In this problem the utility function \nis the depth of penetration of the penetrator, with those cases in which the pen(cid:173)\netrator does not penetrate being assigned zero utility. As shown in Table 1, both \nPAC algorithms significantly outperformed the Thrnbull algorithm, which is to be \nexpected because the hypotheses are somewhat correlated (via impact orientations \nand soil densities). Table 2 shows that the ELrec expected loss algorithm effectively \nbounded actual loss but the ELad,i algorithm was inconsistent. \n\n4 DISCUSSION AND CONCLUSIONS \nThere are a number of areas of related work. 
First, there has been considerable \nanalysis of hypothesis selection problems. Selection problems have been formalized \nusing a Bayesian framework [Moore94, Rivest88] that does not require an initial \nsample, but uses a rigorous encoding of prior knowledge. Howard [Howard70] also \ndetails a Bayesian framework for analyzing learning cost for selection problems. If \none uses a hypothesis selection framework for ranking, allocation of pairwise errors \ncan be performed rationally [Gratch94]. Reinforcement learning work [Kaelbling93] \nwith immediate feedback can also be viewed as a hypothesis selection problem. \nIn su~mary, this paper has described the hypothesis ranking problem, an extension \nto the hypothesis selection problem. We defined the application of two decision \ncriteria, probably approximately correct and expected loss, to this problem. We then \ndefined two families of algorithms, recursive selection and adjacency, for solution of \nhypothesis ranking problems. Finally, we demonstrated the effectiveness of these \nalgorithms on both synthetic and real-world datasets, documenting improved per(cid:173)\nformance over existing statistical approaches. \n\n\f450 \n\nReferences \n\nS. Chien, A. Stechert and D. Mutz \n\n[Bechhofer54] R.E. Bechhofer, \"A Single-sample Multiple Decision Procedure for Ranking \nMeans of Normal Populations with Known Variances,\" Annals of Math. Statistics (25) \n1, 1954 pp. 16-39. \n\n[Chien95] S. A. Chien, J . M. Gratch and M. C. Burl, \"On the Efficient Allocation of \nResources for Hypothesis Evaluation: A Statistical Approach,\" IEEE Trans. Pattern \nAnalysis and Machine Intelligence 17 (7), July 1995, pp. 652-665. \n\n[Chien97] S. Chien, A. Stechert, and D. Mutz, \"Efficiently Ranking Hypotheses \n\nin Machine Learning,\" JPL-D-14661, June 1997. Available online at http://www(cid:173)\naig.jpl.nasa.gov/public/www/pas-bibliography.html \n\n[Goldberg89] D. 
Goldberg, Genetic Algorithms in Search, Optimization and Machine \n\nLearning, Add. Wes., 1989. \n\n[Govind81] Z. Govindarajulu, \"The Sequential Statistical Analysis,\" American Sciences \n\nPress, Columbus, OH, 1981. \n\n[Gratch92] J . Gratch and G. Dejong, \"COMPOSER: A Probabilistic Solution to the Util(cid:173)\nity Problem in Speed-up Learning,\" Proc. AAAI92, San Jose, CA, July 1992, pp. 235-\n240. \n\n[Gratch93] J. Gratch, \"COMPOSER: A Decision-theoretic Approach to Adaptive Problem \nSolving,\" Tech. Rep. UIUCDCS-R-93-1806, Dept. Compo Sci., Univ. Illinois, May 1993. \n[Gratch94] J. Gratch, S. Chien, and G. Dejong, \"Improving Learning Performance \nThrough Rational Resource Allocation,\" Proc. AAAI94, Seattle, WA, August 1994, \npp. 576-582. \n\n[Greiner92] R. Greiner and I. Jurisica, \"A Statistical Approach to Solving the EBL Utility \n\nProblem,\" Proc. AAAI92, San Jose, CA, July 1992, pp. 241-248. \n\n[Haseeb85] R. M. Haseeb, Modern Statistical Selection, Columbus, OH: Am. Sciences \n\nPress, 1985. \n\n[Hogg78] R. V. Hogg and A. T. Craig, Introduction to Mathematical Statistics, Macmillan \n\nInc., London, 1978. \n\n[Howard70] R. A. Howard, Decision Analysis: Perspectives on Inference, Decision, and \n\nExperimentation,\" Proceedings of the IEEE 58, 5 (1970), pp. 823-834. \n\n[Kaelbling93] L. P. Kaelbling, Learning in Embedded Systems, MIT Press, Cambridge, \n\nMA,1993. \n\n[Minton88] S. Minton, Learning Search Control Knowledge: An Explanation-Based Ap(cid:173)\n\nproach, Kluwer Academic Publishers, Norwell, MA, 1988. \n\n[Moore94] A. W. Moore and M. S. Lee, \"Efficient Algorithms for Minimizing Cross Vali(cid:173)\n\ndation Error,\" Proc. ML94, New Brunswick, MA, July 1994. \n\n[Musick93] R. Musick, J. Catlett and S. Russell, \"Decision Theoretic Subsampling for \n\nInduction on Large Databases,\" Proc. ML93, Amhert, MA, June 1993, pp. 212-219. \n\n[Rivest88] R. L. Rivest and R. Sloan, A New Model for Inductive Inference,\" Proc. 
2nd \n\nConference on Theoretical Aspects of Reasoning about Knowledge, 1988. \n\n[Russell92] S. Russell and E . Wefald, Do the Right Thing: Studies in Limited Rationality, \n\nMIT Press, MA. \n\n[Tadepalli92] P. Tadepalli, \"A theory of unsupervised speedup learning,\" Proc. AAAI92\" \n\npp. 229-234. \n\n[Turnbull84] Turnbull and Weiss, \"A class of sequential procedures for k-sample problems \nconcerning normal means with unknown unequal variances,\" in Design of Experiments: \nranking and selection, T. J. Santner and A. C. Tamhane (eds. ), Marcel Dekker, 1984. \n[Valiant84] L. G. Valiant, \"A Theory of the Learnable,\" Communications of the ACM 27, \n\n(1984), pp. 1134-1142. \n\n\f", "award": [], "sourceid": 1388, "authors": [{"given_name": "Steve", "family_name": "Chien", "institution": null}, {"given_name": "Andre", "family_name": "Stechert", "institution": null}, {"given_name": "Darren", "family_name": "Mutz", "institution": null}]}