{"title": "Prodding the ROC Curve: Constrained Optimization of Classifier Performance", "book": "Advances in Neural Information Processing Systems", "page_first": 1409, "page_last": 1415, "abstract": "", "full_text": "Prodding the ROC Curve: Constrained\nOptimization of Classi\ufb01er Performance\n\nMichael C. Mozer*+, Robert Dodier*, Michael D. Colagrosso*+,\n\nC\u00e9sar Guerra-Salcedo*, Richard Wolniewicz*\n\n* Advanced Technology Group + Department of Computer Science\n\nAthene Software\n2060 Broadway\n\nBoulder, CO 80302\n\nUniversity of Colorado\n\nCampus Box 430\n\nBoulder, CO 80309\n\nAbstract\n\nWhen designing a two-alternative classi\ufb01er, one ordinarily aims to maximize the\nclassi\ufb01er\u2019s ability to discriminate between members of the two classes. We\ndescribe a situation in a real-world business application of machine-learning\nprediction in which an additional constraint is placed on the nature of the solu-\ntion: that the classi\ufb01er achieve a speci\ufb01ed correct acceptance or correct rejection\nrate (i.e., that it achieve a \ufb01xed accuracy on members of one class or the other).\nOur domain is predicting churn in the telecommunications industry. Churn\nrefers to customers who switch from one service provider to another. We pro-\npose four algorithms for training a classi\ufb01er subject to this domain constraint,\nand present results showing that each algorithm yields a reliable improvement in\nperformance. Although the improvement is modest in magnitude, it is nonethe-\nless impressive given the dif\ufb01culty of the problem and the \ufb01nancial return that it\nachieves to the service provider.\n\nWhen designing a classi\ufb01er, one must specify an objective measure by which the classi-\n\ufb01er\u2019s performance is to be evaluated. One simple objective measure is to minimize the\nnumber of misclassi\ufb01cations. 
If the cost of a classification error depends on the target and/or response class, one might utilize a risk-minimization framework to reduce the expected loss. A more general approach is to maximize the classifier's ability to discriminate one class from another class (e.g., Chang & Lippmann, 1994).\n\nAn ROC curve (Green & Swets, 1966) can be used to visualize the discriminative performance of a two-alternative classifier that outputs class posteriors. To explain the ROC curve, a classifier can be thought of as making a positive/negative judgement as to whether an input is a member of some class. Two different accuracy measures can be obtained from the classifier: the accuracy of correctly identifying an input as a member of the class (a correct acceptance or CA), and the accuracy of correctly identifying an input as a nonmember of the class (a correct rejection or CR). To evaluate the CA and CR rates, it is necessary to pick a threshold above which the classifier's probability estimate is interpreted as an \u201caccept,\u201d and below which it is interpreted as a \u201creject\u201d (call this the criterion). The ROC curve plots the CA rate against the CR rate for various criteria (Figure 1a). Note that as the threshold is lowered, the CA rate increases and the CR rate decreases. For a criterion of 1, the CA rate approaches 0 and the CR rate approaches 1; for a criterion of 0, the CA rate approaches 1 and the CR rate approaches 0. Thus, the ROC curve is anchored at (0,1) and (1,0), and is monotonically nonincreasing. The degree to which the curve is bowed reflects the discriminative ability of the classifier. The dashed curve in Figure 1a is therefore a better classifier than the solid curve.\n\nFIGURE 1. (a) Two ROC curves reflecting discrimination performance (correct rejection rate vs. correct acceptance rate, both 0\u2013100%); the dashed curve indicates better performance. (b) Two plausible ROC curves, neither of which is clearly superior to the other.\n\nThe degree to which the curve is bowed can be quantified by various measures such as the area under the ROC curve or d', the distance between the positive and negative distributions. However, training a classifier to maximize either the ROC area or d' often yields the same result as training a classifier to estimate posterior class probabilities, or equivalently, to minimize the mean squared error (e.g., Frederick & Floyd, 1998). The ROC area and d' scores are useful, however, because they reflect a classifier's intrinsic ability to discriminate between two classes, regardless of how the decision criterion is set. That is, each point on an ROC curve indicates one possible CA/CR trade-off the classifier can achieve, and that trade-off is determined by the criterion. But changing the criterion does not change the classifier's intrinsic ability to discriminate.\n\nGenerally, one seeks to optimize the discrimination performance of a classifier. However, we are working in a domain where overall discrimination performance is not as critical as performance at a particular point on the ROC curve, and we are not interested in the remainder of the ROC curve. To gain an intuition as to why this goal should be feasible, consider Figure 1b. Both the solid and dashed curves are valid ROC curves, because they satisfy the monotonicity constraint: as the criterion is lowered, the CA rate does not decrease and the CR rate does not increase. 
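The criterion sweep just described can be made concrete with a small sketch. All scores and criteria below are hypothetical, chosen only to show how CA and CR rates trace out an empirical ROC curve:

```python
import numpy as np

def roc_points(scores_pos, scores_neg, criteria):
    """For each criterion, compute (CA rate, CR rate):
    CA = fraction of positives scoring above the criterion,
    CR = fraction of negatives scoring at or below it."""
    pts = []
    for c in criteria:
        ca = np.mean(scores_pos > c)   # correct acceptances
        cr = np.mean(scores_neg <= c)  # correct rejections
        pts.append((ca, cr))
    return pts

# Hypothetical posterior scores from a toy classifier.
pos = np.array([0.9, 0.8, 0.6, 0.4])
neg = np.array([0.7, 0.3, 0.2, 0.1])
curve = roc_points(pos, neg, criteria=[0.0, 0.5, 1.0])
# At criterion 0 the CA rate is 1 and the CR rate 0; at criterion 1 the
# reverse holds, matching the (0,1) and (1,0) anchors described above.
```

Sweeping many criteria rather than three would fill in the rest of the curve; the bow shape (or lack of it) depends entirely on how the positive and negative score distributions overlap.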
Although the bow shape of the solid curve is typical, it is not mandatory; the precise shape of the curve depends on the nature of the classifier and the nature of the domain. Thus, it is conceivable that a classifier could produce a curve like the dashed one. The dashed curve indicates better performance when the CA rate is around 50%, but worse performance when the CA rate is much lower or higher than 50%. Consequently, if our goal is to maximize the CR rate subject to the constraint that the CA rate is around 50%, or to maximize the CA rate subject to the constraint that the CR rate is around 90%, the dashed curve is superior to the solid curve. One can imagine that better performance can be obtained along some stretches of the curve by sacrificing performance along other stretches of the curve. Note that obtaining a result such as the dashed curve requires a nonstandard training algorithm, as the discrimination performance as measured by the ROC area is worse for the dashed curve than for the solid curve.\n\nIn this paper, we propose and evaluate four algorithms for optimizing performance in a certain region of the ROC curve. To begin, we explain the domain we are concerned with and why focusing on a certain region of the ROC curve is important in this domain.\n\n1 OUR DOMAIN\n\nAthene Software focuses on predicting and managing subscriber churn in the telecommunications industry (Mozer, Wolniewicz, Grimes, Johnson, & Kaushansky, 2000). \u201cChurn\u201d refers to the loss of subscribers who switch from one company to another. Churn is a significant problem for wireless, long distance, and Internet service providers. 
For example, in the wireless industry, domestic monthly churn rates are 2\u20133% of the customer base. Consequently, service providers are highly motivated to identify subscribers who are dissatisfied with their service and offer them incentives to prevent churn.\n\nWe use techniques from statistical machine learning (primarily neural networks and ensemble methods) to estimate the probability that an individual subscriber will churn in the near future. The prediction of churn is based on various sources of information about a subscriber, including: call detail records (date, time, duration, and location of each call, and whether the call was dropped due to lack of coverage or available bandwidth); financial information appearing on a subscriber's bill (monthly base fee, additional charges for roaming and usage beyond the monthly prepaid limit); complaints to the customer service department and their resolution; information from the initial application for service (contract details, rate plan, handset type, credit report); market information (e.g., rate plans offered by the service provider and its competitors); and demographic data.\n\nChurn prediction is an extremely difficult problem for several reasons. First, the business environment is highly nonstationary; models trained on data from a certain time period perform far better with hold-out examples from that same time period than with examples drawn from successive time periods. Second, the features available for prediction are only weakly related to churn; when computing the mutual information between individual features and churn, the greatest value we typically encounter is .01 bits. Third, information critical to predicting subscriber behavior, such as quality of service, is often unavailable.\n\nObtaining accurate churn predictions is only part of the challenge of subscriber retention. 
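To give a feel for how weak a feature with .01 bits of mutual information is, here is a minimal sketch of the computation for a binary feature and a binary churn label. The joint distribution below is invented for illustration and is not drawn from our data:

```python
import math

def mutual_information_bits(joint):
    """joint[x][y] = P(feature=x, churn=y) for binary x, y.
    Returns I(X;Y) = sum_xy p(x,y) log2( p(x,y) / (p(x) p(y)) ) in bits."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            p = joint[x][y]
            if p > 0:
                mi += p * math.log2(p / (px[x] * py[y]))
    return mi

# A nearly independent joint distribution: the feature barely moves the
# churn probability, so the mutual information is well under .01 bits.
joint = [[0.475, 0.025], [0.465, 0.035]]
mi = mutual_information_bits(joint)
```

Even features at the .01-bit level are only slightly less independent of the label than this toy example, which is why individual features carry so little predictive signal.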
Subscribers who are likely to churn must be contacted by a call center and offered some incentive to remain with the service provider. In a mathematically principled business scenario, one would frame the challenge as maximizing profitability to a service provider, and the decision about whether to contact a subscriber and what incentive to offer would be based on the expected utility of offering versus not offering an incentive. However, business practices complicate the scenario and place some unique constraints on predictive models. First, call centers are operated by a staff of customer service representatives who can contact subscribers at a fixed rate; consequently, our models cannot advise contacting 50,000 subscribers one week and 50 the next. Second, internal business strategies at the service providers constrain the minimum acceptable CA or CR rates (above and beyond the goal of maximizing profitability). Third, contracts that Athene makes with service providers will occasionally call for achieving a specific target CA and CR rate. These three practical issues pose formal problems which, to the best of our knowledge, have not been addressed by the machine learning community.\n\nThe formal problems can be stated in various ways, including: (1) maximize the CA rate, subject to the constraint that a fixed percentage of the subscriber base is identified as potential churners; (2) optimize the CR rate, subject to the constraint that the CA rate should be \u03b1_CA; (3) optimize the CA rate, subject to the constraint that the CR rate should be \u03b1_CR; and finally, what marketing executives really want, (4) design a classifier that has a CA rate of \u03b1_CA and a CR rate of \u03b1_CR. Problem (1) sounds somewhat different from problems (2) and (3), but it can be expressed in terms of a lift curve, which plots the CA rate as a function of the total fraction of subscribers identified by the model. 
Problem (1) thus imposes the constraint that the solution lies at one coordinate of the lift curve, just as problems (2) and (3) place the constraint that the solution lies at one coordinate of the ROC curve. Thus, a solution to problem (2) or (3) will also serve as a solution to (1). Although addressing problem (4) seems most fanciful, it encompasses problems (2) and (3), and thus we focus on it. Our goal is not altogether unreasonable, because a solution to problem (4) has the property we characterized in Figure 1b: the ROC curve can suffer everywhere except in the region near CA = \u03b1_CA and CR = \u03b1_CR. Hence, the approaches we consider will trade off performance in some regions of the ROC curve against performance in other regions. We call this prodding the ROC curve.\n\n2 FOUR ALGORITHMS TO PROD THE ROC CURVE\n\nIn this section, we describe four algorithms for prodding the ROC curve toward a target CA rate of \u03b1_CA and a target CR rate of \u03b1_CR.\n\n2.1 EMPHASIZING CRITICAL TRAINING EXAMPLES\n\nSuppose we train a classifier on a set of positive and negative examples from a class (churners and nonchurners in our domain). Following training, the classifier will assign a posterior probability of class membership to each example. The examples can be sorted by the posterior and arranged on a continuum anchored by probabilities 0 and 1 (Figure 2). We can identify the thresholds, \u03b8_CA and \u03b8_CR, which yield CA and CR rates of \u03b1_CA and \u03b1_CR, respectively. If the classifier's discrimination performance fails to achieve the target CA and CR rates, then \u03b8_CA will be lower than \u03b8_CR, as depicted in the figure. If we can bring these two thresholds together, we will achieve the target CA and CR rates. 
Thus, the first algorithm we propose involves training a series of classifiers, attempting to make classifier n+1 achieve better CA and CR rates by focusing its effort on examples from classifier n that lie between \u03b8_CA and \u03b8_CR; the positive examples must be pushed above \u03b8_CR and the negative examples must be pushed below \u03b8_CA. (Of course, the thresholds are specific to a classifier, and hence should be indexed by n.) We call this the emphasis algorithm, because it involves placing greater weight on the examples that lie between the two thresholds. In the figure, the emphasis for classifier n+1 would be on examples e5 through e8. This retraining procedure can be iterated until the classifier's training set performance reaches asymptote.\n\nIn our implementation, we define a weighting \u03bb_i^n of each example i for training classifier n. For classifier 1, \u03bb_i^1 = 1. For subsequent classifiers, \u03bb_i^{n+1} = \u03bb_i^n if example i is not in the region of emphasis, or \u03bb_i^{n+1} = \u03ba_e \u03bb_i^n otherwise, where \u03ba_e is a constant, \u03ba_e > 1.\n\n2.2 DEEMPHASIZING IRRELEVANT TRAINING EXAMPLES\n\nThe second algorithm we propose is related to the first, but takes a slightly different perspective on the continuum depicted in Figure 2. Positive examples below \u03b8_CA, such as e2, are clearly the most difficult positive examples to classify correctly. Not only are they the most difficult positive examples, but they do not in fact need to be classified correctly to achieve the target CA and CR rates. Threshold \u03b8_CR does not depend on examples such as e2, and threshold \u03b8_CA allows a fraction (1\u2013\u03b1_CA) of the positive examples to be classified incorrectly. Likewise, one can argue that negative examples above \u03b8_CR, such as e10 and e11, need not be of concern. 
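The emphasis weighting rule of Section 2.1 can be sketched in a few lines. The posteriors, thresholds, and the choice \u03ba_e = 1.3 below are illustrative placeholders, not values tied to any particular data set:

```python
def update_emphasis_weights(weights, posteriors, theta_ca, theta_cr, kappa_e=1.3):
    """Multiply by kappa_e (> 1) the weight of every example whose posterior
    falls between the two thresholds; leave all other weights unchanged."""
    return [w * kappa_e if theta_ca <= p <= theta_cr else w
            for w, p in zip(weights, posteriors)]

# Hypothetical posteriors for six examples; classifier 1 starts with weight 1.
posteriors = [0.05, 0.30, 0.45, 0.55, 0.70, 0.95]
weights = [1.0] * len(posteriors)
weights = update_emphasis_weights(weights, posteriors, theta_ca=0.4, theta_cr=0.8)
# Only the examples with posteriors 0.45, 0.55, and 0.70 are reweighted.
```

Calling the function again on the new weights models the iterated retraining: weights in the region of emphasis compound multiplicatively across restarts.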
Essentially, the second algorithm, which we term the deemphasis algorithm, is like the emphasis algorithm in that a series of classifiers is trained, but when training classifier n+1, less weight is placed on the examples whose correct classification is unnecessary to achieve the target CA and CR rates for classifier n. As with the emphasis algorithm, the retraining procedure can be iterated until no further performance improvements are obtained on the training set. Note that the set of examples given emphasis by the previous algorithm is not the complement of the set of examples deemphasized by the current algorithm; the algorithms are not identical.\n\nFIGURE 2. A schematic depiction of all training examples e1 through e13 arranged by the classifier's posterior (churn probability from 0 to 1), with the thresholds \u03b8_CA and \u03b8_CR marked. Each solid bar corresponds to a positive example (e.g., a churner) and each grey bar corresponds to a negative example (e.g., a nonchurner).\n\nIn our implementation, we assign a weight \u03bb_i^n to each example i for training classifier n. For classifier 1, \u03bb_i^1 = 1. For subsequent classifiers, \u03bb_i^{n+1} = \u03bb_i^n if example i is not in the region of deemphasis, or \u03bb_i^{n+1} = \u03ba_d \u03bb_i^n otherwise, where \u03ba_d is a constant, \u03ba_d < 1.\n\n2.3 CONSTRAINED OPTIMIZATION\n\nThe third algorithm we propose is formulated as maximizing the CR rate while maintaining the CA rate equal to \u03b1_CA. (We do not attempt to simultaneously maximize the CA rate while maintaining the CR rate equal to \u03b1_CR.) 
Gradient methods cannot be applied directly because the CA and CR rates are nondifferentiable, but we can approximate the CA and CR rates with smooth differentiable functions:\n\nCA(w, t) = (1/|P|) \u03a3_{i\u2208P} \u03c3(f(x_i, w) \u2212 t)    and    CR(w, t) = (1/|N|) \u03a3_{i\u2208N} \u03c3(t \u2212 f(x_i, w)),\n\nwhere P and N are the sets of positive and negative examples, respectively, f(x, w) is the model posterior for input x, w is the parameterization of the model, t is a threshold, and \u03c3 is a sigmoid function with scaling parameter \u03b2: \u03c3(y) = (1 + exp(\u2212\u03b2y))^\u22121. The larger \u03b2 is, the more nearly step-like the sigmoid is and the more nearly equal the approximations are to the model CR and CA rates. We consider the problem formulation in which CA is a constraint and CR is a figure of merit. We convert the constrained optimization problem into an unconstrained problem by the augmented Lagrangian method (Bertsekas, 1982), which involves iteratively maximizing an objective function\n\nA(w, t) = CR(w, t) + \u03bd (CA(w, t) \u2212 \u03b1_CA) \u2212 (\u03bc/2) (CA(w, t) \u2212 \u03b1_CA)^2\n\nwith a fixed Lagrangian multiplier, \u03bd, and then updating \u03bd following the optimization step: \u03bd \u2190 \u03bd \u2212 \u03bc (CA(w*, t*) \u2212 \u03b1_CA), where w* and t* are the values found by the optimization step. We initialize \u03bd = 1, fix \u03bc = 10, and iterate until \u03bd converges.\n\n2.4 GENETIC ALGORITHM\n\nThe fourth algorithm we explore is a steady-state genetic search over a space defined by the continuous parameters of a classifier (Whitley, 1989). The fitness of a classifier is the reciprocal of the number of training examples falling between the \u03b8_CA and \u03b8_CR thresholds. Much like the emphasis algorithm, this fitness function encourages the two thresholds to come together. 
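A minimal numerical sketch of the smoothed rates and the augmented-Lagrangian objective of Section 2.3 follows. The model outputs, threshold, and the values of \u03b2, \u03bd, and \u03bc below are illustrative assumptions; a real run would optimize w and t by gradient ascent on A between multiplier updates:

```python
import numpy as np

def sigmoid(y, beta=10.0):
    # Scaled sigmoid; larger beta makes it more step-like.
    return 1.0 / (1.0 + np.exp(-beta * y))

def smooth_ca(f_pos, t, beta=10.0):
    # Differentiable stand-in for the CA rate over the positive examples.
    return np.mean(sigmoid(f_pos - t, beta))

def smooth_cr(f_neg, t, beta=10.0):
    # Differentiable stand-in for the CR rate over the negative examples.
    return np.mean(sigmoid(t - f_neg, beta))

def augmented_lagrangian(f_pos, f_neg, t, alpha_ca, nu, mu):
    h = smooth_ca(f_pos, t) - alpha_ca          # constraint violation CA - alpha_CA
    return smooth_cr(f_neg, t) + nu * h - 0.5 * mu * h ** 2

# Hypothetical model posteriors on positive and negative examples.
f_pos = np.array([0.9, 0.7, 0.6, 0.2])
f_neg = np.array([0.8, 0.4, 0.3, 0.1])
A = augmented_lagrangian(f_pos, f_neg, t=0.5, alpha_ca=0.5, nu=1.0, mu=10.0)
```

The quadratic penalty pulls the smoothed CA rate toward \u03b1_CA while the CR term rewards correct rejections; the multiplier update then absorbs whatever constraint violation remains after each inner optimization.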
The genetic search permits direct optimization over a nondifferentiable criterion, and therefore seems sensible for the present task.\n\n3 METHODOLOGY\n\nFor our tests, we studied two large data bases made available to Athene by two telecommunications providers. Data set 1 had 50,000 subscribers described by 35 input features and a churn rate of 4.86%. Data set 2 had 169,727 subscribers described by 51 input features and a churn rate of 6.42%. For each data base, the features input to the classifier were obtained by proprietary transformations of the raw data (see Mozer et al., 2000). We chose these two large, real-world data sets because achieving gains with these data sets should be more difficult than with smaller, less noisy data sets. Further, with our real-world data, we can evaluate the cost savings achieved by an improvement in prediction accuracy. We performed 10-fold cross-validation on each data set, preserving the overall churn/nonchurn ratio in each split.\n\nIn all tests, we chose \u03b1_CA = 0.50 and \u03b1_CR = 0.90, values which, based on our past experience in this domain, are ambitious yet realizable targets for data sets such as these. We used a logistic regression model (i.e., a neural network with no hidden units) for our studies, believing that it would be more difficult to obtain improvements with such a model than with a more flexible multilayer perceptron. For the emphasis and deemphasis algorithms, models were trained to minimize mean-squared error on the training set. We chose \u03ba_e = 1.3 and \u03ba_d = .75 by quick exploration. 
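The fitness function of Section 2.4 reduces to a count. In the paper the thresholds are determined by the target rates for each candidate classifier; in this sketch they are fixed hypothetical values, as are the posteriors:

```python
def fitness(posteriors, theta_ca, theta_cr):
    """Reciprocal of the number of training examples whose posterior lies
    between theta_ca and theta_cr; fewer straddling examples means higher
    fitness, driving the two thresholds together."""
    between = sum(1 for p in posteriors if theta_ca < p < theta_cr)
    return 1.0 / between if between > 0 else float("inf")

# Hypothetical posteriors from one candidate classifier in the population.
posteriors = [0.1, 0.35, 0.45, 0.6, 0.75, 0.9]
f = fitness(posteriors, theta_ca=0.4, theta_cr=0.8)
# Three examples (0.45, 0.6, 0.75) lie between the thresholds, so f = 1/3.
```

A steady-state genetic search would evaluate this fitness for each parameter vector in the population and preferentially reproduce the high-fitness classifiers.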
Because the weightings are cumulative over training restarts, the choice of \u03ba is not critical for either algorithm; rather, the magnitude of \u03ba controls how many restarts are necessary to reach asymptotic performance, but the results we obtained were robust to the choice of \u03ba. The emphasis and deemphasis algorithms were run for 100 iterations, which was the number of iterations required to reach asymptotic performance on the training set.\n\n4 RESULTS\n\nFigure 3 illustrates training set performance for the emphasis algorithm on data set 1. The graph on the left shows the CA rate when the CR rate is .9, and the graph on the right shows the CR rate when the CA rate is .5. Clearly, the algorithm appears to be stable, and the ROC curve is improving in the region around (\u03b1_CA, \u03b1_CR).\n\nFigure 4 shows cross-validation performance on the two data sets for the four prodding algorithms as well as for a traditional least-squares training procedure. The emphasis and deemphasis algorithms yield reliable improvements in performance in the critical region of the ROC curve over the traditional training procedure. The constrained-optimization and genetic algorithms perform well on achieving a high CR rate for a fixed CA rate, but neither does as well on achieving a high CA rate for a fixed CR rate. For the constrained-optimization algorithm, this result is not surprising, as it was trained asymmetrically, with the CA rate as the constraint. 
However, for the genetic algorithm, we have little explanation for its poor performance, other than the difficulty faced in searching a continuous space without gradient information.\n\n5 DISCUSSION\n\nIn this paper, we have identified an interesting, novel problem in classifier design which is motivated by our domain of churn prediction and real-world business considerations. Rather than seeking a classifier that maximizes discriminability between two classes, as measured by the area under the ROC curve, we are concerned with optimizing performance at certain points along the ROC curve. We presented four alternative approaches to prodding the ROC curve, and found that all four have promise, depending on the specific goal.\n\nAlthough the magnitude of the gain is small (an increase of about .01 in the CR rate given a target CA rate of .50), the improvement results in significant dollar savings. Using a framework for evaluating dollar savings to a service provider, based on estimates of subscriber retention and costs of intervention obtained in real-world data collection (Mozer et al., 2000), we obtain a savings of $11 per churnable subscriber when the (CA, CR) rates go from (.50, .80) to (.50, .81), which amounts to an 8% increase in the profitability of the subscriber intervention effort.\n\nFIGURE 3. Training set performance for the emphasis algorithm on data set 1. (a) CA rate as a function of iteration for a CR rate of .9; (b) CR rate as a function of iteration for a CA rate of .5. Error bars indicate +/\u20131 standard error of the mean.\n\nFIGURE 4. Cross-validation performance on the two data sets (data set 1: ISP test set; data set 2: wireless test set) for the standard training procedure (STD), as well as the emphasis (EMPH), deemphasis (DEEMPH), constrained optimization (CONSTR), and genetic (GEN) algorithms. The left column shows the CA rate for a CR rate of .9; the right column shows the CR rate for a CA rate of .5. The error bar indicates one standard error of the mean over the 10 data splits.\n\nThese figures are clearly promising. However, based on the data sets we have studied, it is difficult to know whether another algorithm might exist that achieves even greater gains. Interestingly, all the algorithms we proposed yielded roughly the same gains when successful, suggesting that we may have milked the data for whatever gain could be had, given the model class evaluated. Our work clearly illustrates the difficulty of the problem, and we hope that others in the NIPS community will be motivated by the problem to suggest even more powerful, theoretically grounded approaches.\n\n6 ACKNOWLEDGEMENTS\n\nNo white males were angered in the course of conducting this research. 
We thank Lian Yan and David Grimes for comments and assistance on this research. This research was supported in part by McDonnell-Pew grant 97-18, NSF award IBN-9873492, and NIH/IFOPAL R01 MH61549\u201301A1.\n\n7 REFERENCES\n\nBertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods. NY: Academic.\n\nChang, E. I., & Lippmann, R. P. (1994). Figure of merit training for detection and spotting. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems 6 (pp. 1019\u20131026). San Mateo, CA: Morgan Kaufmann.\n\nFrederick, E. D., & Floyd, C. E. (1998). Analysis of mammographic findings and patient history data with genetic algorithms for the prediction of breast cancer biopsy outcome. Proceedings of the SPIE, 3338, 241\u2013245.\n\nGreen, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.\n\nMozer, M. C., Wolniewicz, R., Grimes, D., Johnson, E., & Kaushansky, H. (2000). Maximizing revenue by predicting and addressing customer dissatisfaction. IEEE Transactions on Neural Networks, 11, 690\u2013696.\n\nWhitley, D. (1989). The GENITOR algorithm and selective pressure: Why rank-based allocation of reproductive trials is best. In D. Schaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms (pp. 116\u2013121). San Mateo, CA: Morgan Kaufmann.\n", "award": [], "sourceid": 2095, "authors": [{"given_name": "Michael", "family_name": "Mozer", "institution": null}, {"given_name": "Robert", "family_name": "Dodier", "institution": null}, {"given_name": "Michael", "family_name": "Colagrosso", "institution": null}, {"given_name": "Cesar", "family_name": "Guerra-Salcedo", "institution": null}, {"given_name": "Richard", "family_name": "Wolniewicz", "institution": null}]}