{"title": "Consistent Classification, Firm and Soft", "book": "Advances in Neural Information Processing Systems", "page_first": 326, "page_last": 332, "abstract": null, "full_text": "Consistent Classification, Firm and Soft \n\nYoram Baram* \n\nDepartment of Computer Science \n\nTechnion, Israel Institute of Technology \n\nHaifa 32000, Israel \n\nbaram@cs.technion.ac.il \n\nAbstract \n\nA classifier is called consistent with respect to a given set of class-labeled points if it correctly classifies the set. We consider classifiers defined by unions of local separators and propose algorithms for consistent classifier reduction. The expected complexities of the proposed algorithms are derived along with the expected classifier sizes. In particular, the proposed approach yields a consistent reduction of the nearest neighbor classifier, which performs \"firm\" classification, assigning each new object to a class, regardless of the data structure. The proposed reduction method suggests a notion of \"soft\" classification, allowing for indecision with respect to objects which are insufficiently or ambiguously supported by the data. The performances of the proposed classifiers in predicting stock behavior are compared to that achieved by the nearest neighbor method. \n\n1 Introduction \n\nCertain classification problems, such as recognizing the digits of a handwritten zip code, require the assignment of each object to a class. Others, involving relatively small amounts of data and high risk, call for indecision until more data become available. Examples in such areas as medical diagnosis, stock trading and radar detection are well known. The training data for the classifier in both cases will correspond to firmly labeled members of the competing classes. (A patient may be \n\n*Presently a Senior Research Associate of the National Research Council at M. S.
210-9, NASA Ames Research Center, Moffett Field, CA 94035, on sabbatical leave from the Technion. \n\neither ill or healthy. A stock price may increase, decrease or stay the same). Yet, the classification of new objects need not be firm. (A given patient may be kept in hospital for further observation. A given stock need not be bought or sold every day). We call classification of the first kind \"firm\" and classification of the second kind \"soft\". The latter is not the same as training the classifier with a \"don't care\" option, which would be just another firm labeling option, like \"yes\" and \"no\", and would require firm classification. A classifier that correctly classifies the training data is called \"consistent\". Consistent classifier reductions have been considered in the contexts of the nearest neighbor criterion (Hart, 1968) and decision trees (Holte, 1993, Webb, 1996). \n\nIn this paper we present a geometric approach to consistent firm and soft classification. The classifiers are based on unions of local separators, which cover all the labeled points of a given class and separate them from the others. We propose a consistent reduction of the nearest neighbor classifier and derive its expected design complexity and the expected classifier size. The nearest neighbor classifier and its consistent derivatives perform \"firm\" classification. Soft classification is performed by unions of maximal-volume spherical local separators. A domain of indecision is created near the boundary between the two sets of class-labeled points, and in regions where there is no data. We propose an economically motivated benefit function for a classifier as the difference between the probabilities of success and failure.
\nEmploying the respective benefit functions, the advantage of soft classification over firm classification is shown to depend on the rate of indecision. The performances of the proposed algorithms in predicting stock behavior are compared to those of the nearest neighbor method. \n\n2 Consistent Firm Classification \n\nConsider a finite set of points X = {x(i), i = 1, ..., N} in some subset of R^n, the real space of dimension n. Suppose that each point of X is assigned to one of two classes, and let the corresponding subsets of X, having N1 and N2 points, respectively, be denoted X1 and X2. We shall say that the two sets are labeled L1 and L2, respectively. It is desired to divide R^n into labeled regions, so that new, unlabeled points can be assigned to one of the two classes. \n\nWe define a local separator of a point x of X1 with respect to X2 as a convex set, s(x|2), which contains x and no point of X2. A separator family is defined as a rule that produces local separators for class-labeled points. \n\nWe call the set of those points of R^n that are closer to a point x in X1 than to any point of X2 the minimum-distance local separator of x with respect to X2. \n\nWe define the local clustering degree, c, of the data as the expected fraction of data points that are covered by a local minimum-distance separator. \n\nThe nearest neighbor criterion extends the class assignment of a point x in X1 to its minimum-distance local separator. It is clearly a consistent and firm classifier whose memory size is O(N). \n\nHart's Condensed Nearest Neighbor (CNN) classifier (Hart, 1968) is a consistent subset of the data points that correctly classifies the entire data by the nearest neighbor method. It is not difficult to show that the complexity of the algorithm proposed by Hart for finding such a subset is O(N^3).
The expected memory requirement (or classifier size) has remained an open question. \n\nWe propose the following Reduced Nearest Neighbor (RNN) classifier: include a labeled point in the consistent subset only if it is not covered by the minimum-distance local separator of any of the points of the same class already in the subset. \n\nIt can be shown (Baram, 1996) that the complexity of the RNN algorithm is O(N^2), and that the expected classifier size is O(log_{1/(1-c)} N). It can also be shown that the latter bounds the expected size of the CNN classifier as well. \n\nIt has been suggested that the utility of Occam's razor in classification would be (Webb, 1996): \n\n\"Given a choice between two plausible classifiers that perform identically on the data set, the simpler classifier is expected to classify correctly more objects outside the training set.\" \n\nThe above statement is disproved by the CNN and the RNN classifiers, which are strict consistent reductions of the nearest neighbor classifier, likely to produce more errors. \n\n3 Soft Classification: Indecision Pays, Sometimes \n\nWhen a new, unlabeled point is closely surrounded by many points of the same class, its assignment to the same class can be said to be unambiguously supported by the data. When a new point is surrounded by points of different classes, or when it is relatively far from any of the labeled points, its assignment to either class can be said to be unsupported or ambiguously supported by the data. In the latter cases, it may be more desirable to have a certain indecision domain, where new points will not be assigned to a class. This will translate into the creation of indecision domains near the boundary between the two sets of labeled points and where there is no data. \n\nWe define a separator S(1|2) of X1 with respect to X2 as a set that includes X1 and excludes X2.
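\n\nThe RNN reduction of Section 2 admits a direct implementation. The following is a minimal sketch, not the author's code; the function name rnn_reduce and the array-based data representation are assumptions. It uses the fact that a point is covered by the minimum-distance local separator of a same-class point already in the subset exactly when it lies closer to that point than to every point of the opposite class.

```python
import numpy as np

def rnn_reduce(X, y):
    """Sketch of the Reduced Nearest Neighbor (RNN) selection:
    keep a labeled point only if it is not covered by the
    minimum-distance local separator of a same-class point
    already in the subset. Complexity is O(N^2)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = []  # indices of the consistent subset
    for i in range(len(X)):
        # coverage threshold: distance from point i to its
        # nearest opposite-class point
        r_opp = np.linalg.norm(X[y != y[i]] - X[i], axis=1).min()
        covered = any(
            y[j] == y[i] and np.linalg.norm(X[j] - X[i]) < r_opp
            for j in keep
        )
        if not covered:
            keep.append(i)
    return keep
```

On two well-separated clusters, e.g. rnn_reduce([[0, 0], [0.1, 0], [5, 0], [5.1, 0]], [0, 0, 1, 1]), only the first point of each class is retained, and the retained pair still classifies all four points correctly by the nearest neighbor rule.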
\n\nGiven a separator family, the union of local separators s(x(i)|2) of the points x(i), i = 1, ..., N1, of X1 with respect to X2, \n\nS(1|2) = ∪_{i=1}^{N1} s(x(i)|2)   (1) \n\nis a separator of X1 with respect to X2. It consists of N1 local separators. \n\nLet X1,c be a subset of X1. The set \n\nS_c(1|2) = ∪_{x ∈ X1,c} s(x|2)   (2) \n\nwill be called a consistent separator of X1 with respect to X2 if it contains all the points of X1. The set X1,c will then be called a consistent subset with respect to the given separator family. \n\nLet us extend the class assignment of each of the labeled points to a local separator of a given family and maximize the volume of each of the local separators without including in it any point of the competing class. Let S_c(1|2) and S_c(2|1) be consistent separators of the two sets, consisting of maximal-volume (or, simply, maximal) local separators of labeled points of the corresponding classes. The intersection of S_c(1|2) and S_c(2|1) defines a conflict and will be called a domain of ambiguity of the first kind. A region uncovered by either separator will be called a domain of ambiguity of the second kind. The union of the domains of ambiguity will be designated the domain of indecision. The remainders of the two separators, excluding their intersection, define the conflict-free domains assigned to the two classes. \n\nThe resulting \"soft\" classifier rules out hard conflicts, where labeled points of one class are included in the separator of the other. Yet, it allows for indecision in areas which are either claimed by both separators or claimed by neither. \n\nLet the true class be denoted y (with possible values, e.g., y = 1 or y = 2) and let the classification outcome be denoted ŷ.
Let the probabilities of decision and indecision by the soft classifier be denoted P_d and P_id, respectively (of course, P_id = 1 - P_d), and let the probabilities of correct and incorrect decisions by the firm and the soft classifiers be denoted P_firm{ŷ = y}, P_firm{ŷ ≠ y}, P_soft{ŷ = y} and P_soft{ŷ ≠ y}, respectively. Finally, let the joint probabilities of a decision being made by the soft classifier and the correctness or incorrectness of the decision be denoted, respectively, P_soft{d, ŷ = y} and P_soft{d, ŷ ≠ y}, and let the corresponding conditional probabilities be denoted P_soft{ŷ = y | d} and P_soft{ŷ ≠ y | d}, respectively. \n\nWe define the benefit of using the firm classifier as the difference between the probability that a point is classified correctly by the classifier and the probability that it is misclassified: \n\nB_firm = P_firm{ŷ = y} - P_firm{ŷ ≠ y}   (3) \n\nThis definition is motivated by economic considerations: the profit produced by an investment will be, on average, proportional to the benefit function. This will become more evident in a later section, where we consider the problem of stock trading. \n\nFor a soft classifier, we similarly define the benefit as the difference between the probability of a correct classification and that of an incorrect one (which, in an economic context, assumes that indecision has no cost, other than the possible loss of profit). Now, however, these probabilities are for the joint events that a classification is made, and that the outcome is correct or incorrect, respectively: \n\nB_soft = P_soft{d, ŷ = y} - P_soft{d, ŷ ≠ y}   (4) \n\nSoft classification will be more beneficial than firm classification if B_soft > B_firm, which may be written as \n\nP_id < 1 - (P_firm{ŷ = y} - 0.5) / (P_soft{ŷ = y | d} - 0.5)   (5) \n\nFor the latter to be a useful condition, it is necessary that P_firm{ŷ = y} > 0.5, P_soft{ŷ = y | d} > 0.5 and P_soft{ŷ = y | d} > P_firm{ŷ = y}. The latter will
The latter will \nbe normally satisfied, since points of the same class can be expected to be denser \nunder the corresponding separator than in the indecision domain. In other words, \n\n\f330 \n\nY. Baram \n\nthe error ratio produced by the soft classifier on the decided cases can be expected \nto be smaller than the error ratio produced by the firm classifier, which decides on \nall the cases. The satisfaction of condition (5) would depend on the geometry of \nthe data. It will be satisfied for certain cases, and will not be satisfied for others. \nThis will be numerically demonstrated for the stock trading problem. \n\nThe maximal local spherical separator of x is defined by the open sphere centered \nat x, whose radius r(xI2) is the distance between x and the point of X2 nearest to \nx. Denoting by s(x, r) the sphere of radius r in Rn centered at x, the maximal local \nseparator is then sM(xI2) = s(x, r(xI2)). \n\nA separator construction algorithm employing maximal local spherical separators \nis described below. Its complexity is clearly O(N2). \nLet Xl = Xl. For each of the points xci) of Xl, find the minimal distance to \nthe points of X 2 \u2022 Call it r(x(i) 12). Select the point x(i) for which r(x(i) 12) 2: \nr(x(j) 12), j f: i, for the consistent subset. Eliminate from Xl all the points that \nare covered by SM(X(i) 12). Denote the remaining set Xl. Repeat the procedure \nwhile Xl is non-empty. The union of the maximal local spherical separators is a \nseparator for Xl with respect to X 2 . \n\n4 Example: Firm and soft prediction of stock behaviour \n\nGiven a sequence of k daily trading (\"close\") values of a stock, it is desired to predict \nwhether the next day will show an increase or a decrease with respect to the last \nday in the sequence. Records for ten different stocks, each containing, on average, \n1260 daily values were used. About 60 percent of the data were used for training \nand the rest for testing. 
The CNN algorithm reduced the data by 40%, while the RNN algorithm reduced the data by 35%. Results are shown in Fig. 1. It can be seen that, on average, the nearest neighbor method has produced the best results. The performances of the CNN and the RNN classifiers (the latter producing only slightly better results) are somewhat lower. It has been argued that performance within a couple of percentage points by a reduced classifier supports the utility of Occam's razor (Holte, 1993). However, a couple of percentage points can be quite meaningful in stock trading. \n\nIn order to evaluate the utility of soft classification in stock trading, let the prediction success rate of a firm classifier be denoted f, and that of a soft classifier for the decided cases s. For a given trade, let the gain or loss per unit invested be denoted q, and the rate of indecision of the soft classifier ir. Suppose that, employing the firm classifier, a stock is traded once every day (say, at the \"close\" value), and that, employing the soft classifier, it is traded on a given day only if a trade is decided by the classifier (that is, the input does not fall in the indecision domain). The expected profit for M days per unit invested is 2(f - 0.5)qM for the firm classifier and 2(s - 0.5)q(1 - ir)M for the soft classifier (these values disregard possible commission and slippage costs). The soft classifier will be preferred over the firm one if the latter quantity is greater than the former, that is, if \n\nir < 1 - (f - 0.5) / (s - 0.5)   (6) \n\nwhich is the sample representation of condition (5) for the stock trading problem.
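\n\nCondition (6) reduces to a one-line comparison; a minimal sketch follows, where the helper name soft_preferred is an assumption for illustration.

```python
def soft_preferred(f, s, ir):
    """Condition (6): soft trading is preferred over firm trading
    when the indecision rate satisfies ir < 1 - (f - 0.5)/(s - 0.5),
    for firm success rate f and soft success rate s on the decided
    cases, both above 0.5."""
    if not (f > 0.5 and s > 0.5):
        raise ValueError("condition (6) assumes f > 0.5 and s > 0.5")
    return ir < 1.0 - (f - 0.5) / (s - 0.5)
```

For example, with f = 0.55 and s = 0.70 the soft classifier tolerates indecision rates up to 75%, whereas with f = 0.60 and s = 0.65 it tolerates only up to one third.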
\n\nStock      NN      CNN     RNN     Soft classifier (indecision / success / benefit) \nxaip       61.4%   58.4%   57.0%   70.1% / 48.4% / - \nxaldnf     51.7%   52.3%   50.1%   74.6% / 44.3% / - \nxddddf     52.0%   49.6%   51.8%   51.3% / 53.3% / - \nxdssi      48.3%   47.7%   48.6%   43.6% / 52.6% / + \nxecilf     53.0%   50.9%   52.6%   47.6% / 48.8% / - \nxedusf     80.7%   74.7%   76.3%   30.6% / 89.9% / - \nxelbtf     53.7%   55.6%   52.5%   42.2% / 50.1% / - \nxetz       66.0%   61.0%   61.0%   43.8% / 68.6% / - \nxelrnf     51.5%   49.0%   49.2%   39.2% / 56.0% / + \nxelt       85.6%   82.7%   84.2%   32.9% / 93.0% / - \nAverage    60.4%   58.2%   58.3%   44.7% / 63.4% \n\nFigure 1: Success rates in the prediction of rise and fall in stock values. \n\nResults for the soft classifier, applied to the stock data, are presented in Fig. 1. The indecision rates and the success rates in the decided cases are specified along with a benefit sign. A positive benefit represents a satisfaction of condition (6), with ir, f and s replaced by the corresponding sample values given in the table, and indicates a higher profit in applying the soft classifier over the application of the nearest neighbor classifier. A negative benefit indicates that a higher profit is produced by the nearest neighbor classifier. It can be seen that for two of the stocks (xdssi and xelrnf) soft classification has produced better results than firm classification, and for the remaining eight stocks firm classification by the nearest neighbor method has produced better results. \n\n5 Conclusion \n\nSolutions to the consistent classification problem have been specified in terms of local separators of data points of one class with respect to the other.
The expected complexities of the proposed algorithms have been specified, along with the expected sizes of the resulting classifiers. Reduced consistent versions of the nearest neighbor classifier have been specified and their expected complexities have been derived. A notion of \"soft\" classification has been introduced, and an algorithm for its implementation has been presented and analyzed. A criterion for the utility of such classification has been presented and its application to stock trading has been demonstrated. \n\nAcknowledgment \n\nThe author thanks Dr. Amir Atiya of Cairo University for providing the stock data used in the examples and for valuable discussions of the corresponding results. \n\nReferences \n\nBaram, Y. (1996) Consistent Classification, Firm and Soft, CIS Report No. 9627, Center for Intelligent Systems, Technion, Israel Institute of Technology, Haifa 32000, Israel. \n\nBaum, E. B. (1988) On the Capabilities of Multilayer Perceptrons, J. Complexity, Vol. 4, pp. 193-215. \n\nHart, P. E. (1968) The Condensed Nearest Neighbor Rule, IEEE Trans. on Information Theory, Vol. IT-14, No. 3, pp. 515-516. \n\nHolte, R. C. (1993) Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, Machine Learning, Vol. 11, No. 1, pp. 63-90. \n\nRosenblatt, F. (1958) The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Psychological Review, Vol. 65, pp. 386-408. \n\nWebb, G. I. (1996) Further Experimental Evidence against the Utility of Occam's Razor, J. of Artificial Intelligence Research, Vol. 4, pp. 397-417. \n", "award": [], "sourceid": 1311, "authors": [{"given_name": "Yoram", "family_name": "Baram", "institution": null}]}