{"title": "Direct Classification with Indirect Data", "book": "Advances in Neural Information Processing Systems", "page_first": 381, "page_last": 387, "abstract": null, "full_text": "Direct Classification with Indirect Data \n\nTimothy X Brown \n\nInterdisciplinary Telecommunications Program \nDept. of Electrical and Computer Engineering \nUniversity of Colorado, Boulder, 80309-0530 \n\ntimxb~colorado.edu \n\nAbstract \n\nWe classify an input space according to the outputs of a real-valued \nfunction. The function is not given, but rather examples of the \nfunction. We contribute a consistent classifier that avoids the un(cid:173)\nnecessary complexity of estimating the function. \n\n1 \n\nIntroduction \n\nIn this paper, we consider a learning problem that combines elements of regression \nand classification. Suppose there exists an unknown real-valued property of the \nfeature space, p(\u00a2), that maps from the feature space, \u00a2 ERn, to R. The property \nfunction and a positive set A c R, define the desired classifier as follows: \n\nC*(\u00a2) = { ~~ if p(\u00a2) E A \notherwise \n\n(1) \n\nThough p(\u00a2) is unknown, measurements, p\" associated with p(\u00a2) at different fea(cid:173)\ntures, \u00a2, are available in a data set X = {(\u00a2i,P,i)} of size IXI = N. Each sample \nis i.i.d. with unknown distribution f(\u00a2,p,). This data is indirect in that p, may be \nan input to a sufficient statistic for estimating p( \u00a2) but in itself does not directly \nindicate C*(\u00a2) in (1). Figure 1 gives a schematic of the problem. \nLet Cx(\u00a2) be a decision function mapping from Rn to {-I, I} that is estimated \nfrom the data X. The estimator, Cx(\u00a2) is consistent if, \n\nlim P{Cx (\u00a2) i- C*(\u00a2)} = O. \nIXI-+oo \n\n(2) \n\nwhere the probabilities are taken over the distribution f. \nThis problem arises in controlling data networks that provide quality of service \nguarantees such as a maximum packet loss rate [1]-[8]. 
A data network occasionally drops packets due to congestion. The loss rate depends on the traffic carried by the network (i.e., the network state). The network cannot measure the loss rate directly, but can collect data on the observed number of packets sent and lost at different network states. Thus, the feature space, φ, is the network state; the property function, p(φ), is the underlying loss rate; the measurements, μ, are the observed

Figure 1: The classification problem. The classifier indicates whether an unknown function, p(φ), is within a set of interest, A. The learner is only given the data "x".

packet losses; the positive set, A, is the set of loss rates less than the maximum loss rate; and the distribution, f, follows from the arrival and departure processes of the traffic sources. In words, this application seeks a consistent estimator of when the network can and cannot meet the packet loss rate guarantee based on observations of the network losses. Over time, the network can automatically collect a large set of observations, so that consistency guarantees the classifier will be accurate.

Previous authors have approached this problem. In [6, 7], the authors estimate the property function from X as p̂(φ) and then classify via

C(φ) = +1 if p̂(φ) ∈ A, and −1 otherwise.    (3)

The approach suffers two related disadvantages. First, an accurate estimate of the property function may require many more parameters than the corresponding classifier, in which only the decision boundary is important. Second, the regression requires many samples over the entire range of φ to be accurate, while the fewer parameters in the classifier may require fewer samples for the same accuracy. 
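As a concrete illustration of this first approach (a sketch, not from the paper), one can fit a regression to the indirect samples and then threshold the fitted estimate as in (3); the property function, noise level, and polynomial degree below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical property function and positive set A = (-inf, tau);
# these choices are illustrative only.
p = np.sin
tau = 0.5

# Indirect data: noisy, unbiased measurements mu_i of p(phi_i).
phi = rng.uniform(0.0, np.pi, size=2000)
mu = p(phi) + rng.normal(0.0, 0.3, size=2000)

# Approach (3): first estimate p with a regression (here a degree-5
# polynomial fit), then classify by thresholding the estimate.
coeffs = np.polyfit(phi, mu, deg=5)
p_hat = np.polyval(coeffs, phi)
C = np.where(p_hat < tau, 1, -1)   # +1 if the estimate lies in A

# Accuracy against the true classifier C*(phi) = sign(tau - p(phi)).
C_star = np.where(p(phi) < tau, 1, -1)
acc = float((C == C_star).mean())
print(acc)  # high; only a narrow band near the boundary is misclassified
```

Note that the polynomial carries six parameters even though the decision only needs the boundary locations, which is the first disadvantage above.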
\nA second approach, used in [4, 5, 8], makes a single sample estimate, p(\u00a2i) from fJ-i \nand estimates the desired output class as \n\nif p(\u00a2i) E A \notherwise. \n\n(4) \n\nThis forms a training set Y = {\u00a2i' oil for standard classification. This was shown \nto lead to an inconsistent estimator in the data network application in [1]. \nThis paper builds on earlier results by the author specific to the packet network \nproblem [1, 2, 3] and defines a general framework for mapping the indirect data \ninto a standard supervised learning task. It defines conditions on the training set, \nclassifier, and learning objective to yield consistency. The paper defines specific \nmethods based on these results and provides examples of their application. \n\n2 Estimator at a Single Feature \n\nIn this section, we consider a single feature vector \u00a2 and imagine that we can collect \nas much monitoring data as we like at \u00a2. We show that a consistent estimator of the \nproperty function, p(\u00a2), yields a consistent estimator of the optimal classification, \nC*(\u00a2), without directly estimating the property function. These results are a basis \nfor the next section where we develop a consistent classifier over the entire feature \nspace even if every \u00a2i in the data set is distinct. \n\n\fGiven the data set X = {\u00a2, ltd, we hypothesize that there is a mapping from data \nset to training set Y = {\u00a2, Wi, od such that IXI = IYI and \n\nIXI \n\nCx(\u00a2) = sign(L WiOi) \n\ni=1 \n\n(5) \n\nis consistent in the sense of (2). The Wi and 0i are both functions of /ti, but for \nsimplicity we will not explicitly denote this. \n\nDo any mappings from X to Y yield consistent estimators of the form (5)? We \nconsider only thresholds on p(\u00a2). That is, sets A in the form A = [-00,7) (or \nsimilarly A = (7,00]) for some threshold 7. 
Since most practical sets can be formed from finite unions, intersections, and complements of sets in this form, this is sufficient.

Consider an estimator p̂_X that has the form

p̂_X(φ) = Σ_{i=1}^{|X|} β(μ_i) / Σ_{i=1}^{|X|} α(μ_i)    (6)

for some function α > 0 and estimator β. Suppose that p̂_X is a consistent estimator of p(φ), i.e., for every ε > 0:

lim_{|X|→∞} P{|p̂_X − p(φ)| > ε} = 0.    (7)

For threshold sets such as A = [−∞, τ), we can use (6) to construct the classifier:

C_X(φ) = sign(τ − p̂_X(φ)) = sign(Σ_{i=1}^{|X|} (α(μ_i)τ − β(μ_i))) = sign(Σ_{i=1}^{|X|} w_i o_i)    (8)

where

w_i = |α(μ_i)τ − β(μ_i)|    (9)
o_i = sign(α(μ_i)τ − β(μ_i)).    (10)

If |τ − p(φ)| = ε, then the above estimator can be incorrect only if |p̂_X − p(φ)| > ε. The consistency in (7) guarantees that (8)-(10) is consistent if ε > 0.

The simplest example of (6) is when μ_i is a noisy unbiased sample of p(φ_i). The natural estimator is just the average of all the μ_i, i.e., α(μ_i) = 1 and β(μ_i) = μ_i. In this case, w_i = |τ − μ_i| and o_i = sign(τ − μ_i). A less trivial example will be given later in the application section of the paper.

We now describe a range of objective functions for evaluating a classifier C(φ; θ) parameterized by θ and show a correspondence between the objective minimum and (5). Consider the class of weighted L-norm objective functions (L > 0):

J(X, θ) = (Σ_{i=1}^{|X|} w_i |C(φ_i; θ) − o_i|^L)^{1/L}.    (11)

Let the θ that minimizes this be denoted θ(X). Let

C_X(φ) = C(φ; θ(X)).    (12)

For a single φ, C(φ; θ) is a constant +1 or −1. We can simply try each value and see which is the minimum to find C_X(φ). This is carried out in [3] where we show:

Theorem 1 When C(φ; θ) is a constant over X, then the C_X(φ) defined by (11) and (12) is equal to the C_X(φ) defined by (5).

The definition in (5) is independent of L. 
So, we can choose any L-norm as convenient without changing the solution. This follows since (11) is essentially a weighted count of the errors; the L-norm has no significant effect.

This section has shown how regression estimators such as (6) can be mapped via (9) and (10) and the objective (11) to a consistent classifier at a single feature. The next section considers general classifiers.

3 Classification over All Features

This section addresses the question of whether there exists any general approach to supervised learning that leads to a consistent estimator across the feature space. Several considerations are important. First, not all feature vectors, φ, are relevant. Some φ may have zero probability associated with them under the distribution f(φ, μ). Such φ we denote as unsupported. The optimal and learned classifiers can differ on unsupported feature vectors without affecting consistency. Second, the classifier function C(φ; θ) may not be able to represent the consistent estimator. For instance, a linear classifier may never yield a consistent estimator if the decision boundary of the optimal classifier, C*(φ), is non-linear. Classifier functions that can represent the optimal classifier for all supported feature vectors we denote as representative. Third, the optimal classifier is discontinuous at the decision boundary. A classifier that considers any small region around a feature on the decision boundary will have both positive and negative samples. In general, the resulting classifier could be +1 or −1 without regard to the underlying optimal classifier at these points, and consistency cannot be guaranteed. These considerations are made more precise in Appendix A. 
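Before moving to the full feature space, the single-feature machinery of Section 2 is easy to make concrete. For the simplest case, where μ_i is a noisy unbiased sample of p(φ) (so α(μ_i) = 1 and β(μ_i) = μ_i), the weighted vote (5) with weights (9) and outputs (10) reduces to a few lines; the numbers below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Single feature: p(phi) = 0.3 (hypothetical), positive set A = (-inf, tau).
tau = 0.5
mu = 0.3 + rng.normal(0.0, 0.2, size=500)  # noisy unbiased samples of p(phi)

# Weights (9) and outputs (10) with alpha(mu_i) = 1, beta(mu_i) = mu_i:
w = np.abs(tau - mu)
o = np.sign(tau - mu)

# Classifier (5): sign of the weighted vote.  Since w_i * o_i = tau - mu_i,
# this equals sign(tau - mean(mu)), i.e. thresholding the estimator (6).
C = np.sign(np.sum(w * o))
print(C)  # 1.0: p(phi) = 0.3 < tau, so phi is classified positive
```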
Taking these considerations into account and defining w_i and o_i as in (9) and (10), we get the following theorem:

Theorem 2 If the classifier (5) is a consistent estimator for every supported non-boundary φ, and C(φ; θ) is representative, then the θ(X) that minimizes (11) yields a consistent classifier over all supported φ not on the decision boundary.

Theorem 2 tells us that we can get consistency across the feature space. This result is proved in Appendix A.

4 Application

This section provides an application of the results to better illustrate the methodology. For brevity, we include only a simple stylized example (see [3] for a more realistic application). We describe first how the data is created, then the form of the consistent estimator, and then the actual application of the learning method.

The feature space is one-dimensional with φ uniformly distributed in (3, 9). The underlying property function is p(φ) = 10^−φ. The measurement data is generated as follows. For a given φ_i, s_i is the number of successes in T_i = 10^5 Bernoulli trials with success probability p(φ_i). The monitoring data is thus μ_i = (s_i, T_i). The positive set is A = (0, τ) with τ = 10^−6, and |X| = 1000 samples.

As described in Section 1, this kind of data appears in packet networks where the underlying packet loss rate is unknown and the only monitoring data is the number of packets dropped out of T_i trials. The Bernoulli trial successes correspond to dropped packets. The feature vector represents data collected concurrently that indicates the network state. Thus the classifier can decide when the network will and will not meet a packet loss rate guarantee. 
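This experiment can be simulated end to end. Following (13), the consistent estimator corresponds to α(μ_i) = T_i and β(μ_i) = s_i in (6), so (9) and (10) give w_i = |T_i τ − s_i| and o_i = sign(T_i τ − s_i); the classifier is the feature threshold minimizing the weighted error (11). The sketch below uses an exhaustive threshold search, which is an assumption of mine; the paper does not specify the optimizer:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data as in Section 4: phi ~ U(3, 9), p(phi) = 10**-phi,
# s_i successes in T = 1e5 Bernoulli trials, positive set A = (0, tau).
N, T, tau = 1000, 10**5, 1e-6
phi = rng.uniform(3.0, 9.0, size=N)
s = rng.binomial(T, 10.0 ** -phi)

# Weights (9) and outputs (10) with alpha(mu_i) = T_i, beta(mu_i) = s_i,
# matching the consistent estimator (13).
w = np.abs(T * tau - s)
o = np.sign(T * tau - s)

def best_threshold(w, o, phi):
    """Exhaustive search for the feature threshold t minimizing the
    weighted error count (11): predict +1 for phi > t, else -1
    (p(phi) is decreasing, so large phi means a low loss rate)."""
    best_err, best_t = np.inf, phi.min()
    for t in np.sort(phi):
        pred = np.where(phi > t, 1.0, -1.0)
        err = np.sum(w * (pred != o))
        if err < best_err:
            best_err, best_t = err, t
    return best_t

t_consistent = best_threshold(w, o, phi)

# Sample-based comparison, as in (4): unit weights, labels from the
# single-sample estimate s_i / T.
o_sb = np.sign(tau - s / T)
t_sample = best_threshold(np.ones(N), o_sb, phi)

# The true decision boundary is phi = 6, where p(phi) = tau.
print(t_consistent, t_sample)
```

With these settings the sample-based threshold lands well below φ = 6, mirroring the factor-of-7 miss in loss rate reported in the text.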
Figure 2: Monitoring data, true property function, and learned classifiers in the loss-rate classification application. The monitoring data is shown as sample loss rate as a function of the feature vector. Sample loss rates of zero are arbitrarily set to 10^−9 for display purposes. The true loss rate is the underlying property function. The consistent and sample-based classifier results are shown as a range of thresholds on the feature. An x and y error range is plotted as a box. The x error range is the 10th and 90th percentile of 1000 experiments. This is mapped via the underlying property function to a y error range. The consistent classifier finds thresholds around the true value. The sample-based is off by a factor of 7.

Figure 2 shows a sample of data. A consistent estimator in the form of (6) is:

p̂_X = Σ_i s_i / Σ_i T_i.    (13)

Defining w_i and o_i as in (9) and (10), the classifier for our data set is the threshold on the feature space that minimizes (11). This classifier is representative since p(φ) is monotonic.

The results are shown in Figure 2 and labeled "consistent". This paper's methods find a threshold on the feature that closely corresponds to the τ = 10^−6 threshold. As a comparison, we also include a classifier that uses w_i = 1 for all i and sets o_i by thresholding the single-sample estimate, p̂(φ_i) = s_i/T_i, as in (4). The results are labeled "sample-based". This method misses the desired threshold by a factor of 7.

This application shows the features of the paper's methods. The classifier is a simple threshold with one parameter. Estimating p(φ) to derive a classifier required tens of parameters in [6, 7]. 
The results are consistent, unlike the approaches in [4, 5, 8].

5 Conclusion

This paper has shown that using indirect data we can define a classifier that directly uses the data without any intermediate estimate of the underlying property function. The classifier is consistent and yields a simpler learning problem. The approach was demonstrated on a problem from telecommunications. Practical details such as choosing the form of the parametric classifier, C(φ; θ), or how to find the global minimum of the objective function (11) are outside the scope of this paper.

Figure 3: A classifier C(φ; θ) and the optimal classifier C*(φ) in a two-dimensional feature space. Away from the decision boundaries, both classifiers agree except on the regions of false positives and false negatives.

Appendix A

A feature vector φ is supported by the distribution f if every neighborhood around φ has positive probability.

A feature vector is on the decision boundary if in every neighborhood around φ there exist supported φ', φ'' such that C*(φ') ≠ C*(φ'').

A classifier function, C(φ; θ), is representative if there exists a θ* such that C(φ; θ*) = C*(φ) for all supported, non-boundary φ.

Parameters θ and θ' are equivalent if for all supported, non-boundary φ, C(φ; θ) = C(φ; θ').

Given a θ, it is either equivalent to θ* or there are supported, non-boundary φ where C(φ; θ) is not equal to the optimal classifier, as in Figure 3. We will show that for any θ not equivalent to θ*,

lim_{|X|→∞} P{J(X, θ) ≤ J(X, θ*)} = 0.    (14)

In other words, such a θ cannot be the minimum of the objective in (11), and so only a θ equivalent to θ* is a possible minimum.

To prove Theorem 2, we need to introduce a further condition. An estimator of the form (5) has uniformly bounded variance if Var(w_i) < B for some fixed B < ∞ for all φ.

Let e(φ) = E[w_i o_i], where the expectation is over f(μ | φ).    (15)

Without loss of generality, consider the set Φ of supported, non-boundary false negatives, where C*(φ) = +1 but C(φ; θ) = −1. Let χ be the probability measure of Φ. 
Define the set Φ_ε = {φ : φ ∈ Φ and e(φ) > ε} with probability measure χ_ε. Since sign(e(φ)) = C*(φ) for supported, non-boundary φ, we can choose ε > 0 so that χ_ε > 0.

With θ*, C(φ; θ*) = +1 for all φ ∈ Φ. Since the minimum of a constant objective function satisfies (5), we would incorrectly choose θ if

lim_{|X|→∞} Σ_{i=1}^{|X|} w_i o_i < 0.

For the false negatives, the expected number of examples in Φ and Φ_ε is χ|X| and χ_ε|X|. By the definition of Φ_ε and the bounded variance of the weight, we get that

E[Σ_{i=1}^{|X|} w_i o_i] ≥ ε χ_ε |X|    (16)

Var[Σ_{i=1}^{|X|} w_i o_i] < B χ |X|.    (17)

Since the expected value grows linearly with the sample size and the standard deviation with the square root of the sample size, as |X| → ∞ the weighted sum will with probability one be positive. Thus, as the sample size grows, +1 will minimize the objective function for the set of false negative samples, and the decision boundary from θ* will minimize the objective.

The same argument applied to the false positives shows that θ* will minimize the false positives with probability one. Thus θ* will be chosen with probability one and the theorem is shown.

Acknowledgments

This work was supported by NSF CAREER Award NCR-9624791.

References

[1] Brown, T. X (1995). Classifying loss rates with small samples, Proc. Inter. Workshop on Appl. of NN to Telecom (pp. 153-161). Hillsdale, NJ: Erlbaum.

[2] Brown, T. X (1997). Adaptive access control applied to ethernet data, Advances in Neural Information Processing Systems, 9 (pp. 932-938). MIT Press.

[3] Brown, T. X (1999). Classifying loss rates in broadband networks, INFOCOM '99 (v. 1, pp. 361-370). Piscataway, NJ: IEEE.

[4] Estrella, A. D., et al. (1994). New training pattern selection method for ATM call admission neural control, Elec. Let., v. 30, n. 7, pp. 577-579.

[5] Hiramatsu, A. (1990). ATM communications network control by neural networks, IEEE Trans. 
on Neural Networks, v. 1, n. 1, pp. 122-130.

[6] Hiramatsu, A. (1995). Training techniques for neural network applications in ATM, IEEE Comm. Mag., October, pp. 58-67.

[7] Tong, H., Brown, T. X (1998). Estimating loss rates in an integrated services network by neural networks, Proc. of Global Telecommunications Conference (GLOBECOM 98) (v. 1, pp. 19-24). Piscataway, NJ: IEEE.

[8] Tran-Gia, P., Gropp, O. (1992). Performance of a neural net used as admission controller in ATM systems, Proc. GLOBECOM 92 (pp. 1303-1309). Piscataway, NJ: IEEE.
", "award": [], "sourceid": 1919, "authors": [{"given_name": "Timothy", "family_name": "Brown", "institution": null}]}