{"title": "Limits on Learning Machine Accuracy Imposed by Data Quality", "book": "Advances in Neural Information Processing Systems", "page_first": 239, "page_last": 246, "abstract": "", "full_text": "Limits on Learning Machine Accuracy Imposed by Data Quality \n\nCorinna Cortes, L. D. Jackel, and Wan-Ping Chiang \n\nAT&T Bell Laboratories \n\nHolmdel, NJ 07733 \n\nAbstract \n\nRandom errors and insufficiencies in databases limit the performance of any classifier trained from and applied to the database. In this paper we propose a method to estimate the limiting performance of classifiers imposed by the database. We demonstrate this technique on the task of predicting failure in telecommunication paths. \n\n1 Introduction \n\nData collection for a classification or regression task is prone to random errors, e.g. inaccuracies in the measurements of the input or mis-labeling of the output. Missing or insufficient data are other sources that may complicate a learning task and hinder accurate performance of the trained machine. These insufficiencies of the data limit the performance of any learning machine or other statistical tool constructed from and applied to the data collection - no matter how complex the machine or how much data is used to train it. \n\nIn this paper we propose a method for estimating the limiting performance of learning machines imposed by the quality of the database used for the task. The method involves a series of learning experiments. The extracted result is, however, independent of the choice of learning machine used for these experiments, since the estimated limiting performance expresses a characteristic of the data. The only requirements on the learning machines are that their capacity (VC-dimension) can be varied and can be made large, and that the learning machines with increasing capacity become capable of implementing any function. 
\n\nWe have applied the technique to data collected for the purpose of predicting failures in telecommunication channels of the AT&T network. We extracted information from one of AT&T's large databases that continuously logs performance parameters of the network. The amount and character of the data are more than humans can survey, so the processing of the extracted information is automated by learning machines. \n\nWe conjecture that the quality of the data imposes a limiting error rate on any learning machine of ~25%, so that even with an unlimited amount of data and an arbitrarily complex learning machine, the performance for this task will not exceed ~75% correct. This conjecture is supported by experiments. \n\nThe relatively high noise level of the data, which carries over to a poor performance of the trained classifier, is typical for many applications: the data collection was not designed for the task at hand and proved inadequate for constructing high performance classifiers. \n\n2 Basic Concepts of Machine Learning \n\nWe can picture a learning machine as a device that takes an unknown input vector and produces an output value. More formally, it performs some mapping from an input space to an output space. The particular mapping it implements depends on the setting of the internal parameters of the learning machine. These parameters are adjusted during a learning phase so that the labels produced on the training set match, as well as possible, the labels provided. The number of patterns that the machine can match is loosely called the \"capacity\" of the machine. Generally, the capacity of a machine increases with the number of free parameters. After training is complete, the generalization ability of the machine is estimated by its performance on a test set which the machine has never seen before. 
\nThe test and training error depend on the number of training examples l, the capacity h of the machine, and, of course, on how well suited the machine is to implement the task at hand. Let us first discuss the typical behavior of the test and training error for a noise-corrupted task as we vary h but keep the amount l of training data fixed. This scenario can, e.g., be obtained by increasing the number of hidden units in a neural network or increasing the number of codebook vectors in a Learning Vector Quantization algorithm [6]. Figure 1a) shows typical training and test error as a function of the capacity of the learning machine. For h << l we have many fewer free parameters than training examples and the machine is over-constrained. It does not have enough complexity to model the regularities of the training data, so both the training and test error are large (underfitting). As we increase h the machine can begin to fit the general trends in the data, which carries over to the test set, so both error measures decline. Because the performance of the machine is optimized on only part of the full pattern space, the test error will always be larger than the training error. As we continue to increase the capacity of the learning machine the error on the training set continues to decline, and eventually it reaches zero as we get enough free parameters to completely model the training set. The behavior of the error on the test set is different. Initially it decreases, but at some capacity, h*, it starts to rise. The rise occurs because the now ample resources of the learning machine are applied to learning vagaries of the training 
set, which are not reproduced in the test set (overfitting). Notice how in Figure 1a) the optimal test error is achieved at a capacity h* that is smaller than the capacity for which zero error is achieved on the training set. The learning machine with capacity h* will typically commit errors on misclassified or outlying patterns of the training set. \n\nFigure 1: Errors as function of capacity and training set size. Figure 1a) shows characteristic plots of training and test error as a function of the learning machine capacity for fixed training set size. The test error reaches a minimum at h = h* while the training error decreases as h increases. Figure 1b) shows the training and test errors at fixed h for varying l. The dotted line marks the asymptotic error Eoo for infinite l. Figure 1c) shows the asymptotic error as a function of h. This error is limited from below by the intrinsic noise in the data. \n\nWe can alternatively discuss the error on the test and training set as a function of the training set size l for fixed capacity h of the learning machine. Typical behavior is sketched in Figure 1b). For small l we have enough free parameters to completely model the training set, so the training error is zero. Excess capacity is used by the learning machine to model details in the training set, leading to a large test error. As we increase the training set size l we train on more and more patterns, so the test error declines. For some critical size of the training set, l_c, the machine can no longer model all the training patterns and the training error starts to rise. As we further increase l the irregularities of the individual training patterns smooth out, and the parameters of the learning machine are more and more used to model the true underlying function. 
The test error declines, and asymptotically the training and test error reach the same error value Eoo. This error value is the limiting performance of the given learning machine on the task. In practice we never have the infinite amount of training data needed to achieve Eoo. However, recent theoretical calculations [8, 1, 2, 7, 5] and experimental results [3] have shown that we can estimate Eoo by averaging the training and test errors for l > l_c. This means we can predict the optimal performance of a given machine. \n\nFor a given type of learning machine the value of the asymptotic error Eoo of the machine depends on the quality of the data and the set of functions it can implement. The set of available functions increases with the capacity of the machine: low capacity machines will typically exhibit a high asymptotic error due to a big difference between the true noise-free function of the patterns and the function implemented by the learning machine, but as we increase h this difference decreases. If the learning machine with increasing h becomes a universal machine capable of modeling any function, the difference eventually reaches zero, so the asymptotic error Eoo only measures the intrinsic noise level of the data. Once a capacity of the machine has been reached that matches the complexity of the true function, no further improvement in Eoo can be achieved. This is illustrated in Figure 1c). The intrinsic noise level of the data, or the limiting performance of any learning machine, may hence be estimated as the asymptotic value of Eoo obtained for asymptotically universal learning machines with increasing capacity applied to the task. This technique will be illustrated in the following section. 
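The estimation procedure just described can be sketched in code. Everything below is an illustrative stand-in, not the machines or data of the paper: a synthetic binary task whose labels are flipped with probability 0.25, and a histogram classifier whose bin count plays the role of capacity. The point is that the mean of training and test error, taken for machines of growing capacity, levels off near the intrinsic noise level (25% by construction here).

```python
import random

random.seed(0)

def make_data(n, noise=0.25):
    # Noisy binary task: true label is 1 when x > 0.5, flipped with
    # probability noise, so no classifier can beat 25% error.
    data = []
    for _ in range(n):
        x = random.random()
        y = 1 if x > 0.5 else 0
        if random.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

class HistogramClassifier:
    # Capacity is the number of bins: more bins, more free parameters.
    def __init__(self, bins):
        self.bins = bins
        self.labels = [0] * bins

    def _bin(self, x):
        return min(int(x * self.bins), self.bins - 1)

    def fit(self, data):
        votes = [[0, 0] for _ in range(self.bins)]
        for x, y in data:
            votes[self._bin(x)][y] += 1
        self.labels = [0 if v[0] >= v[1] else 1 for v in votes]

    def error(self, data):
        wrong = sum(1 for x, y in data if self.labels[self._bin(x)] != y)
        return wrong / len(data)

train, test = make_data(2000), make_data(2000)
for bins in (1, 2, 4, 16, 64, 256):
    clf = HistogramClassifier(bins)
    clf.fit(train)
    # Mean of training and test error estimates Eoo for this capacity;
    # as capacity grows it settles near the intrinsic noise level.
    e_inf = (clf.error(train) + clf.error(test)) / 2
    print(f'capacity {bins:4d}: estimated Eoo = {e_inf:.3f}')
```

With one bin the machine is over-constrained and the estimate stays near 50%; once capacity suffices to model the step function, the estimate hovers around the 25% flip rate and further capacity buys nothing, mirroring Figure 1c).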
\n\n3 Experimental Results \n\nIn this section we estimate the limiting performance imposed by the data on any learning machine applied to the particular prediction task. \n\n3.1 Task Description \n\nTo ensure the highest possible quality of service, the performance parameters of the AT&T network are constantly monitored. Due to the high complexity of the network this performance surveillance is mainly corrective: when certain measures exceed preset thresholds, action is taken to maintain reliable, high quality service. These reorganizations can lead to short, minor impairments of the quality of the communication path. In contrast, the work reported here is preventive: our objective is to make use of the performance parameters to form predictions that are sufficiently accurate that preemptive repairs of the channels can be made during periods of low traffic. \n\nIn our study we have examined the characteristics of long-distance, 45 Mbit/s communication paths in the domestic AT&T network. The paths are specified from one city to another and may include different kinds of physical links to complete the paths. A path from New York City to Los Angeles might include both optical fiber and coaxial cable. To maintain high-quality service, particular links in a path may be switched out and replaced by other, redundant links. \n\nThere are two primary ways in which performance degradation is manifested in the path. First is the simple bit-error rate, the fraction of transmitted bits that are not correctly received at the termination of the path. Barring catastrophic failure (like a cable being cut), this error rate can be measured by examining the error-checking bits that are transmitted along with the data. The second instance of degradation, \"framing error\", is the failure of synchronization between the transmitter and receiver in a path. A framing error implies a high count of errored bits. 
\n\nIn order to better characterize the distribution of bit errors, several measures are historically used to quantify the path performance in a 15-minute interval. These measures are: \n\nLow-Rate The number of seconds with exactly 1 error. \n\nMedium-Rate The number of seconds with more than one but less than 45 errors. \n\nHigh-Rate The number of seconds with 45 or more errors, corresponding to a bit error rate of at least 10^-6. \n\nFrame-Error The number of seconds with a framing error. A second with a frame-error is always accompanied by a second of High-Rate error. \n\nFigure 2: Errors as function of time. The 3 top patterns are members of the \"No-Trouble\" class. The 3 bottom ones are members of the \"Trouble\" class. Errors are here plotted as mean values over hours. \n\nAlthough the number of seconds with the errors described above could in principle be as high as 900, any value greater than 255 is automatically clipped back to 255 so that each error measure value can be stored in 8 bits. \n\nDaily data that include these measures are continuously logged in an AT&T database that we call Perf(ormance)Mon(itor). Since a channel is error free most of the time, an entry in the database is only made if its error measures for a 15-minute period exceed fixed low thresholds, e.g. 4 Low-Rate seconds, 1 Medium- or High-Rate second, or 1 Frame-Error. In our research we \"mined\" PerfMon to formulate a prediction strategy. We extracted examples of path histories 28 days long where the path at day 21 had at least 1 entry in the PerfMon database. 
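A minimal sketch of how the four measures above could be tallied for one 15-minute interval. The per-second input format and the function name are assumptions for illustration, not the PerfMon implementation; the clipping to 255 keeps each value in 8 bits as described.

```python
def quantize_interval(seconds):
    # seconds: up to 900 (bit_errors, frame_error) pairs, one per
    # second of the 15-minute interval (an assumed input layout).
    low = medium = high = frame = 0
    for bit_errors, frame_error in seconds:
        if bit_errors == 1:
            low += 1
        elif 1 < bit_errors < 45:
            medium += 1
        elif bit_errors >= 45:
            high += 1
        if frame_error:
            # A frame-error second is always also a High-Rate second.
            frame += 1
    def clip(n):
        # Values above 255 are clipped so each measure fits in 8 bits.
        return min(n, 255)
    return {'Low-Rate': clip(low), 'Medium-Rate': clip(medium),
            'High-Rate': clip(high), 'Frame-Error': clip(frame)}
```

For example, an interval with 300 single-error seconds and 10 frame-error seconds, quantize_interval([(1, False)] * 300 + [(50, True)] * 10), yields {'Low-Rate': 255, 'Medium-Rate': 0, 'High-Rate': 10, 'Frame-Error': 10}, the Low-Rate count having been clipped.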
We labeled the examples according to the error measures over the next 7 days. If the channel exhibited a 15-minute period with at least 5 High-Rate seconds we labeled it as belonging to the class \"Trouble\". Otherwise we labeled it as a member of \"No-Trouble\". \n\nThe lengths of the history- and future-windows are set somewhat arbitrarily. The history has to be long enough to capture the state of the path but short enough that our learning machine will run in a reasonable time. Also, the longer the history, the more likely the physical implementation of the path was modified, so the error measures correspond to different media. Such error histories could in principle be eliminated from the extracted examples using the record of the repairs and changes of the network. The complexity of this database, however, hinders this filtering of examples. The future-window of 7 days was set as a design criterion by the network system engineers. \n\nExamples of histories drawn from PerfMon are shown in Figure 2. Each group of traces in the figure includes plots of the 4 error measures previously described. The 3 groups at the top are examples that resulted in No-Trouble while the examples at the bottom resulted in Trouble. Notice how bursty and irregular the errors are, and how the overall level of Frame- and High-Rate errors for the Trouble class seems only slightly higher than for the No-Trouble class, indicating the difficulty of the classification task as defined from the database PerfMon. PerfMon constitutes, however, the only stored information about the state of a given channel in its entirety, and thus all the knowledge on which one can base channel end-to-end predictions: it is impossible to install extra monitoring equipment to provide other than the 4 mentioned end-to-end error measures. 
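The labeling rule above can be written down directly. The input format is an assumption for illustration: one High-Rate second count per 15-minute interval over the 7-day future window.

```python
def label_example(future_high_rate, threshold=5):
    # future_high_rate holds one High-Rate second count per 15-minute
    # interval of the 7-day future window (7 * 24 * 4 = 672 intervals).
    # A single interval with at least 5 High-Rate seconds marks Trouble.
    if any(count >= threshold for count in future_high_rate):
        return 'Trouble'
    return 'No-Trouble'
```

So label_example([0, 0, 4, 0]) is 'No-Trouble' while label_example([0, 0, 5, 0]) is 'Trouble'.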
\n\nThe above criteria for constructing examples and labels from 3 months of PerfMon data resulted in 16325 examples from about 900 different paths, with 33.2% of the examples in the class Trouble. This means that always guessing the label of the largest class, No-Trouble, would produce an error rate of about 33%. \n\n3.2 Estimating Limiting Performance \n\nThe 16325 path examples were randomly divided into a training set of 14512 examples and a test set of 1813 examples. Care was taken to ensure that a path only contributes to one of the sets, so the two sets were independent, and that the two sets had similar statistical properties. \n\nOur input data has a time resolution of 15 minutes. For the results reported here the 4 error measures of the patterns were subsampled to mean values over days, yielding an input dimensionality of 4 x 21. \n\nWe performed two sets of independent experiments. In one experiment we used fully connected neural networks with one layer of hidden units. In the other we used LVQ learning machines with an increasing number of codebook vectors. Both choices of machine have two advantages: the capacity of the machine can easily be increased by adding more hidden units or codebook vectors, and with increasing capacity we can eventually model any mapping [4]. We first discuss the results with neural networks. \n\nBaseline performance was obtained from a threshold classifier by averaging all the input signals and thresholding the result. The training data was used to adjust the single threshold parameter. With this classifier we obtained 32% error on the training set and 33% error on the test set. The small difference between the two error measures indicates statistically induced differences in the difficulty of the training and test sets. 
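The baseline just described is simple enough to sketch. The data layout (each example a pair of a flattened 4 x 21 feature list and a 0/1 label) and the exhaustive search over candidate thresholds are illustrative assumptions; only the idea of one free parameter fitted on the training set comes from the text.

```python
def fit_threshold(examples):
    # Baseline: average all input signals into one scalar score and
    # choose the single threshold that minimizes training error.
    scored = sorted((sum(f) / len(f), y) for f, y in examples)
    candidates = [s for s, _ in scored] + [scored[-1][0] + 1.0]
    best_t, best_err = candidates[0], len(scored) + 1
    for t in candidates:
        # Predict Trouble (1) when the mean error level exceeds t.
        err = sum(1 for s, y in scored if (1 if s > t else 0) != y)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def predict(threshold, features):
    return 1 if sum(features) / len(features) > threshold else 0
```

Trying each observed score as the threshold is sufficient here because the training error can only change at those points.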
An analysis of the errors committed revealed that the performance of this classifier is almost identical to always guessing the label of the largest class \"No-Trouble\": close to 100% of the errors are false negatives. \n\nA linear classifier with about 200 weights (the network has two output units) obtained 28% error on the training set and 32% error on the test set. \n\nFigure 3: a) Measured classification errors for neural networks with increasing number of weights (capacity). The mean value between the test and training error estimates the performance of the given classifier trained with unlimited data. b) Measured classification errors for LVQ classifiers with increasing number of codebook vectors. \n\nFurther experiments exploited neural nets with one layer of respectively 3, 5, 7, 10, 15, 20, 30, and 40 hidden units. All our results are summarized in Figure 3a). This figure illustrates several points mentioned in the text above. As the complexity of the network increases, the training error decreases because the networks get more free parameters to memorize the data. Compare to Figure 1a). The test error also decreases at first, going through a minimum of 29% at the network with 5 hidden units. This network apparently has a capacity that best matches the amount and character of the available training data. For higher capacity the networks overfit the data at the expense of increased error on the test set. \n\nFigure 3a) should also be compared to Figure 1c). In Figure 3a) we plotted approximate values of Eoo for the various networks - the minimal error of the network on the given task. 
The values of Eoo are estimated as the mean of the training and test errors. The value of Eoo appears to flatten out around the network with 30 hidden units, asymptotically reaching a value of 24% error. \n\nAn asymptotic Eoo-value of 25% was obtained from LVQ experiments with an increasing number of codebook vectors. These results are summarized in Figure 3b). We therefore conjecture that the intrinsic noise level of the task is about 25%, and this number is the limiting error rate imposed by the quality of the data on any learning machine applied to the task. \n\n4 Conclusion \n\nIn this paper we have proposed a method for estimating the limits on performance imposed by the quality of the database on which a task is defined. The method involves a series of learning experiments. The extracted result is, however, independent of the choice of learning machine used for these experiments, since the estimated limiting performance expresses a characteristic of the data. The only requirements on the learning machines are that their capacity can be varied and be made large, and that the machines with increasing capacity become capable of implementing any function. In this paper we have demonstrated the robustness of our method to the choice of classifiers: the result obtained with neural networks is in statistical agreement with the result obtained for LVQ classifiers. \n\nUsing the proposed method we have investigated how well prediction of upcoming trouble in a telecommunication path can be performed based on information extracted from a given database. The analysis has revealed a very high intrinsic noise level of the extracted information and demonstrated the inadequacy of the data for constructing high performance classifiers. This study is typical of many applications where the data collection was not necessarily designed for the problem at hand. 
\nAcknowledgments \n\nWe gratefully acknowledge Vladimir Vapnik, who brought this application to the attention of the Holmdel authors. One of the authors (CC) would also like to thank Walter Dziama, Charlene Paul, Susan Blackwood, Eric Noel, and Harris Drucker for lengthy explanations and helpful discussions of the AT&T transport system. \n\nReferences \n\n[1] S. Bos, W. Kinzel, and M. Opper. The generalization ability of perceptrons with continuous output. Physical Review E, 47:1384-1391, 1993. \n\n[2] Corinna Cortes. Prediction of Generalization Ability in Learning Machines. PhD thesis, University of Rochester, NY, 1993. \n\n[3] Corinna Cortes, L. D. Jackel, Sara A. Solla, V. Vapnik, and John S. Denker. Learning curves: Asymptotic value and rate of convergence. In Advances in Neural Information Processing Systems, volume 6. Morgan Kaufmann, 1994. \n\n[4] G. Cybenko, K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward neural networks are universal approximators. Neural Networks, 2:359-366, 1989. \n\n[5] T. L. Fine. Statistical generalization and learning. Technical Report EE577, Cornell University, 1993. \n\n[6] Teuvo Kohonen, Gyorgy Barna, and Ronald Chrisley. Statistical pattern recognition with neural networks: Benchmarking studies. In Proc. IEEE Int. Conf. on Neural Networks, IJCNN-88, volume 1, pages I-61-I-68, 1988. \n\n[7] N. Murata, S. Yoshizawa, and S. Amari. Learning curves, model selection, and complexity of neural networks. In Advances in Neural Information Processing Systems, volume 5, pages 607-614. Morgan Kaufmann, 1992. \n\n[8] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45:6056-6091, 1992. 
", "award": [], "sourceid": 918, "authors": [{"given_name": "Corinna", "family_name": "Cortes", "institution": null}, {"given_name": "L.", "family_name": "Jackel", "institution": null}, {"given_name": "Wan-Ping", "family_name": "Chiang", "institution": null}]}