{"title": "A Linear Programming Approach to Novelty Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 395, "page_last": 401, "abstract": null, "full_text": "A Linear Programming Approach to Novelty Detection \n\nColin Campbell \nDept. of Engineering Mathematics \nBristol University \nBristol, BS8 1TR \nUnited Kingdom \nC.Campbell@bris.ac.uk \n\nKristin P. Bennett \nDept. of Mathematical Sciences \nRensselaer Polytechnic Institute \nTroy, New York 12180-3590 \nUnited States \nbennek@rpi.edu \n\nAbstract \n\nNovelty detection involves modelling the normal behaviour of a system, hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data, i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice; training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets. \n\n1 Introduction \n\nAn important classification task is the ability to distinguish between new instances similar to members of the training set and all other instances that can occur. For example, we may want to learn the normal running behaviour of a machine and highlight any significant divergence from normality which may indicate onset of damage or faults. This issue is a generic problem in many fields.
For example, an abnormal event or feature in medical diagnostic data typically leads to further investigation. \n\nNovel events can be highlighted by constructing a real-valued density estimation function. However, here we will consider the simpler task of modelling the support of a data distribution, i.e. creating a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere. \n\nRecently kernel methods have been applied to this problem [4]. In this approach data is implicitly mapped to a high-dimensional space called feature space [13]. Suppose the data points in input space are x_i (with i = 1, ..., m) and the mapping is x_i -> φ(x_i); then, in the span of {φ(x_i)}, we can expand a vector w = Σ_j α_j φ(x_j). Hence we can define separating hyperplanes in feature space by w · φ(x_i) + b = 0. We will refer to w · φ(x_i) + b as the margin, which will be positive on one side of the separating hyperplane and negative on the other. Thus we can also define a decision function: \n\nf(z) = sign(w · φ(z) + b) (1) \n\nwhere z is a new data point. The data appears in the form of an inner product in feature space so we can implicitly define feature space by our choice of kernel function: \n\nK(x_i, x_j) = φ(x_i) · φ(x_j) (2) \n\nA number of choices for the kernel are possible, for example, RBF kernels: \n\nK(x_i, x_j) = exp(-||x_i - x_j||^2 / 2σ^2) (3) \n\nWith the given kernel the decision function is therefore given by: \n\nf(z) = sign(Σ_i α_i K(z, x_i) + b) (4) \n\nOne approach to novelty detection is to find a hypersphere in feature space with a minimal radius R and centre a which contains most of the data: novel test points lie outside the boundary of this hypersphere [3, 12]. This approach to novelty detection was proposed by Tax and Duin [10] and successfully used on real life applications [11].
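As a minimal illustrative sketch (not the authors' code), the kernelised decision function (4) can be evaluated directly for an RBF kernel (3); the trained weights alpha and bias b are assumed to be given:

```python
import numpy as np

def rbf_kernel(z, X, sigma):
    # K(z, x_i) = exp(-||z - x_i||^2 / (2 sigma^2)), one value per row of X
    return np.exp(-((X - z) ** 2).sum(axis=1) / (2.0 * sigma ** 2))

def decide(z, X, alpha, b, sigma):
    # f(z) = sign(sum_i alpha_i K(z, x_i) + b): +1 for normal, -1 for novel
    return np.sign(alpha @ rbf_kernel(z, X, sigma) + b)
```

A point close to the training data has kernel values near 1 and is labelled normal; far from the data all kernel values vanish and the sign of b alone decides, flagging novelty.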
The effect of outliers is reduced by using slack variables ξ_i to allow for datapoints outside the sphere, and the task is to minimise the volume of the sphere and the number of datapoints outside, i.e. \n\nmin [ R^2 + λ Σ_i ξ_i ] \ns.t. (x_i - a) · (x_i - a) ≤ R^2 + ξ_i,  ξ_i ≥ 0 (5) \n\nSince the data appears in the form of inner products, kernel substitution can be applied and the learning task can be reduced to a quadratic programming problem. An alternative approach has been developed by Scholkopf et al. [7]. Suppose we restrict our attention to RBF kernels (3); then the data lies on the surface of a hypersphere in feature space since φ(x) · φ(x) = K(x, x) = 1. The objective is therefore to separate off the surface region containing data from the region containing no data. This is achieved by constructing a hyperplane which is maximally distant from the origin, with all datapoints lying on the opposite side from the origin and such that the margin is positive. The learning task in dual form involves minimisation of: \n\nmin W(α) = (1/2) Σ_{i,j=1}^m α_i α_j K(x_i, x_j) \ns.t. 0 ≤ α_i ≤ C,  Σ_{i=1}^m α_i = 1. (6) \n\nHowever, the origin plays a special role in this model. As the authors point out [9], this is a disadvantage since the origin effectively acts as a prior for where the class of abnormal instances is assumed to lie. In this paper we avoid this problem: rather than repelling the hyperplane away from an arbitrary point outside the data distribution, we instead try to attract the hyperplane towards the centre of the data distribution. \n\nIn this paper we will outline a new algorithm for novelty detection which can be easily implemented using linear programming (LP) techniques. As we illustrate in section 3, it performs well in practice on datasets involving the detection of abnormalities in medical data and fault detection in condition monitoring.
\n\n2 The Algorithm \n\nFor the hard margin case (see Figure 1) the objective is to find a surface in input space which wraps around the data clusters: anything outside this surface is viewed as abnormal. This surface is defined as the level set, f(z) = 0, of some nonlinear function. In feature space, f(z) = Σ_i α_i K(z, x_i) + b, this corresponds to a hyperplane which is pulled onto the mapped datapoints with the restriction that the margin always remains positive or zero. We make the fit of this nonlinear function or hyperplane as tight as possible by minimising the mean value of the output of the function, i.e., Σ_i f(x_i). This is achieved by minimising: \n\nW(α, b) = Σ_{i=1}^m ( Σ_{j=1}^m α_j K(x_i, x_j) + b ) (7) \n\nsubject to: \n\nΣ_{j=1}^m α_j K(x_i, x_j) + b ≥ 0 (8) \n\nΣ_{i=1}^m α_i = 1,  α_i ≥ 0 (9) \n\nThe bias b is just treated as an additional parameter in the minimisation process, though unrestricted in sign. The added constraints (9) on α bound the class of models to be considered - we don't want to consider simple linear rescalings of the model. These constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. Also, these constraints ensure that the problem is well-posed and that an optimal solution with α ≠ 0 exists. Other constraints on the class of functions are possible, e.g. ||α||_1 = 1 with no restriction on the sign of α_i. \n\nMany real-life datasets contain noise and outliers. To handle these we can introduce a soft margin, in analogy to the usual approach used with support vector machines. In this case we minimise: \n\nW(α, b) = Σ_{i=1}^m ( Σ_{j=1}^m α_j K(x_i, x_j) + b ) + λ Σ_{i=1}^m ξ_i (10) \n\nsubject to: \n\nΣ_{j=1}^m α_j K(x_i, x_j) + b ≥ -ξ_i,  ξ_i ≥ 0 (11) \n\nand constraints (9). The parameter λ controls the extent of margin errors (larger λ means fewer outliers are ignored; λ -> ∞ corresponds to the hard margin limit).
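As an illustrative sketch only (the paper does not list code), the soft-margin problem (9)-(11) maps directly onto a standard LP solver. Here scipy.optimize.linprog stands in for the simplex codes used by the authors; the variable vector is ordered as [alpha (m entries), b, xi (m entries)], and an RBF kernel is assumed:

```python
import numpy as np
from scipy.optimize import linprog

def rbf_gram(X, sigma):
    # pairwise K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_lp_novelty(X, sigma, lam):
    m = X.shape[0]
    K = rbf_gram(X, sigma)
    # objective (10): sum_i f(x_i) + lam * sum_i xi_i
    #   = (column sums of K) . alpha + m*b + lam * 1 . xi
    c = np.concatenate([K.sum(axis=0), [float(m)], lam * np.ones(m)])
    # soft-margin constraints (11): K alpha + b + xi >= 0, as -K alpha - b - xi <= 0
    A_ub = np.hstack([-K, -np.ones((m, 1)), -np.eye(m)])
    b_ub = np.zeros(m)
    # normalisation constraint (9): sum_i alpha_i = 1
    A_eq = np.concatenate([np.ones(m), [0.0], np.zeros(m)])[None, :]
    # alpha >= 0, b free, xi >= 0
    bounds = [(0, None)] * m + [(None, None)] + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    alpha, b = res.x[:m], res.x[m]
    return alpha, b

def score(Z, X, alpha, b, sigma):
    # f(z) = sum_i alpha_i K(z, x_i) + b; negative values flag novelty
    d2 = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) @ alpha + b
```

The hard margin case is recovered by taking lam large; for data that fits in memory this dense formulation has 2m + 1 variables and m inequality constraints.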
\nThe above problem can be easily solved for problems with thousands of points using standard simplex or interior point algorithms for linear programming. With the addition of column generation techniques, these same approaches can be adopted for very large problems in which the kernel matrix exceeds the capacity of main memory. Column generation algorithms incrementally add and drop columns, each corresponding to a single kernel function, until optimality is reached. Such approaches have been successfully applied to other support vector problems [6, 2]. Basic simplex algorithms were sufficient for the problems considered in this paper, so we defer a listing of the code for column generation to a later paper, together with experiments on large datasets [1]. \n\n3 Experiments \n\nArtificial datasets. Before considering experiments on real-life data we will first illustrate the performance of the algorithm on some artificial datasets. In Figure 1 the algorithm places a boundary around two data clusters in input space: a hard margin was used with RBF kernels and σ = 0.2. In Figure 2 four outliers lying outside a single cluster are ignored when the system is trained using a soft margin. In Figure 3 we show the effect of using a modified RBF kernel K(x_i, x_j) = exp(-|x_i - x_j| / 2σ^2). This kernel and the one in (3) depend only on the difference x - y; thus K(x, x) is constant and the points lie on the surface of a hypersphere in feature space. As a consequence a hyperplane slicing through this hypersphere gives a closed boundary separating normal and abnormal in input space; however, we found other choices of kernels may not produce closed boundaries in input space.
\n\nFigure 1: The solution in input space for the hyperplane minimising W(α, b) in equation (7). A hard margin was used with RBF kernels trained using σ = 0.2. \n\nMedical Diagnosis. For detection of abnormalities in medical data we investigated performance on the Biomed dataset [5] from the Statlib data archive [14]. \n\nFigure 2: In this example 4 outliers are ignored by using a soft margin (with λ = 10.0). RBF kernels were used with σ = 0.2. \n\nFigure 3: The solution in input space for a modified RBF kernel K(x_i, x_j) = exp(-|x_i - x_j| / 2σ^2) with σ = 0.5. \n\nThis dataset consisted of 194 observations, each with 4 attributes corresponding to measurements made on blood samples (15 observations with missing values were removed). We trained the system on 100 randomly chosen normal observations from healthy patients. The system was then tested on 27 normal observations and 67 observations which exhibited abnormalities due to the presence of a rare genetic disease. \n\nIn Figure 4 we plot the results for training the novelty detector using a hard margin and with RBF kernels. This plot gives the error rate (as a percentage) on the y-axis, versus σ on the x-axis, with the solid curve giving the performance on normal observations in the test data and the dashed curve giving performance on abnormal observations. Clearly, when σ is very small the system puts a Gaussian of narrow width around each data point and hence all test data is labelled as abnormal.
As σ increases the model improves, and at σ = 1.1 all but 2 of the normal test observations are correctly labelled and 57 of the 67 abnormal observations are correctly labelled. As σ increases to σ = 10.0 the solution has 1 normal test observation incorrectly labelled and 29 abnormal observations correctly labelled. \n\nFigure 4: The error rate (as a percentage) on the y-axis, versus σ on the x-axis. The solid curve gives the performance on normal observations in the test data and the dashed curve gives performance on abnormal observations. \n\nThe kernel parameter σ is therefore crucial in determining the balance between normality and abnormality. Future research on model selection may indicate a good choice for the kernel parameter. However, if the dataset is large enough and some abnormal events are known, then a validation study can be used to determine the kernel parameter - as we illustrate with the application below. Interestingly, if we use an ensemble of models instead, with σ chosen across a range, then the relative proportion indicating abnormality gives an approximate measure of the confidence in the novelty of an observation: 29 observations are abnormal for all σ in Figure 4 and hence must be abnormal with high confidence. \n\nCondition Monitoring. Fault detection is an important generic problem in the condition monitoring of machinery: failure to detect faults can lead to machine damage, while an oversensitive fault detection system can lead to expensive and unnecessary downtime. As an example we will consider detection of 4 classes of fault in ball-bearing cages, which are often safety-critical components in machines, vehicles and other systems such as aircraft wing flaps. \n\nIn this study we used a dataset from the Structural Integrity and Damage Assessment Network [15].
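The ensemble heuristic described above can be sketched as follows; the scorers here are hypothetical stand-ins for detectors trained at different σ values (any function returning f(z), negative for abnormal, would do):

```python
import numpy as np

def novelty_confidence(score_fns, z):
    # fraction of ensemble members labelling z abnormal, i.e. f(z) < 0
    return float(np.mean([fn(z) < 0.0 for fn in score_fns]))

# hypothetical stand-in scorers: radius thresholds at three scales
centre = np.zeros(2)
score_fns = [lambda z, s=s: s - np.linalg.norm(z - centre) for s in (0.5, 1.0, 2.0)]
conf = novelty_confidence(score_fns, np.array([1.5, 0.0]))  # flagged by 2 of 3 models
```

A point flagged by every member of the ensemble (confidence 1.0) is abnormal with high confidence, matching the 29 observations flagged for all σ in Figure 4.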
Each instance consisted of 2048 samples of acceleration taken with a Bruel and Kjaer vibration analyser. After pre-processing with a discrete Fast Fourier Transform, each such instance had 32 attributes characterising the measured signals. \n\nThe dataset consisted of 5 categories: normal data corresponding to measurements made from new ball-bearings, and 4 types of abnormalities which we will call type 1 (outer race completely broken), type 2 (broken cage with one loose element), type 3 (damaged cage with four loose elements) and type 4 (a badly worn ball-bearing with no evident damage). To train the system we used 913 normal instances on new ball-bearings. Using RBF kernels, the best value of σ (σ = 320.0) was found using a validation study consisting of 913 new normal instances, 747 instances of type 1 faults and 996 instances of type 2 faults. On new test data 98.7% of normal instances were correctly labelled (913 instances), 100% of type 1 instances were correctly labelled (747 instances) and 53.3% of type 2 instances were correctly labelled (996 instances). Of course, with ample normal and abnormal data this problem could also be approached using a binary classifier instead. Thus, to evaluate performance on totally unseen abnormalities, we tested the novelty detector on type 3 and type 4 faults (with 996 instances of each). The novelty detector labelled 28.3% of type 3 and 25.5% of type 4 instances as abnormal - which was statistically significant against a background of 1.3% errors on normal data. \n\n4 Conclusion \n\nIn this paper we have presented a new kernelised novelty detection algorithm which uses linear programming techniques rather than quadratic programming. The algorithm is simple, easy to implement with standard LP software packages, and it performs well in practice.
The algorithm is also very fast in execution: for the 913 training examples used in the experiments on condition monitoring, the model was constructed in about 4 seconds using a Silicon Graphics Origin 200. \n\nReferences \n\n[1] K. Bennett and C. Campbell. A Column Generation Algorithm for Novelty Detection. Preprint in preparation. \n\n[2] K. Bennett, A. Demiriz and J. Shawe-Taylor. A Column Generation Algorithm for Boosting. In Proceedings of the International Conference on Machine Learning, Stanford, CA, 2000. \n\n[3] C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, p. 121-167, 1998. \n\n[4] C. Campbell. An Introduction to Kernel Methods. In: Radial Basis Function Networks: Design and Applications. R.J. Howlett and L.C. Jain (eds). Physica Verlag, Berlin, to appear. \n\n[5] L. Cox, M. Johnson and K. Kafadar. Exposition of Statistical Graphics Technology. ASA Proceedings of the Statistical Computation Section, p. 55-56, 1982. \n\n[6] O. L. Mangasarian and D. Musicant. Massive Support Vector Regression. Data Mining Institute Technical Report 99-02, University of Wisconsin-Madison, 1999. \n\n[7] B. Scholkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola and R.C. Williamson. Estimating the support of a high-dimensional distribution. Microsoft Research Technical Report MSR-TR-99-87, 1999. \n\n[8] B. Scholkopf, R. Williamson, A. Smola and J. Shawe-Taylor. SV estimation of a distribution's support. In Neural Information Processing Systems, 2000, to appear. \n\n[9] B. Scholkopf, J. Platt and A. Smola. Kernel Method for Percentile Feature Extraction. Microsoft Research Technical Report MSR-TR-2000-22. \n\n[10] D. Tax and R. Duin. Data domain description by Support Vectors. In Proceedings of ESANN99, ed. M. Verleysen, D. Facto Press, Brussels, p. 251-256, 1999. \n\n[11] D. Tax, A. Ypma and R. Duin.
Support vector data description applied to machine vibration analysis. In: M. Boasson, J. Kaandorp, J. Tonino and M. Vosselman (eds.), Proceedings of the 5th Annual Conference of the Advanced School for Computing and Imaging (Heijen, NL, June 15-17), 1999, p. 398-405. \n\n[12] V. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995. \n\n[13] V. Vapnik. Statistical Learning Theory. Wiley, 1998. \n\n[14] cf. http://lib.stat.cmu.edu/datasets \n\n[15] http://www.sidanet.org \n", "award": [], "sourceid": 1822, "authors": [{"given_name": "Colin", "family_name": "Campbell", "institution": null}, {"given_name": "Kristin", "family_name": "Bennett", "institution": null}]}