{"title": "Density Level Detection is Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1337, "page_last": 1344, "abstract": null, "full_text": " Density Level Detection is Classification\n\n\n\n Ingo Steinwart, Don Hush and Clint Scovel\n Modeling, Algorithms and Informatics Group, CCS-3\n Los Alamos National Laboratory\n {ingo,dhush,jcs}@lanl.gov\n\n\n\n\n\n Abstract\n\n We show that anomaly detection can be interpreted as a binary classifi-\n cation problem. Using this interpretation we propose a support vector\n machine (SVM) for anomaly detection. We then present some theoret-\n ical results which include consistency and learning rates. Finally, we\n experimentally compare our SVM with the standard one-class SVM.\n\n\n\n1 Introduction\n\nOne of the most common ways to define anomalies is by saying that anomalies are not\nconcentrated (see e.g. [1, 2]). To make this precise let Q be our unknown data-generating\ndistribution on the input space X. Furthermore, to describe the concentration of Q we need\na known reference distribution on X. Let us assume that Q has a density h with respect\nto . Then, the sets {h > }, > 0, describe the concentration of Q. Consequently, to\ndefine anomalies in terms of the concentration we only have to fix a threshold level > 0,\nso that an x X is considered to be anomalous whenever x {h }. Therefore our\ngoal is to find the density level set {h }, or equivalently, the -level set {h > }. Note\nthat there is also a modification of this problem where is not known but can be sampled\nfrom. We will see that our proposed method can handle both problems.\n\nFinding density level sets is an old problem in statistics which also has some interesting ap-\nplications (see e.g. [3, 4, 5, 6]) other than anomaly detection. Furthermore, a mathematical\nframework similar to classical PAC-learning has been proposed in [7]. Despite this effort,\nno efficient algorithm is known, which is a) consistent, i.e. 
it always finds the level set of interest asymptotically, and b) learns with fast rates under realistic assumptions on h and μ. In this work we propose such an algorithm which is based on an SVM approach.

Let us now introduce some mathematical notions. We begin with emphasizing that, as in many other papers (see e.g. [5] and [6]), we always assume μ({h = ρ}) = 0. Now, let T = (x1, . . . , xn) ∈ X^n be a training set which is i.i.d. according to Q. Then a density level detection algorithm constructs a function fT : X → ℝ such that the set {fT > 0} is an estimate of the ρ-level set {h > ρ} of interest. Since in general {fT > 0} does not exactly coincide with {h > ρ} we need a performance measure which describes how well {fT > 0} approximates the set {h > ρ}. Probably the best known performance measure (see e.g. [6, 7] and the references therein) for measurable functions f : X → ℝ is

 S_{μ,h,ρ}(f) := μ({f > 0} Δ {h > ρ}),

where Δ denotes the symmetric difference. Obviously, the smaller S_{μ,h,ρ}(f) is, the more {f > 0} coincides with the ρ-level set of h, and a function f minimizes S_{μ,h,ρ} if and only if {f > 0} is μ-almost surely identical to {h > ρ}. Furthermore, for a sequence of functions fn : X → ℝ with S_{μ,h,ρ}(fn) → 0 we easily see that sign fn → 1_{h>ρ} both μ-almost and Q-almost surely, if 1_A denotes the indicator function of a set A. Finally, it is important to note that the performance measure S_{μ,h,ρ} is somehow natural in that it is insensitive to μ-zero sets.

2 Detecting density levels is a classification problem

In this section we show how the density level detection (DLD) problem can be formulated as a binary classification problem. To this end we write Y := {−1, 1} and define:

Definition 2.1 Let μ and Q be probability measures on X and s ∈ (0, 1). Then the probability measure Q ⊕_s μ on X × Y is defined by

 (Q ⊕_s μ)(A) := s E_{x∼Q} 1_A(x, 1) + (1 − s) E_{x∼μ} 1_A(x, −1)

for all measurable A ⊆ X × Y.
Here we used the shorthand 1_A(x, y) := 1_A((x, y)).

Obviously, the measure P := Q ⊕_s μ can be associated with a binary classification problem in which positive samples are drawn from Q and negative samples are drawn from μ. Inspired by this interpretation let us recall that the binary classification risk for a measurable function f : X → ℝ and a distribution P on X × Y is defined by

 R_P(f) = P({(x, y) : sign f(x) ≠ y}),

where we define sign t := 1 if t > 0 and sign t := −1 otherwise. Furthermore, we denote the Bayes risk of P by R_P := inf{R_P(f) | f : X → ℝ measurable}. We will show that learning with respect to S_{μ,h,ρ} is equivalent to learning with respect to R_P(.). To this end we begin with the following easy to prove but fundamental proposition:

Proposition 2.2 Let μ and Q be probability measures on X such that Q has a density h with respect to μ, and let s ∈ (0, 1). Then the marginal distribution of P := Q ⊕_s μ on X is P_X = sQ + (1 − s)μ. Furthermore, we P_X-a.s. have

 P(y = 1|x) = s h(x) / (s h(x) + 1 − s).

Note that the above formula for P_X implies that the μ-zero sets of X are exactly the P_X-zero sets of X. Furthermore, Proposition 2.2 shows that every distribution P := Q ⊕_s μ with dQ := h dμ and s ∈ (0, 1) determines a triple (μ, h, ρ) with ρ := (1 − s)/s and vice-versa. In the following we therefore use the shorthand S_P(f) := S_{μ,h,ρ}(f). Let us now compare R_P(.) with S_P(.). To this end we first observe that h(x) > ρ = (1 − s)/s is equivalent to s h(x)/(s h(x) + 1 − s) > 1/2. By Proposition 2.2 the latter is μ-almost surely equivalent to η(x) := P(y = 1|x) > 1/2, and hence μ({η > 1/2} Δ {h > ρ}) = 0. Now recall that binary classification aims to discriminate {η > 1/2} from {η < 1/2}. Thus it is no surprise that R_P(.) can serve as a performance measure, as the following theorem shows:

Theorem 2.3 Let μ and Q be distributions on X such that Q has a density h with respect to μ. Let ρ > 0 satisfy μ({h = ρ}) = 0. We write s := 1/(1+ρ) and define P := Q ⊕_s μ.
Then for all sequences (fn) of measurable functions fn : X → ℝ the following are equivalent:

i) S_P(fn) → 0.

ii) R_P(fn) → R_P.

In particular, for measurable f : X → ℝ we have S_P(f) = 0 if and only if R_P(f) = R_P.

Proof: For n ∈ ℕ we define En := {fn > 0} Δ {h > ρ}. Since μ({h > ρ} Δ {η > 1/2}) = 0 it is easy to see that the classification risk of fn can be computed by

 R_P(fn) = R_P + ∫_{En} |2η − 1| dP_X . (1)

Now, {|2η − 1| = 0} is a μ-zero set and hence a P_X-zero set. This implies that the measures |2η − 1| dP_X and P_X are absolutely continuous with respect to each other. Furthermore, we have already observed after Proposition 2.2 that P_X and μ are absolutely continuous with respect to each other. Now the assertion follows from S_P(fn) = μ(En).

Theorem 2.3 shows that instead of using S_P(.) as a performance measure for the DLD problem one can alternatively use the classification risk R_P(.). Therefore, we will establish some basic properties of this performance measure in the following. To this end we write I(y, t) := 1_{(−∞,0]}(yt), y ∈ Y and t ∈ ℝ, for the standard classification loss function. With this notation we can easily compute R_P(f):

Proposition 2.4 Let μ and Q be probability measures on X. For ρ > 0 we write s := 1/(1+ρ) and define P := Q ⊕_s μ. Then for all measurable f : X → ℝ we have

 R_P(f) = (1/(1+ρ)) E_Q I(1, sign f) + (ρ/(1+ρ)) E_μ I(−1, sign f) .

It is interesting that the classification risk R_P(.) is strongly connected with another approach for the DLD problem which is based on the so-called excess mass (see e.g. [4], [5], [6], and the references therein). To be more precise let us first recall that the excess mass of a measurable function f : X → ℝ is defined by

 E_P(f) := Q({f > 0}) − ρ μ({f > 0}),

where Q, ρ and μ have the usual meaning. The following proposition, which shows that R_P(.) and E_P(.) are essentially the same, can be easily checked:

Proposition 2.5 Let μ and Q be probability measures on X. For ρ > 0 we write s := 1/(1+ρ) and define P := Q ⊕_s μ.
Then for all measurable f : X → ℝ we have

 E_P(f) = 1 − (1 + ρ) R_P(f).

If Q is an empirical measure based on a training set T in the definition of E_P(.) we obtain the empirical excess mass, which we denote by E_T(.). Then, given a function class F, the (empirical) excess mass approach chooses a function fT ∈ F which maximizes E_T(.) within F. Since the above proposition shows

 E_T(f) = 1 − (1/n) Σ_{i=1}^n I(1, sign f(xi)) − ρ E_μ I(−1, sign f),

we see that this approach is actually a type of empirical risk minimization (ERM).

In the above mentioned papers the analysis of the excess mass approach needs an additional assumption on the behaviour of h around the level ρ. Since this condition can be used to establish a quantified version of Theorem 2.3 we recall it now.

Definition 2.6 Let μ be a distribution on X and h : X → [0, ∞) be a measurable function with ∫ h dμ = 1, i.e. h is a density with respect to μ. For ρ > 0 and 0 ≤ q ≤ ∞ we say that h is of ρ-exponent q if there exists a constant C > 0 such that for all sufficiently small t > 0 we have

 μ({|h − ρ| ≤ t}) ≤ C t^q . (2)

Condition (2) was first considered in [5, Thm. 3.6.]. This paper also provides an example of a class of densities on ℝ^d, d ≥ 2, which has exponent q = 1. Later, Tsybakov [6, p. 956] used (2) for the analysis of a DLD method which is based on a localized version of the empirical excess mass approach. Surprisingly, (2) is satisfied if and only if P := Q ⊕_s μ has Tsybakov exponent q in the sense of [8], i.e.

 P_X({|2η − 1| ≤ t}) ≤ C' t^q (3)

for some constant C' > 0 and all sufficiently small t > 0 (see the proof of Theorem 2.7 for (2) ⇒ (3) and [9] for the other direction). Recall that recently (3) has played a crucial role for establishing learning rates faster than n^{−1/2} for ERM algorithms and SVMs (see e.g. [10] and [8]). Remarkably, it was already observed in [11] that the classification problem can be analyzed by methods originally developed for the DLD problem.
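Propositions 2.4 and 2.5 translate directly into a few lines of code. The following sketch is our own illustration, not part of the paper: it evaluates empirical versions of R_P(f) and E_P(f) on toy stand-ins for samples from Q and μ (with μ also replaced by an empirical measure) and checks the identity E_P(f) = 1 − (1 + ρ)R_P(f). The convention sign 0 := −1 from Section 2 is used; the sample sets, ρ, and the decision function f are arbitrary choices.

```python
import random

def emp_risk(f, T, Tp, rho):
    """Empirical version of R_P(f) from Proposition 2.4: errors on
    positives (samples of Q) weighted by 1/(1+rho), errors on
    negatives (samples of mu) weighted by rho/(1+rho).
    Since sign 0 := -1, f(x) <= 0 counts as a negative prediction."""
    pos_err = sum(1 for x in T if f(x) <= 0) / len(T)    # I(1, sign f(x))
    neg_err = sum(1 for z in Tp if f(z) > 0) / len(Tp)   # I(-1, sign f(z))
    return pos_err / (1 + rho) + rho * neg_err / (1 + rho)

def emp_excess_mass(f, T, Tp, rho):
    """Empirical excess mass: Q_T({f > 0}) - rho * mu_T'({f > 0})."""
    return (sum(1 for x in T if f(x) > 0) / len(T)
            - rho * sum(1 for z in Tp if f(z) > 0) / len(Tp))

random.seed(0)
rho = 0.5
T  = [random.gauss(0.0, 1.0) for _ in range(200)]   # stand-in for samples from Q
Tp = [random.uniform(-3, 3) for _ in range(300)]    # stand-in for samples from mu
f = lambda x: 1.0 - abs(x)                          # some fixed decision function

lhs = emp_excess_mass(f, T, Tp, rho)
rhs = 1 - (1 + rho) * emp_risk(f, T, Tp, rho)
assert abs(lhs - rhs) < 1e-12
```

The identity holds exactly for any f and any finite samples, which is why maximizing the empirical excess mass is the same as minimizing the weighted empirical classification risk.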
However, to the best of our knowledge the exact relation between the DLD problem and binary classification has not been presented yet. In particular, it has not yet been observed that this relation opens a non-heuristic way to use classification methods for the DLD problem, as we will demonstrate by example in the next section.

Let us now use the ρ-exponent to establish inequalities between S_P(.) and R_P(.):

Theorem 2.7 Let ρ > 0 and let μ and Q be probability measures on X such that Q has a density h with respect to μ. For s := 1/(1+ρ) we write P := Q ⊕_s μ. Then we have:

i) If h is bounded there is a c > 0 such that for all measurable f : X → ℝ we have

 R_P(f) − R_P ≤ c S_P(f).

ii) If h has ρ-exponent q there is a c > 0 such that for all measurable f : X → ℝ we have

 S_P(f) ≤ c (R_P(f) − R_P)^{q/(1+q)}.

Sketch of the proof: The first assertion directly follows from (1) and Proposition 2.2. For the second assertion we first show (2) ⇒ (3). To this end we observe that for 0 < t < 1/2 we have Q({|h − ρ| ≤ t}) ≤ (1 + ρ) μ({|h − ρ| ≤ t}). Thus there exists a C̃ > 0 such that P_X({|h − ρ| ≤ t}) ≤ C̃ t^q for all 0 < t < 1/2. Furthermore, 2η − 1 = (h − ρ)/(h + ρ) implies

 {|2η − 1| ≤ t} = { ρ (1 − t)/(1 + t) ≤ h ≤ ρ (1 + t)/(1 − t) },

whenever 0 < t < 1/2. Let us now define tl := 2t/(1+t) and tr := 2t/(1−t). This gives 1 − tl = (1−t)/(1+t) and 1 + tr = (1+t)/(1−t). Furthermore, we obviously also have tl ≤ tr. Therefore we find

 { ρ (1 − t)/(1 + t) ≤ h ≤ ρ (1 + t)/(1 − t) } ⊆ { |h − ρ| ≤ ρ tr },

which shows (3). Now the assertion follows from [10, Prop. 1].

3 A support vector machine for density level detection

One of the benefits of interpreting the DLD problem as a classification problem is that we can construct an SVM for the DLD problem. To this end let k : X × X → ℝ be a positive definite kernel with reproducing kernel Hilbert space (RKHS) H. Furthermore, let μ be a known probability measure on X and l : Y × ℝ → [0, ∞) be the hinge loss function, i.e. l(y, t) := max{0, 1 − yt}, y ∈ Y, t ∈ ℝ. Then for a training set T = (x1, . . . , xn) ∈ X^n, a regularization parameter λ > 0, and ρ > 0 our SVM for the DLD problem chooses a pair (f_{T,λ}, b_{T,λ}) ∈ H × ℝ which minimizes

 λ ‖f‖²_H + (1/((1+ρ)n)) Σ_{i=1}^n l(1, f(xi) + b) + (ρ/(1+ρ)) E_{x∼μ} l(−1, f(x) + b) (4)

in H × ℝ. The corresponding decision function of this SVM is f_{T,λ} + b_{T,λ} : X → ℝ. Although the measure μ is known, the expectation E_{x∼μ} l(−1, f(x) + b) can almost always only be numerically approximated using finitely many function evaluations of f. Unfortunately, since the hinge loss is not differentiable we do not know a deterministic method to choose these function evaluations efficiently. Therefore in the following we will use points T' := (z1, . . . , zm) which are randomly sampled from μ in order to approximate E_{x∼μ} l(−1, f(x) + b). We denote the corresponding approximate solution of (4) by (f_{T,T',λ}, b_{T,T',λ}). Since this modification of (4) is identical to the standard SVM formulation apart from the weighting factors in front of the empirical l-risk terms, we do not discuss algorithmic issues. However, note that this approach simultaneously addresses the original \"μ is known\" and the modified \"μ can be sampled from\" problems described in the introduction. Furthermore it is also closely related to some heuristic methods for anomaly detection that are based on artificial samples (see [9] for more information).

The fact that the SVM for DLD essentially coincides with the standard L1-SVM also allows us to modify many known results for these algorithms. For simplicity we only state results for the Gaussian RBF kernel with width 1/σ, i.e. k(x, x') = exp(−σ² ‖x − x'‖²₂), x, x' ∈ ℝ^d, and the case m = n. More general results can be found in [12, 9]. We begin with a consistency result with respect to the performance measure R_P(.). Recall that by Theorem 2.3 this is equivalent to consistency with respect to S_P(.):

Theorem 3.1 Let X ⊆ ℝ^d be compact and k be the Gaussian RBF kernel with width 1/σ on X.
Furthermore, let ρ > 0, and let μ and Q be distributions on X such that Q has a density h with respect to μ. For s := 1/(1+ρ) we write P := Q ⊕_s μ. Then for all positive sequences (λn) with λn → 0 and n λn^{1+δ} → ∞ for some δ > 0, and for all ε > 0, we have

 lim_{n→∞} (Q ⊗ μ)^n({ (T, T') ∈ (X × X)^n : R_P(f_{T,T',λn} + b_{T,T',λn}) > R_P + ε }) = 0.

Sketch of the proof: Let us introduce the shorthand ν := Q ⊗ μ for the product measure of Q and μ. Moreover, for a measurable function f : X → ℝ we define the function g ∘ f : X × X → ℝ by

 g ∘ f(x, x') := (1/(1+ρ)) l(1, f(x)) + (ρ/(1+ρ)) l(−1, f(x')), x, x' ∈ X.

Furthermore, we write l ∘ f(x, y) := l(y, f(x)), x ∈ X, y ∈ Y. Then it is easy to check that we always have E_ν g ∘ f = E_P l ∘ f. Analogously, we see E_{T⊗T'} g ∘ f = E_{T⊕_sT'} l ∘ f if T ⊗ T' denotes the product measure of the empirical measures based on T and T'. Now, using Hoeffding's inequality for ν it is easy to establish a concentration inequality in the sense of [13, Lem. III.5]. The rest of the proof is analogous to the steps in [13] since these steps are independent of the specific structure of the data-generating measure.

In general, we cannot obtain convergence rates in the above theorem without assuming specific conditions on h, μ, and ρ. We will now present such a condition which can be used to establish rates. To this end we write

 τ_x := d(x, {h > ρ}) if x ∈ {h < ρ}, and τ_x := d(x, {h < ρ}) if x ∈ {h ≥ ρ},

where d(x, A) denotes the Euclidean distance between x and a set A. Now we define:

Definition 3.2 Let μ be a distribution on X ⊆ ℝ^d and h : X → [0, ∞) be a measurable function with ∫ h dμ = 1, i.e. h is a density with respect to μ. For ρ > 0 and 0 < γ ≤ ∞ we say that h has geometric ρ-exponent γ if

 ∫_X τ_x^{−γd} |h − ρ| dμ < ∞.

Since {h > ρ} and {h ≤ ρ} are the classes which have to be discriminated when interpreting the DLD problem as a classification problem, it is easy to check by Proposition 2.2 that h has geometric ρ-exponent γ if and only if for P := Q ⊕_s μ we have (x ↦ τ_x^{−γd}) ∈ L_1(|2η − 1| dP_X).
The latter is a sufficient condition for P to have geometric noise exponent γ in the sense of [8]. We can now state our result on learning rates, which is proved in [12].

Theorem 3.3 Let X be the closed unit ball of the Euclidean space ℝ^d, and let μ and Q be distributions on X such that dQ = h dμ. For fixed ρ > 0 assume that the density h has both ρ-exponent 0 < q ≤ ∞ and geometric ρ-exponent 0 < γ < ∞. We define

 λn := n^{−(γ+1)/(2γ+1)} if γ ≤ (q+2)/(2q),
 λn := n^{−2(γ+1)(q+1)/(2γ(q+2)+3q+4)} otherwise,

and σn := λn^{−1/((γ+1)d)} in both cases. For s := 1/(1+ρ) we write P := Q ⊕_s μ. Then for all ε > 0 there exists a constant C > 0 such that for all x ≥ 1 and all n ≥ 1 the SVM using λn and the Gaussian RBF kernel with width 1/σn satisfies

 (Q ⊗ μ)^n({ (T, T') ∈ (X × X)^n : R_P(f_{T,T',λn} + b_{T,T',λn}) > R_P + C x² n^{−γ/(2γ+1)+ε} }) ≤ e^{−x}

if γ ≤ (q+2)/(2q), and

 (Q ⊗ μ)^n({ (T, T') ∈ (X × X)^n : R_P(f_{T,T',λn} + b_{T,T',λn}) > R_P + C x² n^{−2γ(q+1)/(2γ(q+2)+3q+4)+ε} }) ≤ e^{−x}

otherwise. If γ = ∞ the latter holds if σn = σ is a constant with σ > 2d.

Remark 3.4 With the help of Theorem 2.7 we immediately obtain rates with respect to the performance measure S_P(.). It turns out that these rates are very similar to those in [5] and [6] for the empirical excess mass approach.

4 Experiments

We present experimental results for anomaly detection problems where the set X is a subset of ℝ^d. Two SVM type learning algorithms are used to produce functions f which declare the set {x : f(x) < 0} anomalous. These algorithms are compared based on their risk R_P(f). The data in each problem is partitioned into three pairs of sets: the training sets (T, T'), the validation sets (V, V'), and the test sets (W, W'). The sets T, V and W contain samples drawn from Q and the sets T', V' and W' contain samples drawn from μ.
The training and validation sets are used to design f and the test sets are used to estimate its performance by computing an empirical version of R_P(f) that we denote R_{(W,W')}(f).

The first learning algorithm is the density level detection support vector machine (DLD-SVM) with Gaussian RBF kernel described in the previous section. With λ and σ² fixed and the expected value E_{x∼μ} l(−1, f(x) + b) in (4) replaced with an empirical estimate based on T', this formulation can be solved using, for example, the C-SVC option in the LIBSVM software [14] by setting C = 1 and setting the class weights to w₁ = 1/(λ|T|(1 + ρ)) and w₋₁ = ρ/(λ|T'|(1 + ρ)). The regularization parameters λ and σ² are chosen to (approximately) minimize the empirical risk R_{(V,V')}(f) on the validation sets. This is accomplished by employing a grid search over λ and a combined grid/iterative search over σ². In particular, for each fixed grid value of λ we seek a minimizer over σ² by evaluating the validation risk at a coarse grid of σ² values and then performing a Golden search over the interval defined by the two σ² values on either side of the coarse grid minimum. As the overall search proceeds the (λ, σ²) pair with the smallest validation risk is retained.

The second learning algorithm is the one-class support vector machine (1CLASS-SVM) introduced by Schölkopf et al. [15]. Due to its \"one-class\" nature this method does not use the set T' in the production of f. Again we employ the Gaussian RBF kernel with parameter σ². The one-class algorithm in Schölkopf et al. contains a parameter ν which controls the size of the set {x ∈ T : f(x) ≤ 0} (and therefore controls the measure Q(f ≤ 0) through generalization). To make a valid comparison with the DLD-SVM we determine ν automatically as a function of ρ.
In particular, both ν and σ² are chosen to (approximately) minimize the validation risk using the search procedure described above for the DLD-SVM, where the grid search over λ is replaced by a Golden search (over [0, 1]) for ν.

Data for the first experiment are generated as follows. Samples of the random variable x ∼ Q are generated by transforming samples of the random variable u that is uniformly distributed over [0, 1]^27. The transform is x = Au where A is a 10-by-27 matrix whose rows contain between m = 2 and m = 5 non-zero entries with value 1/m. Thus the support of Q is the hypercube [0, 1]^10 and Q is concentrated about its center. Partial overlap in the non-zero entries across the rows of A guarantees that the components of x are partially correlated. We chose μ to be the uniform distribution over [0, 1]^10. Data for the second experiment are identical to the first except that the vector (0, 0, 0, 0, 0, 0, 0, 0, 0, 1) is added to the samples of x with probability 0.5. This gives a bi-modal distribution Q, and since the support of the last component is extended to [0, 2] the corresponding component of μ is also extended to this range. The training and validation set sizes are |T| = 1000, |T'| = 2000, |V| = 500, and |V'| = 2000. The test set sizes |W| = 100,000 and |W'| = 100,000 are large enough to provide very accurate estimates of risk. The λ grid for the DLD-SVM method consists of 15 values ranging from 10^−7 to 1 and the coarse σ² grid for the DLD-SVM and 1CLASS-SVM methods consists of 9 values that range from 10^−3 to 10^2. The learning algorithms are applied for values of ρ ranging from 10^−2 to 10^2. Figure 1(a) plots the risk R_{(W,W')} versus ρ for the two learning algorithms. In both experiments the performance of DLD-SVM is superior to 1CLASS-SVM at smaller values of ρ. The difference in the bi-modal case is substantial.
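The sampling scheme for the first experiment is easy to reproduce. The sketch below is our own reading of the description above; since the exact sparsity pattern of A is not specified in the text, the randomly chosen row supports (and the helper names) are assumptions.

```python
import random

def make_mixing_matrix(rng, rows=10, cols=27):
    """A rows-by-cols matrix whose rows have between m = 2 and m = 5
    non-zero entries of value 1/m; overlapping row supports induce the
    partial correlations between components of x mentioned in the text."""
    A = []
    for _ in range(rows):
        m = rng.randint(2, 5)
        support = rng.sample(range(cols), m)
        row = [0.0] * cols
        for j in support:
            row[j] = 1.0 / m
        A.append(row)
    return A

def sample_Q(rng, A, n):
    """x = A u with u uniform on [0,1]^27; each coordinate of x is an
    average of m uniforms, so x lies in [0,1]^10 and concentrates
    about the center of the hypercube."""
    xs = []
    for _ in range(n):
        u = [rng.random() for _ in range(len(A[0]))]
        xs.append([sum(a * b for a, b in zip(row, u)) for row in A])
    return xs

rng = random.Random(0)
A = make_mixing_matrix(rng)
X = sample_Q(rng, A, 1000)
assert all(0.0 <= v <= 1.0 for x in X for v in x)
```

Samples from μ are then simply uniform draws over the same hypercube, and the bi-modal variant of the second experiment adds (0, ..., 0, 1) to each x with probability 0.5.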
Comparisons for larger sizes of |T| and |V| yield similar results, but at smaller sample sizes the superiority of DLD-SVM is even more pronounced. This is significant because ρ ≥ 1 appears to have little utility in the general anomaly detection problem, since it defines anomalies in regions where the concentration of Q is much larger than the concentration of μ, which is contrary to our premise that anomalies are not concentrated.

 [Figure 1: two panels plotting the risk R_{(W,W')} versus ρ for DLD-SVM and 1CLASS-SVM; (a) Experiments 1 & 2, (b) Cybersecurity Experiment.]

 Figure 1: Comparison of DLD-SVM and 1CLASS-SVM. The curves with extension -1 and -2 in Figure 1(a) correspond to experiments 1 and 2 respectively.

The third experiment considers a real world application in cybersecurity. The goal is to monitor the network traffic of a computer and determine when it exhibits anomalous behavior. The data for this experiment was collected from an active computer in a normal working environment over the course of 16 months. Twelve features were computed over each 1 hour time frame to give a total of 11664 12-dimensional feature vectors. These features are normalized to the range [0, 1] and treated as samples from Q. We chose μ to be the uniform distribution over [0, 1]^12. The training, validation and test set sizes are |T| = 4000, |T'| = 10000, |V| = 2000, |V'| = 100,000, |W| = 5664 and |W'| = 100,000. The λ grid for the DLD-SVM method consists of 7 values ranging from 10^−7 to 10^−1 and the coarse σ² grid for the DLD-SVM and 1CLASS-SVM methods consists of 6 values that range from 10^−3 to 10^2. The learning algorithms are applied for values of ρ ranging from 0.05 to 50.0.
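The Golden search used above for tuning σ² and ν is a standard golden-section line search. A minimal self-contained version might look as follows; this is our own sketch, with the validation-risk evaluation abstracted as an arbitrary unimodal objective.

```python
def golden_search(objective, lo, hi, tol=1e-6):
    """Golden-section search for a minimizer of a unimodal objective
    on [lo, hi], e.g. validation risk as a function of sigma^2 or nu.
    Each iteration shrinks the bracket by the factor 1/phi ~ 0.618
    while reusing one of the two interior evaluations."""
    invphi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    fc, fd = objective(c), objective(d)
    while b - a > tol:
        if fc < fd:            # minimizer lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = objective(c)
        else:                  # minimizer lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = objective(d)
    return (a + b) / 2

# e.g. a toy "validation risk" minimized at sigma2 = 0.3:
best = golden_search(lambda s: (s - 0.3) ** 2, 0.0, 1.0)
assert abs(best - 0.3) < 1e-4
```

In the experiments the objective is the empirical validation risk R_{(V,V')}, which is only piecewise constant in practice, so the search is the "(approximately) minimize" of the text rather than an exact minimization.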
Figure 1(b) plots the risk R_{(W,W')} versus ρ for the two learning algorithms. The performance of DLD-SVM is superior to 1CLASS-SVM at all values of ρ.

References

[1] B.D. Ripley. Pattern Recognition and Neural Networks. Cambridge Univ. Press, 1996.

[2] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, 2002.

[3] J.A. Hartigan. Clustering Algorithms. Wiley, New York, 1975.

[4] J.A. Hartigan. Estimation of a convex density contour in 2 dimensions. J. Amer. Statist. Assoc., 82:267–270, 1987.

[5] W. Polonik. Measuring mass concentrations and estimating density contour clusters - an excess mass approach. Ann. Statist., 23:855–881, 1995.

[6] A.B. Tsybakov. On nonparametric estimation of density level sets. Ann. Statist., 25:948–969, 1997.

[7] S. Ben-David and M. Lindenbaum. Learning distributions by their density levels: a paradigm for learning without a teacher. J. Comput. System Sci., 55:171–182, 1997.

[8] C. Scovel and I. Steinwart. Fast rates for support vector machines. Ann. Statist., submitted, 2003. http://www.c3.lanl.gov/~ingo/publications/ann-03.ps.

[9] I. Steinwart, D. Hush, and C. Scovel. A classification framework for anomaly detection. Technical report, Los Alamos National Laboratory, 2004.

[10] A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32:135–166, 2004.

[11] E. Mammen and A. Tsybakov. Smooth discrimination analysis. Ann. Statist., 27:1808–1829, 1999.

[12] C. Scovel, D. Hush, and I. Steinwart. Learning rates for support vector machines for density level detection. Technical report, Los Alamos National Laboratory, 2004.

[13] I. Steinwart. Consistency of support vector machines and other regularized kernel machines. IEEE Trans. Inform. Theory, to appear, 2005.

[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2004.

[15] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, and A.J. Smola. Estimating the support of a high-dimensional distribution. Neural Computation, 13:1443–1471, 2001.
", "award": [], "sourceid": 2609, "authors": [{"given_name": "Ingo", "family_name": "Steinwart", "institution": null}, {"given_name": "Don", "family_name": "Hush", "institution": null}, {"given_name": "Clint", "family_name": "Scovel", "institution": null}]}