{"title": "Consistency of one-class SVM and related algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1409, "page_last": 1416, "abstract": null, "full_text": "Consistency of one-class SVM and related algorithms\n\nRégis Vert, Laboratoire de Recherche en Informatique, Université Paris-Sud, 91405 Orsay Cedex, France; Masagroup, 24 Bd de l'Hôpital, 75005 Paris, France. Regis.Vert@lri.fr\n\nJean-Philippe Vert, Geostatistics Center, École des Mines de Paris - ParisTech, 77300 Fontainebleau, France. Jean-Philippe.Vert@ensmp.fr\n\nAbstract\nWe determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error, which is shown to converge to the Bayes risk, thereby proving the Bayes consistency of a variety of methods even though the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization cannot vanish by construction, and which is shown for the first time to be a consistent density level set estimator.\n\n1 Introduction\n\nGiven n i.i.d. copies (X1, Y1), . . . , (Xn, Yn) of a random variable (X, Y) ∈ ℝ^d × {−1, 1}, we study in this paper the limit and consistency of learning algorithms that solve the following problem:\n\narg min_{f ∈ H_σ} (1/n) Σ_{i=1}^n φ(Yi f(Xi)) + λ ‖f‖²_{H_σ}, (1)\n\nwhere φ : ℝ → ℝ is a convex loss function and H_σ is the reproducing kernel Hilbert space (RKHS) of the normalized Gaussian radial basis function kernel (denoted simply Gaussian kernel below):\n\nk_σ(x, x′) = (2π)^{−d/2} σ^{−d} exp(−‖x − x′‖²/(2σ²)), σ > 0. (2)
This framework encompasses in particular the classical support vector machine (SVM) [1] when φ(u) = max(1 − u, 0). Recent years have witnessed important theoretical advances aimed at understanding the behavior of such regularized algorithms when n tends to infinity and λ decreases to 0. In particular, the consistency and convergence rates of the two-class SVM (see, e.g., [2, 3, 4] and references therein) have been studied in detail, as well as the shape of the asymptotic decision function [5, 6]. All results published so far study the case where λ decreases as the number of points tends to infinity (or, equivalently, where λσ^{−d} converges to 0 if one uses the classical non-normalized version of the Gaussian kernel instead of (2)). Although it seems natural to reduce regularization as more and more training data become available (indeed, this is the spirit of regularization [7, 8]), there is at least one important situation where λ is typically held fixed: the one-class SVM [9]. In that case, the goal is to estimate a quantile, that is, a subset of the input space X of given probability with minimum volume. The estimation is performed by thresholding the function output by the one-class SVM, that is, the SVM (1) with only positive examples; in that case λ is supposed to determine the quantile level1. Although it is known that the fraction of examples in the selected region converges to the desired quantile level [9], it is still an open question whether the region itself converges towards a quantile, that is, a region of minimum volume. Besides, most theoretical results about the consistency and convergence rates of the two-class SVM with vanishing regularization constant do not translate to the one-class case, as we are precisely in the seldom-studied situation where the SVM is used with a regularization term that does not vanish as the sample size increases.
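To make the one-class setting concrete, here is a small illustrative sketch (our own, not the paper's implementation): it minimizes objective (1) with the hinge loss and all labels equal to +1 by plain subgradient descent on the coefficients of the finite kernel expansion f(x) = Σ_j c_j k_σ(x, X_j) given by the representer theorem. All function names, the step size and the iteration count are our own choices.

```python
import math
import random

# Illustrative sketch only: solve problem (1) for the one-class SVM
# (hinge loss, all Yi = +1) by subgradient descent on the coefficients
# of the representer-theorem expansion f(x) = sum_j c[j] * k_sigma(x, X[j]).
# Step size, iteration count and names are our own choices.

def k_sigma(x, y, sigma):
    # normalized Gaussian kernel (2) in dimension d = 1
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def objective(c, K, lam):
    # (1/n) sum_i hinge(f(X_i)) + lam * ||f||^2_{H_sigma}
    n = len(K)
    f = [sum(c[j] * K[i][j] for j in range(n)) for i in range(n)]
    hinge = sum(max(0.0, 1.0 - fi) for fi in f) / n
    reg = lam * sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))
    return hinge + reg

def one_class_svm(X, sigma, lam, lr=0.01, n_iter=500):
    n = len(X)
    K = [[k_sigma(X[i], X[j], sigma) for j in range(n)] for i in range(n)]
    c = [0.0] * n
    for _ in range(n_iter):
        f = [sum(c[j] * K[i][j] for j in range(n)) for i in range(n)]
        # gradient of the quadratic regularization term
        g = [2.0 * lam * sum(K[j][l] * c[l] for l in range(n)) for j in range(n)]
        for i in range(n):
            if f[i] < 1.0:  # subgradient of the active hinge terms
                for j in range(n):
                    g[j] -= K[i][j] / n
        c = [c[j] - lr * g[j] for j in range(n)]
    return c, K

random.seed(0)
X = [random.gauss(0.0, 1.0) for _ in range(30)]
c, K = one_class_svm(X, sigma=0.5, lam=0.5)
```

Thresholding the resulting f then selects a high-density region; the results below describe what this f converges to as n grows and σ shrinks while λ stays fixed.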
The main contribution of this paper is to show that Bayes consistency can be obtained for algorithms that solve (1) without decreasing λ, if instead the bandwidth σ of the Gaussian kernel decreases at a suitable rate. We prove upper bounds on the convergence rate of the classification error towards the Bayes risk for a variety of loss functions φ and of distributions P, in particular for the SVM (Theorem 6). Moreover, we provide an explicit description of the function asymptotically output by the algorithms, and establish convergence rates towards this limit in the L2 norm (Lemma 7). In particular, we show that the decision function output by the one-class SVM converges towards the density to be estimated, truncated at the level 2λ (Theorem 8); we finally show that this implies the consistency of the one-class SVM as a density level set estimator for the excess-mass functional [10] (Theorem 9). Due to lack of space we limit ourselves in this extended abstract to the statement of the main results (Section 2) and sketch the proof of the main theorem (Theorem 3), which underlies all other results, in Section 3. All detailed proofs are available in the companion paper [11].\n\n2 Notations and main results\n\nLet (X, Y) be a pair of random variables taking values in ℝ^d × {−1, 1}, with distribution P. We assume throughout this paper that the marginal distribution of X is absolutely continuous with respect to the Lebesgue measure, with a density ρ : ℝ^d → ℝ whose support is included in a compact set X ⊂ ℝ^d. We denote by η : ℝ^d → [0, 1] a measurable version of the conditional distribution of Y = 1 given X. The normalized Gaussian radial basis function (RBF) kernel k_σ with bandwidth parameter σ > 0 is defined for any (x, x′) ∈ ℝ^d × ℝ^d by\n\nk_σ(x, x′) = (2π)^{−d/2} σ^{−d} exp(−‖x − x′‖²/(2σ²)),\n\nand the corresponding reproducing kernel Hilbert space (RKHS) is denoted by H_σ.\n\n1 While the original formulation of the one-class SVM involves a parameter ν, there is asymptotically a one-to-one correspondence between ν and λ.
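As a quick numerical sanity check of this kernel definition (our own illustration, in d = 1), one can verify by quadrature that k_σ(·, x′) integrates to 1, and that its L2 norm is smaller than its H_σ norm, the latter being k_σ(x′, x′)^{1/2} by the reproducing property:

```python
import math

# Our own numeric sanity check of the normalized Gaussian kernel, d = 1.

def k_sigma(x, y, sigma):
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def trapezoid(g, lo, hi, n=20000):
    # simple composite trapezoid rule
    h = (hi - lo) / n
    return h * (0.5 * (g(lo) + g(hi)) + sum(g(lo + i * h) for i in range(1, n)))

sigma = 0.3
mass = trapezoid(lambda x: k_sigma(x, 0.0, sigma), -10.0, 10.0)        # ~ 1
l2_sq = trapezoid(lambda x: k_sigma(x, 0.0, sigma) ** 2, -10.0, 10.0)  # ||k(., 0)||^2 in L2
h_sq = k_sigma(0.0, 0.0, sigma)  # ||k(., 0)||^2 in H_sigma, by the reproducing property
```

The inequality l2_sq < h_sq observed here is an instance of the general fact, used repeatedly below, that the L2 norm is dominated by the H_σ norm.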
We denote by κ_σ = (2π)^{−d/2} σ^{−d} the normalizing constant that ensures that the kernel integrates to 1.\n\nDenoting by M the set of measurable real-valued functions on ℝ^d, we define several risks for functions f ∈ M. The classification error rate, usually referred to as the (true) risk of f, when Y is predicted by the sign of f(X), is denoted by\n\nR(f) = P(sign(f(X)) ≠ Y).\n\nFor a scalar λ > 0 fixed throughout this paper and a convex loss function φ : ℝ → ℝ, the φ-risk regularized by the RKHS norm is defined, for any σ > 0 and f ∈ H_σ, by\n\nR_{φ,σ}(f) = E_P[φ(Y f(X))] + λ ‖f‖²_{H_σ}.\n\nFurthermore, for any real r ≥ 0, we denote by L(r) the Lipschitz constant of the restriction of φ to the interval [−r, r]. For example, for the hinge loss φ(u) = max(0, 1 − u) one can take L(r) = 1, and for the squared hinge loss φ(u) = max(0, 1 − u)² one can take L(r) = 2(r + 1). Finally, the L2-norm regularized φ-risk is, for any f ∈ M,\n\nR_{φ,0}(f) = E_P[φ(Y f(X))] + λ ‖f‖²_{L2}, where ‖f‖²_{L2} = ∫_{ℝ^d} f(x)² dx ∈ [0, +∞].\n\nThe minima of the three risk functionals defined above over their respective domains are denoted by R*, R*_{φ,σ} and R*_{φ,0}, respectively. Each of these risks has an empirical counterpart in which the expectation with respect to P is replaced by an average over an i.i.d. sample T = {(X1, Y1), . . . , (Xn, Yn)}. In particular, the following empirical version of R_{φ,σ} will be used: for any σ > 0 and f ∈ H_σ,\n\nR̂_{φ,σ}(f) = (1/n) Σ_{i=1}^n φ(Yi f(Xi)) + λ ‖f‖²_{H_σ}.\n\nThe main focus of this paper is the analysis of learning algorithms that minimize the empirical φ-risk regularized by the RKHS norm, R̂_{φ,σ}, and of their limit as the number of points tends to infinity while the kernel width σ decreases to 0 at a suitable rate, λ being kept fixed. Roughly speaking, our main result shows that in this situation, if φ is a convex loss function, the minimization of R̂_{φ,σ} asymptotically amounts to minimizing R_{φ,0}.
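The pointwise structure of R_{φ,0} already hints at this limit: in the one-class setting (η ≡ 1) with the hinge loss, minimizing the integrand ρ(x)φ(α) + λα² over α ∈ ℝ has the closed form min(ρ(x)/(2λ), 1), i.e., the density scaled by 1/(2λ) and capped at 1. A small hypothetical sketch (our own, by grid search) checks this closed form numerically:

```python
import math

# Our own illustration: pointwise minimization of rho(x)*phi(alpha) + lam*alpha^2
# for the hinge loss phi in the one-class case (eta = 1).  The closed form
# min(rho(x) / (2 * lam), 1) matches the truncated density of Theorem 8 below.

def phi(u):
    # hinge loss
    return max(0.0, 1.0 - u)

def pointwise_min(rho_x, lam, lo=-2.0, hi=2.0, grid=20001):
    # brute-force grid search over alpha in [lo, hi]
    best_a, best_v = lo, math.inf
    for i in range(grid):
        a = lo + (hi - lo) * i / (grid - 1)
        v = rho_x * phi(a) + lam * a * a
        if v < best_v:
            best_a, best_v = a, v
    return best_a

lam = 0.5
results = [(pointwise_min(r, lam), min(r / (2 * lam), 1.0)) for r in (0.2, 0.8, 1.5)]
```

This pointwise picture is precisely what the limiting risk R_{φ,0} captures, once the RKHS norm has been replaced by the L2 norm.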
This stems from the fact that the empirical average term in the definition of R̂_{φ,σ} converges to its corresponding expectation, while the H_σ norm of a function f decreases to its L2 norm when σ decreases to zero. To turn this intuition into a rigorous statement, we need a few more assumptions about the minimizer of R_{φ,0} and about P. First, we observe that the minimizer of R_{φ,0} is indeed well-defined and can often be explicitly computed:\n\nLemma 1 For any x ∈ ℝ^d, let\n\nf_{φ,0}(x) = arg min_{α ∈ ℝ} { ρ(x) [η(x) φ(α) + (1 − η(x)) φ(−α)] + λα² }.\n\nThen f_{φ,0} is measurable and satisfies\n\nR_{φ,0}(f_{φ,0}) = inf_{f ∈ M} R_{φ,0}(f).\n\nSecond, we provide below a general result that shows how to control the excess R_{φ,0}-risk of the empirical minimizer of the R̂_{φ,σ}-risk, for which we need to recall the notion of modulus of continuity [12].\n\nDefinition 2 (Modulus of continuity) Let f be a Lebesgue measurable function from ℝ^d to ℝ. Its modulus of continuity in the L1 norm is defined for any δ ≥ 0 as\n\nω(f, δ) = sup_{0 ≤ ‖t‖ ≤ δ} ‖f(· + t) − f(·)‖_{L1}, (3)\n\nwhere ‖t‖ is the Euclidean norm of t ∈ ℝ^d.\n\nOur main result can now be stated as follows:\n\nTheorem 3 (Main Result) Let 1 > σ > 0, 0 < p ≤ 2, ε > 0, and let f̂_{φ,σ} denote a minimizer of the R̂_{φ,σ}-risk over H_σ. Assume that the marginal density ρ is bounded, and let M = sup_{x ∈ ℝ^d} ρ(x). Then there exist constants (K_i)_{i=1,...,4} (depending only on p, ε, λ, d, and M) such that, for any x > 0 and any σ₁ ∈ (σ, 1), the following holds with probability greater than 1 − e^{−x} over the draw of the training data:\n\nR_{φ,0}(f̂_{φ,σ}) − R*_{φ,0} ≤ K₁ L(√(φ(0)/λ))^{4/(2+p)} (1/(n σ^{[2+(2−p)(1+ε)]d}))^{2/(2+p)} + K₂ L(√(φ(0)/λ)) √(x/(n σ^{2d})) + K₃ σ²/σ₁² + K₄ ω(f_{φ,0}, σ₁). (4)\n\nThe first two terms on the r.h.s. of (4) bound the estimation error associated with the Gaussian RKHS, which naturally tends to be small when the number of training data increases and when the RKHS is 'small', i.e., when σ is large. As is usually the case in such variance/bias splittings, the variance term here depends on the dimension d of the input space.
Note that it is also parametrized by both p and ε. The third term measures the error incurred by penalizing the L2 norm of a fixed function of H_{σ₁} by its H_σ norm, with 0 < σ < σ₁. This is a price to pay to get a small estimation error. As for the fourth term, it is a bound on the approximation error of the Gaussian RKHS. Note that, once σ and λ have been fixed, σ₁ remains a free variable parameterizing the bound itself.\n\nIn order to highlight the type of convergence rates one can obtain from Theorem 3, let us assume that the loss function φ is Lipschitz on ℝ (e.g., take the hinge loss), and suppose that for some 0 ≤ α ≤ 1, c₁ > 0, and for any h ≥ 0, the function f_{φ,0} satisfies the inequality\n\nω(f_{φ,0}, h) ≤ c₁ h^α. (5)\n\nThen we can optimize the right-hand side of (4) with respect to σ₁, σ, p and ε by balancing the four terms. This eventually leads to\n\nR_{φ,0}(f̂_{φ,σ}) − R*_{φ,0} = O_P( n^{−2α/(4α+2(2+α)d) + ε} ) (6)\n\nfor any ε > 0. This rate is achieved by choosing\n\nσ₁ = n^{−2/(4α+2(2+α)d)}, (7)\n\nσ = σ₁^{(2+α)/2} = n^{−(2+α)/(4α+2(2+α)d)}, (8)\n\np = 2, and ε as small as possible (which is why an arbitrarily small quantity ε appears in the rate). Theorem 3 thus shows that minimizing the R̂_{φ,σ}-risk for a well-chosen width σ yields an algorithm that is consistent for the R_{φ,0}-risk. In order to relate this consistency to more traditional measures of performance of learning algorithms, the next theorem shows that, under a simple additional condition on φ, R_{φ,0}-risk consistency implies Bayes consistency:\n\nTheorem 4 If φ is convex, differentiable at 0, and satisfies φ′(0) < 0, then for every sequence of functions (f_i)_{i≥1} ∈ M,\n\nlim_{i→+∞} R_{φ,0}(f_i) = R*_{φ,0} implies lim_{i→+∞} R(f_i) = R*.\n\nThis theorem results from a more general quantitative analysis of the relationship between the excess R_{φ,0}-risk and the excess R-risk, in the spirit of [13]. In order to state a refined version in the particular case of the support vector machine algorithm, we first need the following definition:\n\nDefinition 5 We say that a distribution P, with marginal density ρ of X w.r.t.
Lebesgue measure, has a low density exponent γ ≥ 0 if there exists (c₂, ρ₀) ∈ (0, +∞)² such that, for all μ ∈ [0, ρ₀], P({x ∈ ℝ^d : ρ(x) ≤ μ}) ≤ c₂ μ^γ.\n\nWe are now in a position to state a quantitative relationship between the excess R_{φ,0}-risk and the excess R-risk in the case of support vector machines:\n\nTheorem 6 Let φ₁(u) = max(1 − u, 0) be the hinge loss function, and φ₂(u) = max(1 − u, 0)² be the squared hinge loss function. Then for any distribution P with low density exponent γ, there exist constants (K₁, K₂, r₁, r₂) ∈ (0, +∞)⁴ such that, for any f ∈ M with an excess R_{φ₁,0}-risk upper bounded by r₁, the following holds:\n\nR(f) − R* ≤ K₁ ( R_{φ₁,0}(f) − R*_{φ₁,0} )^{γ/(2γ+1)},\n\nand if the excess regularized R_{φ₂,0}-risk is upper bounded by r₂, the following holds:\n\nR(f) − R* ≤ K₂ ( R_{φ₂,0}(f) − R*_{φ₂,0} )^{γ/(2γ+1)}.\n\nThis result can be extended to any loss function through the introduction of variational arguments, in the spirit of [13]; we do not further explore this direction here, and the reader is invited to consult [11] for more details. Hence we have proved the consistency of the SVM, together with upper bounds on the convergence rates, in a situation where the effect of regularization does not vanish asymptotically. Another consequence of the R_{φ,0}-consistency of an algorithm is the L2 convergence of the function output by the algorithm to the minimizer of the R_{φ,0}-risk:\n\nLemma 7 For any f ∈ M, the following holds:\n\n‖f − f_{φ,0}‖²_{L2} ≤ (1/λ) ( R_{φ,0}(f) − R*_{φ,0} ).\n\nThis result is particularly relevant for the study of algorithms whose objective is not binary classification. Consider for example the one-class SVM algorithm, which served as the initial motivation for this paper. Then we claim the following:\n\nTheorem 8 Let ρ̄ denote the density truncated at level 2λ, as follows:\n\nρ̄(x) = ρ(x)/(2λ) if ρ(x) ≤ 2λ, and ρ̄(x) = 1 otherwise. (9)\n\nLet f̂ denote the function output by the one-class SVM, that is, the function that solves (1) when φ is the hinge loss function and Yi = 1 for all i ∈ {1, . . . , n}.
Then, under the general conditions of Theorem 3, for σ chosen as in Equation (8),\n\nlim_{n→+∞} ‖f̂ − ρ̄‖_{L2} = 0.\n\nAn interesting by-product of this theorem is the consistency of the one-class SVM algorithm for density level set estimation:\n\nTheorem 9 Let 0 < μ < 2λ < M, let C_μ be the level set of the density function ρ at level μ, and let Ĉ_μ be the level set of 2λf̂ at level μ, where f̂ is still the function output by the one-class SVM. For any distribution Q and any subset C of ℝ^d, define the excess mass of C with respect to Q as\n\nH_Q(C) = Q(C) − μ Leb(C), (10)\n\nwhere Leb is the Lebesgue measure. Then, under the general assumptions of Theorem 3 and for σ chosen as in Equation (8), we have\n\nlim_{n→+∞} H_P(Ĉ_μ) − H_P(C_μ) = 0 in probability. (11)\n\nThe excess-mass functional was first introduced in [10] to assess the quality of density level set estimators. It is maximized by the true density level set C_μ and acts as a risk functional in the one-class framework. The proof of Theorem 9 is based on the following result: if ρ̂ is a density estimator converging to the true density ρ in the L2 sense, then for any fixed 0 < μ < sup{ρ}, the excess mass of the level set of ρ̂ at level μ converges to the excess mass of C_μ. In other words, as is the case in the classification framework, plug-in estimators built on L2-consistent density estimators are consistent with respect to the excess mass.\n\n3 Proof of Theorem 3 (sketch)\n\nIn this section we sketch the proof of the main learning theorem of this contribution, which underlies most other results stated in Section 2. The proof of Theorem 3 is based on the following decomposition of the excess R_{φ,0}-risk of the minimizer f̂_{φ,σ} of R̂_{φ,σ}, valid for any 0 < σ < √2 σ₁ and any sample (x_i, y_i)_{i=1,...,n}:\n\nR_{φ,0}(f̂_{φ,σ}) − R*_{φ,0} = [R_{φ,0}(f̂_{φ,σ}) − R_{φ,σ}(f̂_{φ,σ})] + [R_{φ,σ}(f̂_{φ,σ}) − R*_{φ,σ}] + [R*_{φ,σ} − R_{φ,σ}(k_{σ₁} ∗ f_{φ,0})] + [R_{φ,σ}(k_{σ₁} ∗ f_{φ,0}) − R_{φ,0}(k_{σ₁} ∗ f_{φ,0})] + [R_{φ,0}(k_{σ₁} ∗ f_{φ,0}) − R*_{φ,0}]. (12)\n\nIt can be shown that k_{σ₁} ∗ f_{φ,0} ∈ H_{√2 σ₁} ⊂ H_σ ∩ L2(ℝ^d), which justifies the introduction of R_{φ,σ}(k_{σ₁} ∗ f_{φ,0}) and R_{φ,0}(k_{σ₁} ∗ f_{φ,0}). By studying the relationship between the Gaussian RKHS norm and the L2 norm, it can be shown that ‖f̂_{φ,σ}‖_{L2} ≤ ‖f̂_{φ,σ}‖_{H_σ}, so that the first term satisfies\n\nR_{φ,0}(f̂_{φ,σ}) − R_{φ,σ}(f̂_{φ,σ}) = λ ( ‖f̂_{φ,σ}‖²_{L2} − ‖f̂_{φ,σ}‖²_{H_σ} ) ≤ 0,\n\nwhile the following stems from the definition of R*_{φ,σ}:\n\nR*_{φ,σ} − R_{φ,σ}(k_{σ₁} ∗ f_{φ,0}) ≤ 0.\n\nHence, controlling R_{φ,0}(f̂_{φ,σ}) − R*_{φ,0} boils down to controlling each of the remaining three terms in (12).\n\nThe second term in (12) is usually referred to as the sample error or estimation error. The control of such quantities has been the topic of much recent research, including for example [14, 15, 16, 17, 18, 4]. Using estimates of local Rademacher complexities through covering numbers for the Gaussian RKHS due to [4], the following result can be shown:\n\nLemma 10 For any σ > 0 small enough, let f̂_{φ,σ} be the minimizer of the R̂_{φ,σ}-risk on a sample of size n, where φ is a convex loss function. For any 0 < p ≤ 2, ε > 0, and x ≥ 1, the following holds with probability at least 1 − e^{−x} over the draw of the sample:\n\nR_{φ,σ}(f̂_{φ,σ}) − R*_{φ,σ} ≤ K₁ L(√(φ(0)/λ))^{4/(2+p)} (1/(n σ^{[2+(2−p)(1+ε)]d}))^{2/(2+p)} + K₂ L(√(φ(0)/λ)) √(x/(n σ^{2d})),\n\nwhere K₁ and K₂ are positive constants depending neither on σ nor on n.\n\nIn order to upper bound the fourth term in (12), the analysis of the convergence of the Gaussian RKHS norm towards the L2 norm when the bandwidth of the kernel tends to 0 leads to\n\nR_{φ,σ}(k_{σ₁} ∗ f_{φ,0}) − R_{φ,0}(k_{σ₁} ∗ f_{φ,0}) = λ ( ‖k_{σ₁} ∗ f_{φ,0}‖²_{H_σ} − ‖k_{σ₁} ∗ f_{φ,0}‖²_{L2} ) ≤ λ (σ²/(2σ₁²)) ‖f_{φ,0}‖²_{L2} ≤ φ(0) σ²/(2σ₁²).\n\nThe fifth term in (12) corresponds to the approximation error. It can be shown that for any bounded function f ∈ L1(ℝ^d) and all σ > 0,\n\n‖k_σ ∗ f − f‖_{L1} ≤ (1 + √d) ω(f, σ), (13)\n\nwhere ω(f, ·)
denotes the modulus of continuity of f in the L1 norm. From this, the following inequality can be derived:\n\nR_{φ,0}(k_{σ₁} ∗ f_{φ,0}) − R_{φ,0}(f_{φ,0}) ≤ ( 2λ ‖f_{φ,0}‖_{L∞} + L(‖f_{φ,0}‖_{L∞}) M ) (1 + √d) ω(f_{φ,0}, σ₁).\n\n4 Conclusion\n\nWe have shown that consistency of learning algorithms that minimize a regularized empirical risk can be obtained even when the so-called regularization term does not asymptotically vanish, and we have derived the consistency of the one-class SVM as a density level set estimator. Our method of proof is based on an unusual decomposition of the excess risk, made necessary by the presence of the regularization term, which plays an important role in determining the asymptotic limit of the function that minimizes the empirical risk. Although the upper bounds on the convergence rates we obtain are not optimal, they provide a first step toward the analysis of learning algorithms in this context.\n\nAcknowledgments\n\nThe authors are grateful to Stéphane Boucheron, Pascal Massart and Ingo Steinwart for fruitful discussions. This work was supported by the ACI \"Nouvelles interfaces des Mathématiques\" of the French Ministry for Research, and by the IST Program of the European Community, under the Pascal Network of Excellence, IST-2002-506778.\n\nReferences\n\n[1] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152. ACM Press, 1992.\n\n[2] I. Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768-791, 2002.\n\n[3] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat., 32:56-134, 2004.\n\n[4] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Technical report, Los Alamos National Laboratory, 2004. Submitted to Annals of Statistics.\n\n[5] I. Steinwart. Sparseness of support vector machines. J. Mach. Learn. Res., 4:1071-1105, 2003.
[6] P. L. Bartlett and A. Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. In Lecture Notes in Computer Science, volume 3120, pages 564-578. Springer, 2004.\n\n[7] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. W. H. Winston, Washington, D.C., 1977.\n\n[8] B. W. Silverman. On the estimation of a probability density function by the maximum penalized likelihood method. Ann. Stat., 10:795-810, 1982.\n\n[9] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Comput., 13:1443-1471, 2001.\n\n[10] J. A. Hartigan. Estimation of a convex density contour in two dimensions. J. Amer. Statist. Assoc., 82(397):267-270, 1987.\n\n[11] R. Vert and J.-P. Vert. Consistency and convergence rates of one-class SVM and related algorithms. J. Mach. Learn. Res., 2006. To appear.\n\n[12] R. A. DeVore and G. G. Lorentz. Constructive Approximation. Springer Grundlehren der Mathematischen Wissenschaften. Springer Verlag, 1993.\n\n[13] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification and risk bounds. Technical Report 638, UC Berkeley Statistics, 2003.\n\n[14] A. B. Tsybakov. On nonparametric estimation of density level sets. Ann. Stat., 25:948-969, June 1997.\n\n[15] E. Mammen and A. Tsybakov. Smooth discrimination analysis. Ann. Stat., 27(6):1808-1829, 1999.\n\n[16] P. Massart. Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse, IX(2):245-303, 2000.\n\n[17] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 2005. To appear.\n\n[18] V. Koltchinskii. Localized Rademacher complexities. Manuscript, September 2003.\n", "award": [], "sourceid": 2756, "authors": [{"given_name": "R\u00e9gis", "family_name": "Vert", "institution": null}, {"given_name": "Jean-philippe", "family_name": "Vert", "institution": null}]}