{"title": "Fast Rates to Bayes for Kernel Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 1345, "page_last": 1352, "abstract": null, "full_text": " Fast Rates to Bayes for Kernel Methods\n\n\n\n Ingo Steinwart and Clint Scovel\n Modeling, Algorithms and Informatics Group, CCS-3\n Los Alamos National Laboratory\n {ingo,jcs}@lanl.gov\n\n\n\n\n Abstract\n\n We establish learning rates to the Bayes risk for support vector machines\n (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF\n kernels we propose a geometric condition for distributions which can be\n used to determine approximation properties of these kernels. Finally, we\n compare our methods with a recent paper of G. Blanchard et al..\n\n\n1 Introduction\n\nIn recent years support vector machines (SVM's) have been the subject of many theoretical\nconsiderations. In particular, it was recently shown ([1], [2], and [3]) that SVM's can learn\nfor all data-generating distributions. However, these results are purely asymptotic, i.e. no\nperformance guarantees can be given in terms of the number n of samples. In this paper\nwe will establish such guarantees. Since by the no-free-lunch theorem of Devroye (see [4])\nperformance guarantees are impossible without assumptions on the data-generating distri-\nbution we will restrict our considerations to specific classes of distributions. In particular,\nwe will present a geometric condition which describes how distributions behave close to\nthe decision boundary. This condition is then used to establish learning rates for SVM's.\nTo obtain learning rates faster than n-1/2 we also employ a noise condition of Tsybakov\n(see [5]). Combining both concepts we are in particular able to describe distributions such\nthat SVM's with Gaussian kernel learn almost linearly, i.e. 
with rate n^{-1+ε} for all ε > 0, even though the Bayes classifier cannot be represented by the SVM.

Let us now formally introduce the statistical classification problem. To this end assume that X is a set. We write Y := {-1, 1}. Given a training set T = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n the classification task is to predict the label y of a new sample (x, y). In the standard batch model it is assumed that T is i.i.d. according to an unknown probability measure P on X × Y. Furthermore, the new sample (x, y) is drawn from P independently of T. Given a classifier C that assigns to every training set T a measurable function f_T : X → R, the prediction of C for y is sign f_T(x), where we choose a fixed definition of sign(0) ∈ {-1, 1}. In order to \"learn\" from the samples of T the decision function f_T should guarantee a small probability for the misclassification of the example (x, y). To make this precise the risk of a measurable function f : X → R is defined by

 R_P(f) := P({(x, y) : sign f(x) ≠ y}).

The smallest achievable risk R_P := inf{R_P(f) | f : X → R measurable} is called the Bayes risk of P. A function f_P : X → Y attaining this risk is called a Bayes decision function. Obviously, a good classifier should produce decision functions whose risks are close to the Bayes risk. This leads to the definition: a classifier is called universally consistent if

 E_{T∼P^n} R_P(f_T) - R_P → 0 (1)

holds for all probability measures P on X × Y. The next naturally arising question is whether there are classifiers which guarantee a specific rate of convergence in (1) for all distributions. Unfortunately, this is impossible by the so-called no-free-lunch theorem of Devroye (see [4, Thm. 7.2]). However, if one restricts considerations to certain smaller classes of distributions such rates exist for various classifiers, e.g.:

 - Assuming that the conditional probability η(x) := P(y = 1 | x) satisfies certain smoothness assumptions, Yang showed in [6] that some plug-in rules (cf.
[4]) achieve rates for (1) which are of the form n^{-α} for some 0 < α < 1/2 depending on the assumed smoothness. He also showed that these rates are optimal in the sense that no classifier can obtain faster rates under the proposed smoothness assumptions.

 - It is well known (see [4, Sec. 18.1]) that using structural risk minimization over a sequence of hypothesis classes with finite VC-dimension, every distribution which has a Bayes decision function in one of the hypothesis classes can be learned with rate n^{-1/2}.

 - Let P be a noise-free distribution, i.e. R_P = 0, and F be a class with finite VC-dimension. If F contains a Bayes decision function then up to a logarithmic factor the convergence rate of the ERM classifier over F is n^{-1} (see [4, Sec. 12.7]).

Restricting the class of distributions for classification always raises the question of whether it is likely that these restrictions are met in real world problems. Of course, in general this question cannot be answered. However, experience shows that the assumption that the distribution is noise-free is almost never satisfied. Furthermore, it is rather unrealistic to assume that a Bayes decision function can be represented by the algorithm. Finally, assuming that the conditional probability is smooth, say k-times continuously differentiable, seems to be unjustifiable for many real world classification problems. We conclude that the above listed rates are established for situations which are rarely met in practice.

Considering the ERM classifier and hypothesis classes F containing a Bayes decision function there is a large gap in the rates for noise-free and noisy distributions. In [5] Tsybakov proposed a condition on the noise which describes intermediate situations. In order to present this condition we write η(x) := P(y = 1 | x), x ∈ X, for the conditional probability and P_X for the marginal distribution of P on X.
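To make the definitions of risk and Bayes risk concrete before turning to the noise condition, they can be approximated by Monte Carlo for a simple one-dimensional toy distribution; the distribution, sample size, and competing classifier below are our own illustrative choices, not taken from the paper:

```python
import numpy as np

# Monte Carlo illustration of the risk and the Bayes risk (toy setup):
# X = [-1, 1], P_X uniform, eta(x) = P(y = 1 | x) = (1 + x) / 2, so the
# Bayes decision function is f_P(x) = sign(2 * eta(x) - 1) = sign(x).
rng = np.random.default_rng(0)
n = 200_000

x = rng.uniform(-1.0, 1.0, size=n)
eta = (1.0 + x) / 2.0
y = np.where(rng.uniform(size=n) < eta, 1, -1)

# empirical risks of the Bayes rule sign(x) and of a shifted rule sign(x - 0.3)
risk_bayes_rule = np.mean(np.sign(x) != y)
risk_shifted = np.mean(np.sign(x - 0.3) != y)

# the Bayes risk is E[min(eta, 1 - eta)], which equals 1/4 for this eta
bayes_risk = np.mean(np.minimum(eta, 1.0 - eta))
print(risk_bayes_rule, risk_shifted, bayes_risk)
```

For this η the Bayes risk is exactly 1/4; the empirical risk of the Bayes rule concentrates around it, while the shifted rule pays a visible excess risk, as any non-Bayes rule must.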
Now, the noise in the labels can be described by the function |2η - 1|. Indeed, in regions where this function is close to 1 there is only a small amount of noise, whereas function values close to 0 only occur in regions with a high noise. We will use the following modified version of Tsybakov's noise condition which describes the size of the latter regions:

Definition 1.1 Let 0 ≤ q ≤ ∞ and P be a distribution on X × Y. We say that P has Tsybakov noise exponent q if there exists a constant C > 0 such that for all sufficiently small t > 0 we have

 P_X(|2η - 1| ≤ t) ≤ C t^q. (2)

All distributions have at least noise exponent 0. In the other extreme case q = ∞ the conditional probability is bounded away from 1/2. In particular this means that noise-free distributions have exponent q = ∞. Finally note that Tsybakov's original noise condition assumed P_X(f ≠ f_P) ≤ c (R_P(f) - R_P)^{q/(1+q)} for all f : X → Y, which is satisfied if e.g. (2) holds (see [5, Prop. 1]).

In [5] Tsybakov showed that if P has a noise exponent q then ERM-type classifiers can obtain rates in (1) which are of the form n^{-(q+1)/(q+pq+2)}, where 0 < p < 1 measures the complexity of the hypothesis class. In particular, rates faster than n^{-1/2} are possible whenever q > 0 and p < 1. Unfortunately, the ERM classifier he considered is usually hard to implement and in general there exists no efficient algorithm. Furthermore, his classifier requires substantial knowledge on how to approximate the Bayes decision rules of the considered distributions. Of course, such knowledge is rarely present in practice.

2 Results

In this paper we will use the Tsybakov noise exponent to establish rates for SVMs which are very similar to the above rates of Tsybakov. We begin by recalling the definition of SVMs. To this end let H be a reproducing kernel Hilbert space (RKHS) of a kernel k : X × X → R, i.e.
H is a Hilbert space consisting of functions from X to R such that the evaluation functionals are continuous, and k is symmetric and positive definite (see e.g. [7]). Throughout this paper we assume that X is a compact metric space and that k is continuous, i.e. H contains only continuous functions. In order to avoid cumbersome notations we additionally assume ‖k‖_∞ ≤ 1. Now given a regularization parameter λ > 0 the decision function of an SVM is

 (f_{T,λ}, b_{T,λ}) := arg min_{f ∈ H, b ∈ R} λ‖f‖²_H + (1/n) Σ_{i=1}^n l(y_i (f(x_i) + b)), (3)

where l(t) := max{0, 1 - t} is the so-called hinge loss. Unfortunately, only a few results on learning rates for SVMs are known: In [8] it was shown that SVMs can learn with linear rate if the distribution is noise-free and the two classes can be strictly separated by the RKHS. For RKHSs which are dense in the space C(X) of continuous functions the latter condition is satisfied if the two classes have strictly positive distance in the input space. Of course, these assumptions are far too strong for almost all real-world problems. Furthermore, Wu and Zhou (see [9]) recently established rates under the assumption that η is contained in a Sobolev space. In particular, they proved rates of the form (log n)^{-p} for some p > 0 if the SVM uses a Gaussian kernel. Obviously, these rates are much too slow to be of practical interest, and the difficulties with smoothness assumptions have already been discussed above.

For our first result, which is much stronger than the above mentioned results, we need to introduce two concepts both of which deal with the involved RKHS. The first concept describes how well a given RKHS H can approximate a distribution P. In order to introduce it we define the l-risk of a function f : X → R by R_{l,P}(f) := E_{(x,y)∼P} l(y f(x)). The smallest possible l-risk is denoted by R_{l,P} := inf{R_{l,P}(f) | f : X → R}. Furthermore, we define the approximation error function by

 a(λ) := inf_{f ∈ H} λ‖f‖²_H + R_{l,P}(f) - R_{l,P}, λ ≥ 0. (4)
The function a(·) quantifies how well an infinite-sample SVM with RKHS H approximates the minimal l-risk (note that we omit the offset b in the above definition for simplicity). If H is dense in the space of continuous functions C(X) then for all P we have a(λ) → 0 as λ → 0 (see [3]). However, in non-trivial situations no rate of convergence which holds uniformly for all distributions P is possible. The following definition characterizes distributions which guarantee certain polynomial rates:

Definition 2.1 Let H be a RKHS over X and P be a distribution on X × Y. Then H approximates P with exponent β ∈ (0, 1] if there is a C > 0 such that for all λ > 0:

 a(λ) ≤ C λ^β.

It can be shown (see [10]) that the extremal case β = 1 is equivalent to the fact that the minimal l-risk can be achieved by an element of H. Because of the specific structure of the approximation error function, values β > 1 are only possible for distributions with η = 1/2. Finally, we need a complexity measure for RKHSs. To this end let A ⊂ E be a subset of a Banach space E. Then the covering numbers of A are defined by

 N(A, ε, E) := min{n ≥ 1 : ∃ x_1, ..., x_n ∈ E with A ⊂ ∪_{i=1}^n (x_i + ε B_E)}, ε > 0,

where B_E denotes the closed unit ball of E. Now our complexity measure is:

Definition 2.2 Let H be a RKHS over X and B_H its closed unit ball. Then H has complexity exponent 0 < p ≤ 2 if there is an a_p > 0 such that for all ε > 0 we have

 log N(B_H, ε, C(X)) ≤ a_p ε^{-p}.

Note that in [10] the complexity exponent was defined in terms of N(B_H, ε, L_2(T_X)), where L_2(T_X) is the L_2-space with respect to the empirical measure of (x_1, ..., x_n). Since we always have N(B_H, ε, L_2(T_X)) ≤ N(B_H, ε, C(X)), Definition 2.2 is stronger than the one in [10]. Here we only use Definition 2.2 since it enables us to compare our results with [11]. However, all results remain true if one uses the original definition of [10].

For many RKHSs bounds on the complexity exponents are known (see e.g.
[3] and [10]). Furthermore, many SVMs use a parameterized family of RKHSs. For such SVMs the constant a_p may play a crucial role. We will see below that this is in particular true for SVMs using a family of Gaussian RBF kernels. Let us now formulate our first result on rates:

Theorem 2.3 Let H be a RKHS of a continuous kernel on X with complexity exponent 0 < p < 2, and let P be a probability measure on X × Y with Tsybakov noise exponent 0 < q ≤ ∞. Furthermore, assume that H approximates P with exponent 0 < β ≤ 1. We define λ_n := n^{-4(q+1)/((2q+pq+4)(1+β))}. Then for all ε > 0 there is a constant C > 0 such that for all x ≥ 1 and all n ≥ 1 we have

 Pr(T ∈ (X × Y)^n : R_P(f_{T,λ_n} + b_{T,λ_n}) ≤ R_P + C x² n^{-4β(q+1)/((2q+pq+4)(1+β)) + ε}) ≥ 1 - e^{-x}.

Here Pr denotes the outer probability of P^n in order to avoid measurability considerations.

Remark 2.4 With a tail bound of the form of Theorem 2.3 one can easily get rates for (1). In the case of Theorem 2.3 these rates have the form n^{-4β(q+1)/((2q+pq+4)(1+β)) + ε} for all ε > 0.

Remark 2.5 For brevity's sake our major aim was to show the best possible rates using our techniques. Therefore, Theorem 2.3 states rates for the SVM under the assumption that (λ_n) is optimally chosen. However, we emphasize that the techniques of [10] also give rates if (λ_n) is chosen in a different (and thus sub-optimal) way. This is also true for our results on SVMs using Gaussian kernels which we will establish below.

Remark 2.6 In [5] it is assumed that a Bayes classifier is contained in the function class the algorithm minimizes over. This assumption corresponds to a perfect approximation of P by H, i.e. β = 1. In this case our rate is (essentially) of the form n^{-2(q+1)/(2q+pq+4)}. If we rescale the complexity exponent p from (0, 2) to (0, 1) and write p' for the new complexity exponent, this rate becomes n^{-(q+1)/(q+p'q+2)}.
This is exactly the form of Tsybakov's result in [5]. However, as far as we know our complexity measure cannot be compared to Tsybakov's.

Remark 2.7 By the nature of Theorem 2.3 it suffices that P satisfies Tsybakov's noise assumption for every q' < q. It also suffices to suppose that H approximates P with exponent β' for all β' < β, and that H has complexity exponent p' for all p' > p. Now, it is shown in [10] that the RKHS H has an approximation exponent β = 1 if and only if H contains a minimizer of the l-risk. In particular, if H has approximation exponent β for all β < 1 but not for β = 1, then H does not contain such a minimizer but Theorem 2.3 gives the same result as for β = 1. If in addition the RKHS consists of C^∞ functions we can choose p arbitrarily close to 0, and hence we can obtain rates up to n^{-1} even though H does not contain a minimizer of the l-risk, that means e.g. a Bayes decision function.

In view of Theorem 2.3 and the remarks concerning covering numbers it is often only necessary to estimate the approximation exponent. In particular this seems to be true for the most popular kernel, that is the Gaussian RBF kernel k_σ(x, x') = exp(-σ²‖x - x'‖²₂), x, x' ∈ X, on (compact) subsets X of R^d with width 1/σ. However, to our best knowledge no non-trivial condition on η or f_P = sign(2η - 1) which ensures an approximation exponent β > 0 for fixed width has been established, and [12] shows that Gaussian kernels poorly approximate smooth functions. Hence plug-in rules based on Gaussian kernels may perform poorly under smoothness assumptions on η. In particular, many types of SVMs using other loss functions are plug-in rules and therefore their approximation properties under smoothness assumptions on η may be poor if a Gaussian kernel is used. However, our SVMs are not plug-in rules since their decision functions approximate the Bayes decision function (see [13]).
Intuitively, we therefore only need a condition that measures the cost of approximating the \"bump\" of the Bayes decision function at the \"decision boundary\". We will now establish such a condition for Gaussian RBF kernels with varying widths 1/σ_n. To this end let X_{-1} := {x ∈ X : η(x) < 1/2} and X_1 := {x ∈ X : η(x) > 1/2}. Recall that these two sets are the classes which have to be learned. Since we are only interested in distributions P having a Tsybakov exponent q > 0, we always assume that X = X_{-1} ∪ X_1 holds P_X-almost surely. Now we define

 τ_x := d(x, X_1) if x ∈ X_{-1};  τ_x := d(x, X_{-1}) if x ∈ X_1;  τ_x := 0 otherwise. (5)

Here, d(x, A) denotes the distance of x to a set A with respect to the Euclidean norm. Note that roughly speaking τ_x measures the distance of x to the \"decision boundary\". With the help of this function we can define the following geometric condition for distributions:

Definition 2.8 Let X ⊂ R^d be compact and P be a probability measure on X × Y. We say that P has geometric noise exponent α ∈ (0, ∞] if we have

 ∫_X τ_x^{-αd} |2η(x) - 1| P_X(dx) < ∞. (6)

Furthermore, P has geometric noise exponent ∞ if (6) holds for all α > 0.

In the above definition we make neither any kind of smoothness assumption, nor do we assume a condition on P_X in terms of absolute continuity with respect to the Lebesgue measure. Instead, the integral condition (6) describes the concentration of the measure |2η - 1| dP_X near the decision boundary. The less this measure is concentrated in this region, the larger the geometric noise exponent can be chosen. In particular, we have τ_x^{-1} ∈ L_∞(|2η - 1| dP_X) if and only if the two classes X_{-1} and X_1 have strictly positive distance! If (6) holds for some 0 < α < ∞ then the two classes may \"touch\", i.e. the decision boundary between X_{-1} and X_1 may be nonempty. Using this interpretation we can easily construct distributions which have geometric noise exponent α and touching classes.
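As a concrete sanity check of this condition, consider a toy distribution of our own choosing (not one from the paper): X = [-1, 1] ⊂ R, so d = 1, with P_X uniform and η(x) = (1 + x)/2. Then |2η(x) - 1| = |x| and the distance to the opposite class is τ_x = |x|, so the integrand in (6) is |x|^{1-α} and the condition holds exactly for α < 2, even though the two classes touch at x = 0. A numerical version of this computation:

```python
import numpy as np

# Toy check of the geometric noise condition: X = [-1, 1], d = 1,
# P_X uniform, eta(x) = (1 + x) / 2.  Then |2 * eta(x) - 1| = |x| and
# tau_x = |x|, so the integrand is |x| ** (1 - alpha): integrable over
# [-1, 1] exactly when alpha < 2.

def truncated_integral(alpha, eps, m=200_000):
    # midpoint rule for the integral of x ** (1 - alpha) over [eps, 1];
    # by symmetry (and the uniform density 1/2) this equals the integral
    # restricted to the region tau_x >= eps
    h = (1.0 - eps) / m
    x = eps + h * (np.arange(m) + 0.5)
    return np.sum(x ** (1.0 - alpha)) * h

# alpha = 1.5 < 2: truncated integrals approach 1 / (2 - alpha) = 2
conv = [truncated_integral(1.5, eps) for eps in (1e-2, 1e-4, 1e-6)]
# alpha = 2.5 > 2: truncated integrals blow up as eps -> 0
div = [truncated_integral(2.5, eps) for eps in (1e-2, 1e-4, 1e-6)]
print(conv, div)
```

The truncated integrals stabilize near 2 for α = 1.5 and grow without bound for α = 2.5, matching the analytic threshold α < 2: this P has geometric noise exponent α for every α < 2 despite the touching classes.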
In general, for these distributions there is no Bayes classifier in the RKHS H_σ of k_σ for any σ > 0.

Example 2.9 We say that η is Hölder about 1/2 with exponent γ > 0 on X ⊂ R^d if there is a constant c > 0 such that for all x ∈ X we have

 |2η(x) - 1| ≤ c τ_x^γ. (7)

If η is Hölder about 1/2 with exponent γ > 0, the graph of 2η(x) - 1 lies in a multiple of the envelope defined by τ_x^γ at the top and by -τ_x^γ at the bottom. To be Hölder about 1/2 it is sufficient that η is Hölder continuous, but it is not necessary. A function which is Hölder about 1/2 can be very irregular away from the decision boundary, but it cannot jump across the decision boundary discontinuously. In addition, a Hölder continuous function's exponent must satisfy 0 < γ ≤ 1, whereas being Hölder about 1/2 only requires γ > 0. For distributions with Tsybakov exponent such that η is Hölder about 1/2 we can bound the geometric noise exponent. Indeed, let P be a distribution which has Tsybakov noise exponent q ≥ 0 and a conditional probability which is Hölder about 1/2 with exponent γ > 0. Then (see [10]) P has geometric noise exponent α for all α < γ(q+1)/d.

For distributions having a non-trivial geometric noise exponent we can now bound the approximation error function for Gaussian RBF kernels:

Theorem 2.10 Let X be the closed unit ball of the Euclidean space R^d, and H_σ be the RKHS of the Gaussian RBF kernel k_σ on X with width 1/σ > 0. Furthermore, let P be a distribution with geometric noise exponent 0 < α < ∞. We write a_σ(·) for the approximation error function with respect to H_σ. Then there is a C > 0 such that for all λ > 0, σ > 0 we have

 a_σ(λ) ≤ C (σ^d λ + σ^{-αd}). (8)

In order to let the right hand side of (8) converge to zero it is necessary to assume both λ → 0 and σ → ∞. An easy consideration shows that the fastest rate of convergence can be achieved if σ(λ) := λ^{-1/((α+1)d)}. In this case we have a_{σ(λ)}(λ) ≤ 2C λ^{α/(α+1)}.
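The balance behind this choice is easy to verify numerically: with σ(λ) = λ^{-1/((α+1)d)} the two terms of the bound (8) are equal, and the bound scales as λ^{α/(α+1)}. A small check (the values of α and d below are arbitrary test choices of ours):

```python
import numpy as np

# With sigma(lam) = lam ** (-1 / ((alpha + 1) * d)), both terms of the
# bound C * (sigma**d * lam + sigma**(-alpha * d)) equal
# lam ** (alpha / (alpha + 1)), so the bound is 2 * C * lam ** (alpha / (alpha + 1)).
alpha, d = 2.0, 3.0
lams = 10.0 ** np.arange(-1, -8, -1)
sigmas = lams ** (-1.0 / ((alpha + 1.0) * d))
bound = sigmas ** d * lams + sigmas ** (-alpha * d)   # take C = 1

# log-log slope of the bound in lam should be alpha / (alpha + 1) = 2/3
slope = np.polyfit(np.log(lams), np.log(bound), 1)[0]
print(slope)
```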
Roughly speaking this states that the family of spaces H_{σ(λ)} approximates P with exponent α/(α+1). Note that we can obtain approximation rates up to linear order in λ for sufficiently benign distributions. The price for this good approximation property is, however, an increasing complexity of the hypothesis class H_{σ(λ)} for λ → 0, i.e. σ → ∞. The following theorem estimates this in terms of the complexity exponent:

Theorem 2.11 Let H_σ be the RKHS of the Gaussian RBF kernel k_σ on X. Then for all 0 < p ≤ 2 and δ > 0, there is a c_{p,d,δ} > 0 such that for all ε > 0 and all σ ≥ 1 we have

 sup_{T ∈ Z^n} log N(B_{H_σ}, ε, L_2(T_X)) ≤ c_{p,d,δ} σ^{(1-p/2)(1+δ)d} ε^{-p}.

Having established both results for the approximation and complexity exponent we can now formulate our main result for SVMs using Gaussian RBF kernels:

Theorem 2.12 Let X be the closed unit ball of the Euclidean space R^d, and P be a distribution on X × Y with Tsybakov noise exponent 0 < q ≤ ∞ and geometric noise exponent 0 < α ≤ ∞. We define

 λ_n := n^{-(α+1)/(2α+1)} if α ≤ (q+2)/(2q), and λ_n := n^{-2(α+1)(q+1)/(2α(q+2)+3q+4)} otherwise,

and σ_n := λ_n^{-1/((α+1)d)} in both cases. Then for all ε > 0 there is a C > 0 such that for all x ≥ 1 and all n ≥ 1 the SVM using λ_n and Gaussian RBF kernel with width 1/σ_n satisfies

 Pr(T ∈ (X × Y)^n : R_P(f_{T,λ_n} + b_{T,λ_n}) ≤ R_P + C x² n^{-α/(2α+1) + ε}) ≥ 1 - e^{-x}

if α ≤ (q+2)/(2q), and

 Pr(T ∈ (X × Y)^n : R_P(f_{T,λ_n} + b_{T,λ_n}) ≤ R_P + C x² n^{-2α(q+1)/(2α(q+2)+3q+4) + ε}) ≥ 1 - e^{-x}

otherwise. If α = ∞ the latter holds if σ_n = σ is a constant with σ > 2√d.

Most of the remarks made after Theorem 2.3 also apply to the above theorem up to obvious modifications. In particular this is true for Remark 2.4, Remark 2.5, and Remark 2.7.

3 Discussion of a modified support vector machine

Let us now discuss a recent result (see [11]) on rates for the following modification of the original SVM:

 f̃_{T,λ} := arg min_{f ∈ H} λ‖f‖_H + (1/n) Σ_{i=1}^n l(y_i f(x_i)). (9)

Note that unlike in (3) the norm in the regularization term is not squared in (9).
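For readers who want to experiment, the regularized hinge-loss problem (3) (without the offset b, as in the approximation error function) can be minimized over the span of the kernel sections by plain subgradient descent; the data, kernel width, regularization parameter, and step sizes below are our own toy choices, not a solver prescribed by the paper or by [11]:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy data on X = [-1, 1] with eta(x) = (1 + x) / 2
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = np.where(rng.uniform(size=n) < (1.0 + x) / 2.0, 1.0, -1.0)

sigma, lam = 3.0, 1e-3                      # width 1/sigma, regularization lambda
K = np.exp(-sigma**2 * (x[:, None] - x[None, :])**2)   # Gaussian RBF Gram matrix

# minimize lam * c'Kc + (1/n) * sum of hinge(y_i * (Kc)_i), where f = sum c_i k(x_i, .)
c = np.zeros(n)
for t in range(1, 2001):
    f = K @ c
    active = (y * f < 1.0)                  # examples where the hinge is active
    grad = 2.0 * lam * (K @ c) - (K @ (active * y)) / n
    c -= (0.5 / t) * grad                   # decaying step size

# decision function and empirical risk of sign(f_T)
f = K @ c
train_err = np.mean(np.sign(f) != y)
print(train_err)
```

Since the labels here are noisy, the training error settles near the Bayes risk 1/4 rather than near zero; replacing the gradient of λ c'Kc by that of λ (c'Kc)^{1/2} would turn this sketch into the modified problem (9), with everything else unchanged.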
To describe the result of [11] we need the following modification of the approximation error function:

 ã(λ) := inf_{f ∈ H} λ‖f‖_H + R_{l,P}(f) - R_{l,P}, λ ≥ 0. (10)

Obviously, ã(·) plays the same role for (9) as a(·) does for (3). Moreover, it is easy to see that for all λ > 0 for which the minimizer f_{P,λ} of (3) satisfies ‖f_{P,λ}‖_H ≤ 1, we have a(λ) ≤ ã(λ). Now, a slightly simplified version of the result in [11] reads as follows:

Theorem 3.1 Let H be a RKHS of a continuous kernel on X with complexity exponent 0 < p < 2, and let P be a distribution on X × Y with Tsybakov noise exponent ∞. We define λ_n := n^{-2/(2+p)}. Then for all x ≥ 1 there is a C_x > 0 such that for all n ≥ 1 we have

 Pr(T ∈ (X × Y)^n : R_P(f̃_{T,λ_n}) ≤ R_P + C_x (ã(λ_n) + n^{-2/(2+p)})) ≥ 1 - e^{-x}.

Besides universal constants the exact value of C_x is given in [11]. Also note that the original result of [11] used the eigenvalue distribution of the integral operator T_k : L_2(P_X) → L_2(P_X) as a complexity measure. If H has complexity exponent p it can be shown that the i-th of these eigenvalues decays at least as fast as i^{-2/p}. Since we only want to compare Theorem 3.1 with our results we do not state the eigenvalue version of Theorem 3.1.

It was also mentioned in [11] that using the techniques therein it is possible to derive rates for the original SVM. In this case ã(λ_n) has to be replaced by a(λ_n) and the stochastic term n^{-2/(2+p)} has to be replaced by \"some more involved term\" (see [11, p.10]). Since typically ã(·) decreases faster than a(·), the authors conclude that using a regularization term ‖·‖ instead of the original ‖·‖² will \"necessarily yield an improved convergence rate\" (see [11, p.11]). Let us now show that this conclusion is not justified. To this end let us suppose that H approximates P with exponent 0 < β < 1, i.e. a(λ) ≤ C λ^β for some C > 0 and all λ > 0. It was shown in [10] that this is equivalent to

 inf_{‖f‖_H ≤ λ^{-1/2}} R_{l,P}(f) - R_{l,P} ≤ c_1 λ^{β/(1-β)} (11)

for some constant c_1 > 0 and all λ > 0.
Furthermore, using the techniques in [10] it is straightforward to show that (11) is equivalent to ã(λ) ≤ c_2 λ^{2β/(1+β)}. Therefore, if H approximates P with exponent β then the rate in Theorem 3.1 becomes n^{-4β/((2+p)(1+β))}, which is the rate we established in Theorem 2.3 for the original SVM. Although the original SVM (3) and the modification (9) learn with the same rate, there is a substantial difference in the way the regularization parameter has to be chosen in order to achieve this rate. Indeed, for the original SVM we have to use λ_n = n^{-4/((2+p)(1+β))}, while for (9) we have to choose λ_n = n^{-2/(2+p)}. In other words, since p is known for typical RKHSs but β is not, we know the asymptotically optimal choice of λ_n for (9) while we do not know the corresponding optimal choice for the standard SVM. It is natural to ask whether a similar observation can be made if we have a Tsybakov noise exponent which is smaller than ∞. The answer to this question is \"yes\" and \"no\". More precisely, using our techniques in [10] one can show that for 0 < q ≤ ∞ the optimal choice of the regularization parameter in (9) is λ_n = n^{-2(q+1)/(2q+pq+4)}, leading to the rate n^{-4β(q+1)/((2q+pq+4)(1+β))}. As for q = ∞, this rate coincides with the rate we obtained for the standard SVM. Furthermore, the asymptotically optimal choice of λ_n is again independent of the approximation exponent β. However, it depends on the (typically unknown) noise exponent q. This leads to the following important questions:

Question 1: Is it easier to find an almost optimal choice of λ for (9) than for the standard SVM? And if so, what are the computational requirements of solving (9)?

Question 2: Can a similar observation be made for the parametric family of Gaussian RBF kernels used in Theorem 2.12 if P has a non-trivial geometric noise exponent α?

References

 [1] I. Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768-791, 2002.

 [2] T. Zhang.
Statistical behaviour and consistency of classification methods based on convex risk minimization. Ann. Statist., 32:56-134, 2004.

 [3] I. Steinwart. Consistency of support vector machines and other regularized kernel machines. IEEE Trans. Inform. Theory, to appear, 2005.

 [4] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.

 [5] A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32:135-166, 2004.

 [6] Y. Yang. Minimax nonparametric classification, part I and II. IEEE Trans. Inform. Theory, 45:2271-2292, 1999.

 [7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

 [8] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res., 2:67-93, 2001.

 [9] Q. Wu and D.-X. Zhou. Analysis of support vector machine classification. Tech. Report, City University of Hong Kong, 2003.

 [10] C. Scovel and I. Steinwart. Fast rates for support vector machines. Ann. Statist., submitted, 2003. http://www.c3.lanl.gov/~ingo/publications/ann-03.ps.

 [11] G. Blanchard, O. Bousquet, and P. Massart. Statistical performance of support vector machines. Ann. Statist., submitted, 2004.

 [12] S. Smale and D.-X. Zhou. Estimating the approximation error in learning theory. Anal. Appl., 1:17-41, 2003.

 [13] I. Steinwart. Sparseness of support vector machines. J. Mach. Learn. Res., 4:1071-1105, 2003.
", "award": [], "sourceid": 2630, "authors": [{"given_name": "Ingo", "family_name": "Steinwart", "institution": null}, {"given_name": "Clint", "family_name": "Scovel", "institution": null}]}