{"title": "Improving the Accuracy and Speed of Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 375, "page_last": 381, "abstract": null, "full_text": "Improving the Accuracy and Speed of Support Vector Machines \n\nChris J.C. Burges \nBell Laboratories \nLucent Technologies, Room 3G429 \n101 Crawford's Corner Road \nHolmdel, NJ 07733-3030 \nburges@bell-labs.com \n\nBernhard Schölkopf* \nMax-Planck-Institut für biologische Kybernetik \nSpemannstr. 38 \n72076 Tübingen, Germany \nbs@mpik-tueb.mpg.de \n\nAbstract \n\nSupport Vector Learning Machines (SVM) are finding application in pattern recognition, regression estimation, and operator inversion for ill-posed problems. Against this very general backdrop, any methods for improving the generalization performance, or for improving the speed in test phase, of SVMs are of increasing interest. In this paper we combine two such techniques on a pattern recognition problem. The method for improving generalization performance (the \"virtual support vector\" method) does so by incorporating known invariances of the problem. This method achieves a drop in the error rate on 10,000 NIST test digit images from 1.4% to 1.0%. The method for improving the speed (the \"reduced set\" method) does so by approximating the support vector decision surface. We apply this method to achieve a factor of fifty speedup in test phase over the virtual support vector machine. The combined approach yields a machine which is both 22 times faster than the original machine and has better generalization performance, achieving 1.1% error. The virtual support vector method is applicable to any SVM problem with known invariances. The reduced set method is applicable to any support vector machine. 
\n\n1 INTRODUCTION \n\nSupport Vector Machines are known to give good results on pattern recognition problems despite the fact that they do not incorporate problem domain knowledge. However, they exhibit classification speeds which are substantially slower than those of neural networks (LeCun et al., 1995). \n\n*Part of this work was done while B.S. was with AT&T Research, Holmdel, NJ. \n\nThe present study is motivated by the above two observations. First, we shall improve accuracy by incorporating knowledge about invariances of the problem at hand. Second, we shall increase classification speed by reducing the complexity of the decision function representation. This paper thus brings together two threads explored by us during the last year (Schölkopf, Burges & Vapnik, 1996; Burges, 1996). \n\nThe method for incorporating invariances is applicable to any problem for which the data is expected to have known symmetries. The method for improving the speed is applicable to any support vector machine. Thus we expect these methods to be widely applicable to problems beyond pattern recognition (for example, to the regression estimation problem (Vapnik, Golowich & Smola, 1996)). \n\nAfter a brief overview of Support Vector Machines in Section 2, we describe how problem domain knowledge was used to improve generalization performance in Section 3. Section 4 contains an overview of a general method for improving the classification speed of Support Vector Machines. Results are collected in Section 5. We conclude with a discussion. \n\n2 SUPPORT VECTOR LEARNING MACHINES \n\nThis Section summarizes those properties of Support Vector Machines (SVM) which are relevant to the discussion below. For details on the basic SVM approach, the reader is referred to (Boser, Guyon & Vapnik, 1992; Cortes & Vapnik, 1995; Vapnik, 1995). We end by noting a physical analogy. 
\nLet the training data be elements x_i ∈ L, L = R^d, i = 1, ..., ℓ, with corresponding class labels y_i ∈ {±1}. An SVM performs a mapping Φ : L → H into a high (possibly infinite) dimensional Hilbert space H; in the following, the image of a vector x under this mapping is written Φ(x). In H, the SVM decision rule is simply a separating hyperplane: the algorithm constructs a decision surface with normal Ψ ∈ H which separates the Φ(x_i) into two classes: \n\nΨ·Φ(x_i) + b ≥ k0 - ξ_i,  y_i = +1   (1) \nΨ·Φ(x_i) + b ≤ k1 + ξ_i,  y_i = -1   (2) \n\nwhere the ξ_i are positive slack variables, introduced to handle the non-separable case (Cortes & Vapnik, 1995), and where k0 and k1 are typically defined to be +1 and -1, respectively. Ψ is computed by minimizing the objective function \n\nΨ·Ψ/2 + C (Σ_{i=1..ℓ} ξ_i)^p   (3) \n\nsubject to (1), (2), where C is a constant, and we choose p = 2. In the separable case, the SVM algorithm constructs that separating hyperplane for which the margin between the positive and negative examples in H is maximized. A test vector x ∈ L is then assigned a class label in {+1, -1} depending on whether Ψ·Φ(x) + b is greater or less than (k0 + k1)/2. Support vectors s_j ∈ L are defined as training samples for which one of Equations (1) or (2) is an equality. (We name the support vectors s to distinguish them from the rest of the training data.) The solution Ψ may be expressed as \n\nΨ = Σ_{j=1..Ns} α_j y_j Φ(s_j)   (4) \n\nwhere the α_j ≥ 0 are positive weights, determined during training, y_j ∈ {±1} are the class labels of the s_j, and Ns is the number of support vectors. Thus in order to classify a test point x one must compute \n\nΨ·Φ(x) = Σ_{j=1..Ns} α_j y_j Φ(s_j)·Φ(x) = Σ_{j=1..Ns} α_j y_j K(s_j, x).   (5) \n\nOne of the key properties of support vector machines is the use of the kernel K to compute dot products in H without having to explicitly compute the mapping Φ. \n\nIt is interesting to note that the solution has a simple physical interpretation in the high dimensional space H. If we assume that each support vector s_j exerts a perpendicular force of size α_j and sign y_j on a solid plane sheet lying along the hyperplane Ψ·Φ(x) + b = (k0 + k1)/2, then the solution satisfies the requirements of mechanical stability. At the solution, the α_j can be shown to satisfy Σ_{j=1..Ns} α_j y_j = 0, which translates into the forces on the sheet summing to zero; and Equation (4) implies that the torques also sum to zero. \n\n3 IMPROVING ACCURACY \n\nThis section follows the reasoning of (Schölkopf, Burges & Vapnik, 1996). Problem domain knowledge can be incorporated in two different ways: the knowledge can be directly built into the algorithm, or it can be used to generate artificial training examples (\"virtual examples\"). The latter significantly slows down training times, due both to correlations in the artificial data and to the increased training set size (Simard et al., 1992); however it has the advantage of being readily implemented for any learning machine and for any invariances. For instance, if instead of Lie groups of symmetry transformations one is dealing with discrete symmetries, such as the bilateral symmetries of Vetter, Poggio & Bülthoff (1994), then derivative-based methods (e.g. Simard et al., 1992) are not applicable. \n\nFor support vector machines, an intermediate method which combines the advantages of both approaches is possible. The support vectors characterize the solution to the problem in the following sense: if all the other training data were removed and the system retrained, then the solution would be unchanged. 
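To make the classification rule of Equation (5) concrete, here is a minimal sketch of the resulting decision function in plain Python. The degree-5 polynomial kernel K(x, y) = (x · y)^5 follows the setup reported in Section 5, but the function names and the toy support vectors, weights, and labels are ours for illustration, not the authors' implementation.

```python
def dot(u, v):
    # Plain dot product in the input space L.
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, degree=5):
    # K(u, v) = (u . v)^degree; degree 5 is the choice used in Section 5.
    return dot(u, v) ** degree

def svm_decision(x, svs, alphas, labels, b, kernel=poly_kernel):
    # Equation (5): the kernel replaces dot products of mapped vectors,
    # so the mapping into the high dimensional space is never computed.
    return sum(a * y * kernel(s, x)
               for a, y, s in zip(alphas, labels, svs)) + b

# Toy usage with two made-up support vectors and weights:
svs = [[1.0, 0.0], [0.0, 1.0]]
print(svm_decision([2.0, 0.0], svs, [0.5, 0.5], [1, -1], b=0.0))  # 16.0
```

Evaluating K(s_j, x) directly is what makes the rule tractable: Φ(x) itself, which may live in an infinite dimensional space, is never formed.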
Furthermore, those support vectors s_i which are not errors are close to the decision boundary in H, in the sense that they either lie exactly on the margin (ξ_i = 0) or close to it (ξ_i < 1). Finally, different types of SVM, built using different kernels, tend to produce the same set of support vectors (Schölkopf, Burges & Vapnik, 1995). This suggests the following algorithm: first, train an SVM to generate a set of support vectors {s_1, ..., s_Ns}; then, generate the artificial examples (virtual support vectors) by applying the desired invariance transformations to {s_1, ..., s_Ns}; finally, train another SVM on the new set. To build a ten-class classifier, this procedure is carried out separately for ten binary classifiers. \n\nApart from the increase in overall training time (by a factor of two, in our experiments), this technique has the disadvantage that many of the virtual support vectors become support vectors for the second machine, increasing the number of summands in Equation (5) and hence decreasing classification speed. However, the latter problem can be solved with the reduced set method, which we describe next. \n\n4 IMPROVING CLASSIFICATION SPEED \n\nThe discussion in this Section follows that of (Burges, 1996). Consider a set of vectors z_k ∈ L, k = 1, ..., Nz, and corresponding weights r_k ∈ R for which \n\nΨ' = Σ_{k=1..Nz} r_k Φ(z_k)   (6) \n\nminimizes (for fixed Nz) the Euclidean distance to the original solution: \n\nρ = ||Ψ - Ψ'||.   (7) \n\nNote that ρ, expressed here in terms of vectors in H, can be expressed entirely in terms of functions (using the kernel K) of vectors in the input space L. The set {(r_k, z_k) | k = 1, ..., Nz} is called the reduced set. To classify a test point x, the expansion in Equation (5) is replaced by the approximation \n\nΨ'·Φ(x) = Σ_{k=1..Nz} r_k Φ(z_k)·Φ(x) = Σ_{k=1..Nz} r_k K(z_k, x).   (8) \n\nThe goal is then to choose the smallest Nz ≪ Ns, and corresponding reduced set, such that any resulting loss in generalization performance remains acceptable. Clearly, by allowing Nz = Ns, ρ can be made zero. Interestingly, there are nontrivial cases where Nz < Ns and ρ = 0, in which case the reduced set leads to an increase in classification speed with no loss in generalization performance. Note that reduced set vectors are not support vectors, in that they do not necessarily lie on the separating margin and, unlike support vectors, are not training samples. \n\nWhile the reduced set can be found exactly in some cases, in general an unconstrained conjugate gradient method is used to find the z_k (while the corresponding optimal r_k can be found exactly, for all k). The method for finding the reduced set is computationally very expensive: the final phase constitutes a conjugate gradient descent in a space of (d + 1)·Nz variables, which in our case is typically of order 50,000. \n\n5 EXPERIMENTAL RESULTS \n\nIn this Section, by \"accuracy\" we mean generalization performance, and by \"speed\" we mean classification speed. In our experiments, we used the MNIST database of 60,000 training and 10,000 test handwritten digits, which was used in the comparison study of LeCun et al. (1995). In that study, the error rate record of 0.7% is held by a boosted convolutional neural network (\"LeNet4\"). \n\nWe start by summarizing the results of the virtual support vector method. We trained ten binary classifiers using C = 10 in Equation (3). We used a polynomial kernel K(x, y) = (x·y)^5. Combining classifiers then gave 1.4% error on the 10,000 image test set; this system is referred to as ORIG below. We then generated new training data by translating the resulting support vectors by one pixel in each of four directions, and trained a new machine (using the same parameters). 
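The virtual support vector generation step just described (translating each support vector image by one pixel in each of four directions) can be sketched as follows. The tiny 3x3 image and the function names are illustrative only, and the retraining step on the enlarged set is omitted:

```python
def translate(image, dr, dc):
    # Shift a list-of-lists image by (dr, dc) pixels, padding with zeros.
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                out[rr][cc] = image[r][c]
    return out

def virtual_svs(support_vectors):
    # One shifted copy per direction: up, down, left, right.
    shifts = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [(translate(img, dr, dc), label)
            for img, label in support_vectors
            for dr, dc in shifts]

# A single made-up support vector image with label +1:
sv = ([[0, 0, 0], [0, 1, 0], [0, 0, 0]], 1)
print(len(virtual_svs([sv])))  # 4
```

Each support vector thus contributes four virtual examples, so the second machine is trained on a set whose size is governed by the number of support vectors rather than by the full training set.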
This machine, which is referred to as VSV below, achieved 1.0% error on the test set. The results for each digit are given in Table 1. \n\nNote that the improvement in accuracy comes at a cost in speed of approximately a factor of 2. Furthermore, the speed of ORIG was comparatively slow to start with (LeCun et al., 1995), requiring approximately 14 million multiply-adds for one classification (this can be reduced by caching results of repeated support vectors (Burges, 1996)). \n\nTable 1: Generalization Performance Improvement by Incorporating Invariances. N_E and N_SV are the number of errors and the number of support vectors, respectively; \"ORIG\" refers to the original support vector machine, \"VSV\" to the machine trained on virtual support vectors. \n
\nDigit  N_E ORIG  N_E VSV  N_SV ORIG  N_SV VSV 
\n0      17        15       1206       2938 
\n1      15        13       757        1887 
\n2      34        23       2183       5015 
\n3      32        21       2506       4764 
\n4      30        30       1784       3983 
\n5      29        23       2255       5235 
\n6      30        18       1347       3328 
\n7      43        39       1712       3968 
\n8      47        35       3053       6978 
\n9      56        40       2720       6348 \n\nTable 2: Dependence of Performance of Reduced Set System on Threshold. The numbers in parentheses give the corresponding number of errors on the test set. Note that Thrsh Test gives a lower bound for these numbers. \n
\nDigit  Thrsh VSV      Thrsh Bayes    Thrsh Test 
\n0      1.39606 (9)    1.48648 (8)    1.54696 (7) 
\n1      3.98722 (24)   4.43154 (12)   4.32039 (10) 
\n2      1.27175 (31)   1.33081 (30)   1.26466 (29) 
\n3      1.26518 (29)   1.42589 (27)   1.33822 (26) 
\n4      2.18764 (37)   2.3727 (35)    2.30899 (33) 
\n5      2.05222 (33)   2.21349 (27)   2.27403 (24) 
\n6      0.95086 (25)   1.06629 (24)   0.790952 (20) 
\n7      3.0969 (59)    3.34772 (57)   3.27419 (54) 
\n8      -1.06981 (39)  -1.19615 (40)  -1.26365 (37) 
\n9      1.10586 (40)   1.10074 (40)   1.13754 (39) 
\nIn order to become competitive with systems of comparable accuracy, we will need approximately a factor of fifty improvement in speed. We therefore approximated VSV with a reduced set system RS with a factor of fifty fewer vectors than the number of support vectors in VSV. \n\nSince the reduced set method computes an approximation to the decision surface in the high dimensional space, it is likely that the accuracy of RS could be improved by choosing a different threshold b in Equations (1) and (2). We computed that threshold which gave the empirical Bayes error for the RS system, measured on the training set. This can be done easily by finding the maximum of the difference between the two un-normalized cumulative distributions of the values of the dot products Ψ·Φ(x_i), where the x_i are the original training data. Note that the effects of bias are reduced by the fact that VSV (and hence RS) was trained only on shifted data, and not on any of the original data. Thus, in the absence of a validation set, the original training data provide a reasonable means of estimating the Bayes threshold. This is a serendipitous bonus of the VSV approach. Table 2 compares results obtained using the threshold generated by the training procedure for the VSV system; the estimated Bayes threshold for the RS system; and, for comparison purposes only (to see the maximum possible effect of varying the threshold), the Bayes error computed on the test set. \n\nTable 3 compares results on the test set for the three systems, where the Bayes threshold (computed with the training set) was used for RS. The results for all ten digits combined are 1.4% error for ORIG, 1.0% for VSV (with roughly twice as many multiply-adds), and 1.1% for RS (with a factor of 22 fewer multiply-adds than ORIG). \n\nTable 3: Speed Improvement Using the Reduced Set Method. The second through fourth columns give numbers of errors on the test set for the original system, the virtual support vector system, and the reduced set system. The last three columns give, for each system, the number of vectors whose dot product must be computed in test phase. \n
\nDigit  ORIG Err  VSV Err  RS Err  ORIG #SV  VSV #SV  #RSV 
\n0      17        15       18      1206      2938     59 
\n1      15        13       12      757       1887     38 
\n2      34        23       30      2183      5015     100 
\n3      32        21       27      2506      4764     95 
\n4      30        30       35      1784      3983     80 
\n5      29        23       27      2255      5235     105 
\n6      30        18       24      1347      3328     67 
\n7      43        39       57      1712      3968     79 
\n8      47        35       40      3053      6978     140 
\n9      56        40       40      2720      6348     127 \n\nThe reduced set conjugate gradient algorithm does not reduce the objective function ρ^2 (Equation (7)) to zero. For example, for the first 5 digits, ρ^2 is only reduced on average by a factor of 2.4 (the algorithm is stopped when progress becomes too slow). It is striking that good results are nevertheless achieved. \n\n6 DISCUSSION \n\nThe only systems in LeCun et al. (1995) with better than 1.1% error are LeNet5 (0.9% error, with approximately 350K multiply-adds) and boosted LeNet4 (0.7% error, approximately 450K multiply-adds). Clearly SVMs are not in this league yet (the RS system described here requires approximately 650K multiply-adds). \n\nHowever, SVMs present clear opportunities for further improvement. (In fact, we have since trained a VSV system with 0.8% error, by choosing a different kernel.) More invariances (for example, for the pattern recognition case, small rotations, or varying ink thickness) could be added to the virtual support vector approach. 
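As an aside, the empirical Bayes threshold selection used for RS in Section 5 (finding the maximum of the difference between the two un-normalized cumulative distributions of the training scores) can be sketched as follows; the score lists and the function name are made up for illustration:

```python
def bayes_threshold(neg_scores, pos_scores):
    # Candidate thresholds lie between consecutive distinct score values.
    scores = sorted(set(neg_scores) | set(pos_scores))
    best_t, best_gap = None, None
    for lo, hi in zip(scores, scores[1:]):
        t = 0.5 * (lo + hi)
        # Difference of the two un-normalized cumulative distributions
        # at t; maximizing it minimizes the number of training errors
        # made when classifying by comparing the score against t.
        gap = (sum(s < t for s in neg_scores)
               - sum(s < t for s in pos_scores))
        if best_gap is None or gap > best_gap:
            best_t, best_gap = t, gap
    return best_t

# Made-up dot product values for negative and positive training examples:
print(bayes_threshold([-2.0, -1.0, 0.25], [0.75, 1.0, 2.0]))  # 0.5
```

No validation set is needed: the scores come from the original training data, which (as noted in Section 5) the VSV machine never saw directly.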
\nFurther, one might use only those virtual support vectors which provide new information about the decision boundary, or use a measure of such information to keep only the most important vectors. Known invariances could also be built directly into the SVM objective function. \n\nViewed as an approach to function approximation, the reduced set method is currently restricted in that it assumes a decision function with the same functional form as the original SVM. In the case of quadratic kernels, the reduced set can be computed both analytically and efficiently (Burges, 1996). However, the conjugate gradient descent computation for the general kernel is very inefficient. Perhaps relaxing the above restriction could lead to analytical methods which would apply to more complex kernels also. \n\nAcknowledgements \n\nWe wish to thank V. Vapnik, A. Smola and H. Drucker for discussions. C. Burges was supported by ARPA contract N00014-94-C-0186. B. Schölkopf was supported by the Studienstiftung des deutschen Volkes. \n\nReferences \n\n[1] Boser, B. E., Guyon, I. M., Vapnik, V., A Training Algorithm for Optimal Margin Classifiers, Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM (1992) 144-152. \n\n[2] Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Jackel, L. D., Le Cun, Y., Muller, U. A., Sackinger, E., Simard, P., Vapnik, V., Comparison of Classifier Methods: a Case Study in Handwritten Digit Recognition, Proceedings of the 12th International Conference on Pattern Recognition and Neural Networks, Jerusalem (1994). \n\n[3] Burges, C. J. C., Simplified Support Vector Decision Rules, 13th International Conference on Machine Learning (1996) 71-77. \n\n[4] Cortes, C., Vapnik, V., Support Vector Networks, Machine Learning 20 (1995) 273-297. \n\n[5] LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., Drucker, H., Guyon, I., Muller, U., Sackinger, E., Simard, P., Vapnik, V., Comparison of Learning Algorithms for Handwritten Digit Recognition, International Conference on Artificial Neural Networks, Ed. F. Fogelman, P. Gallinari (1995) 53-60. \n\n[6] Schölkopf, B., Burges, C. J. C., Vapnik, V., Extracting Support Data for a Given Task, in Fayyad, U. M., Uthurusamy, R. (eds.), Proceedings, First International Conference on Knowledge Discovery & Data Mining, AAAI Press, Menlo Park, CA (1995). \n\n[7] Schölkopf, B., Burges, C. J. C., Vapnik, V., Incorporating Invariances in Support Vector Learning Machines, in Proceedings ICANN'96 - International Conference on Artificial Neural Networks, Springer Verlag, Berlin (1996). \n\n[8] Simard, P., Victorri, B., Le Cun, Y., Denker, J., Tangent Prop - a Formalism for Specifying Selected Invariances in an Adaptive Network, in Moody, J. E., Hanson, S. J., Lippmann, R. P. (eds.), Advances in Neural Information Processing Systems 4, Morgan Kaufmann, San Mateo, CA (1992). \n\n[9] Vapnik, V., Estimation of Dependences Based on Empirical Data [in Russian], Nauka, Moscow (1979); English translation: Springer Verlag, New York (1982). \n\n[10] Vapnik, V., The Nature of Statistical Learning Theory, Springer Verlag, New York (1995). \n\n[11] Vapnik, V., Golowich, S., Smola, A., Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing, submitted to Advances in Neural Information Processing Systems, 1996. \n\n[12] Vetter, T., Poggio, T., Bülthoff, H., The Importance of Symmetry and Virtual Views in Three-Dimensional Object Recognition, Current Biology 4 (1994) 18-23.", "award": [], "sourceid": 1253, "authors": [{"given_name": "Christopher", "family_name": "Burges", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}