{"title": "Fast Training of Support Vector Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 734, "page_last": 740, "abstract": null, "full_text": "Fast Training of Support Vector Classifiers \n\nF. Perez-Cruzt, P. L. Alarc6n-Dianat, A. Navia-V azquez:j:and A. Artes-Rodriguez:j:. \n\ntDpto. Teoria de la Seiial y Com., Escuela Politecnica, Universidad de Alcala. \n\n28871-Alcala de Henares (Madrid) Spain. e-mail: fernando@tsc.uc3m.es \n:j:Dpto. Tecnologias de las comunicaciones, Escuela Politecnica Superior, \n\nUniversidad Carlos ill de Madrid, Avda. Universidad 30, 28911-Leganes (Madrid) Spain. \n\nAbstract \n\nIn this communication we present a new algorithm for solving Support \nVector Classifiers (SVC) with large training data sets. The new algorithm \nis based on an Iterative Re-Weighted Least Squares procedure which is \nused to optimize the SVc. Moreover, a novel sample selection strategy \nfor the working set is presented, which randomly chooses the working \nset among the training samples that do not fulfill the stopping criteria. \nThe validity of both proposals, the optimization procedure and sample \nselection strategy, is shown by means of computer experiments using \nwell-known data sets. \n\n1 INTRODUCTION \n\nThe Support Vector Classifier (SVC) is a powerful tool to solve pattern recognition prob(cid:173)\nlems [13, 14] in such a way that the solution is completely described as a linear combination \nof several training samples, named the Support Vectors. The training procedure for solving \nthe SVC is usually based on Quadratic Programming (QP) which presents some inherent \nlimitations, mainly the computational complexity and memory requirements for large train(cid:173)\ning data sets. This problem is typically avoided by dividing the QP problem into sets of \nsmaller ones [6, 1, 7, 11], that are iteratively solved in order to reach the SVC solution for \nthe whole set of training samples. 
These schemes rely on an optimizing engine, QP, and on the sample selection strategy for each sub-problem, in order to obtain a fast solution for the SVC. \n\nAn Iterative Re-Weighted Least Squares (IRWLS) procedure has already been proposed as an alternative solver for the SVC [10] and the Support Vector Regressor [9], being computationally efficient in absolute terms. In this communication, we will show that the IRWLS algorithm can replace the QP one in any chunking scheme in order to find the SVC solution for large training data sets. Moreover, we consider that the strategy used to decide which training samples must join the working set is critical in reducing the total number of iterations needed to attain the SVC solution, and consequently the runtime complexity. To address this issue, the computer program svcradit has been developed, which solves the SVC for large training data sets using the IRWLS procedure and fixed-size working sets. \n\nThe paper is organized as follows. In Section 2, we start by giving a summary of the IRWLS procedure for the SVC and explain how it can be incorporated into a chunking scheme to obtain an overall implementation which efficiently deals with large training data sets. We present in Section 3 a novel strategy to make up the working set. Section 4 shows the capabilities of the new implementation, which are compared with the fastest available SVC implementation, SVMlight [6]. We end with some concluding remarks. \n\n2 IRWLS-SVC \n\nIn order to solve classification problems, the SVC has to minimize \n\nL_p = ½‖w‖² + C Σ_i e_i − Σ_i μ_i e_i − Σ_i α_i (y_i(φᵀ(x_i)w + b) − 1 + e_i)    (1) \n\nwith respect to w, b and e_i, and maximize it with respect to α_i and μ_i, subject to α_i, μ_i ≥ 0, where φ(·) is a nonlinear transformation (usually unknown) to a higher dimensional space and C is a penalization factor. The solution to (1) is defined by the Karush-Kuhn-Tucker (KKT) conditions [2]. 
For further details on the SVC, one can refer to the tutorial survey by Burges [2] and to the work of Vapnik [13, 14]. \n\nIn order to obtain an IRWLS procedure we will first need to rearrange (1) in such a way that the terms depending on e_i can be removed, because at the solution C − α_i − μ_i = 0 ∀i (one of the KKT conditions [2]) must hold: \n\nL_p = ½‖w‖² + Σ_i α_i (1 − y_i(φᵀ(x_i)w + b)) = ½‖w‖² + ½ Σ_i a_i e_i²    (2) \n\nwhere e_i = y_i − (φᵀ(x_i)w + b) and a_i = 2α_i / (e_i y_i). \n\nThe weighted least squares nature of (2) can be understood if e_i is defined as the error on each sample and a_i as its associated weight, where ½‖w‖² is a regularizing functional. The minimization of (2) cannot be accomplished in a single step because a_i = a_i(e_i), and we need to apply an IRWLS procedure [4], summarized below in three steps: \n\n1. Considering the a_i fixed, minimize (2). \n2. Recalculate a_i from the solution of step 1. \n3. Repeat until convergence. \n\nIn order to work with Reproducing Kernels in Hilbert Space (RKHS), as the QP procedure does, we require that w = Σ_i β_i y_i φ(x_i) and, in order to obtain a non-zero b, that Σ_i β_i y_i = 0. Substituting them into (2), its minimum with respect to β_i and b for a fixed set of a_i is found by solving the following linear equation system¹: \n\n[ H + D_a⁻¹   y ] [ β ]   [ 1 ] \n[     yᵀ      0 ] [ b ] = [ 0 ]    (3) \n\nwhere \n\nβ = [β_1, β_2, ..., β_n]ᵀ    (4) \ny = [y_1, y_2, ..., y_n]ᵀ    (5) \n(H)_ij = y_i y_j φᵀ(x_i)φ(x_j) = y_i y_j K(x_i, x_j)   ∀i,j = 1, ..., n    (6) \n(D_a)_ij = a_i δ[i − j]   ∀i,j = 1, ..., n    (7) \n\nand δ[·] is the discrete impulse function. Finally, the dependency of a_i upon the Lagrange multipliers is eliminated using the KKT conditions, obtaining \n\na_i = 0 if e_i y_i < 0;   a_i = 2C / (e_i y_i) if e_i y_i ≥ 0    (8) \n\n¹The detailed description of the steps needed to obtain (3) from (2) can be found in [10]. \n\n2.1 IRWLS ALGORITHMIC IMPLEMENTATION \n\nThe SVC solution with the IRWLS procedure can be simplified by dividing the training samples into three sets. 
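To make the quantities in (3)-(7) concrete, the following sketch (ours, not part of the original paper) builds H for an RBF kernel, as used in the experiments of Section 4, and solves the bordered system for β and b with the weights a_i held fixed. Function and variable names are our own choices.

```python
import numpy as np

def rbf_kernel(X, sigma):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def solve_irwls_system(K, y, a):
    """Solve eq. (3), [[H + D_a^{-1}, y], [y^T, 0]] [beta; b] = [1; 0],
    for the samples with a_i > 0 (the set S1 of Section 2.1)."""
    H = (y[:, None] * y[None, :]) * K        # (H)_ij = y_i y_j K(x_i, x_j), eq. (6)
    n = len(y)
    M = np.zeros((n + 1, n + 1))
    M[:n, :n] = H + np.diag(1.0 / a)         # H + D_a^{-1}
    M[:n, n] = M[n, :n] = y                  # border enforcing sum_i beta_i y_i = 0
    rhs = np.concatenate([np.ones(n), [0.0]])
    sol = np.linalg.solve(M, rhs)
    return sol[:n], sol[n]                   # beta, b
```

Since H is positive semidefinite and the weights are positive, the bordered matrix is nonsingular, so a single dense solve (or an LU factorization, as the conclusions suggest) suffices.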
The first set, S1, contains the training samples verifying 0 < β_i < C, which have to be determined by solving (3). The second one, S2, includes every training sample whose β_i = 0. And the last one, S3, is made up of the training samples whose β_i = C. This division into sets is fully justified in [10]. The IRWLS-SVC algorithm is shown in Table 1. \n\n0. Initialization: S1 will contain every training sample, S2 = ∅ and S3 = ∅. Compute H. e_old = y, β_old = 0, b_old = 0, G13 = G_in, a = 1 and G_b3 = G_bin. \n1. Solve \n[ (H)_{S1,S1} + (D_a⁻¹)_{S1}   (y)_{S1} ] [ (β)_{S1} ]   [ 1 − G13 ] \n[          (y)ᵀ_{S1}               0    ] [     b    ] = [   G_b3   ] \nwith (β)_{S2} = 0 and (β)_{S3} = C. \n2. e = e_old − D_y H(β − β_old) − (b − b_old)1 \n3. a_i = 0 if e_i y_i < 0; a_i = 2C/(e_i y_i) if e_i y_i ≥ 0, ∀i ∈ S1 ∪ S2 ∪ S3 \n4. Sets reordering: \na. Move every sample in S3 with e_i y_i < 0 to S2. \nb. Move every sample in S1 with β_i = C to S3. \nc. Move every sample in S1 with a_i = 0 to S2. \nd. Move every sample in S2 with a_i ≠ 0 to S1. \n5. e_old = e, β_old = β, G13 = (H)_{S1,S3}(β)_{S3} + (G_in)_{S1}, b_old = b and G_b3 = −(y)ᵀ_{S3}(β)_{S3} + G_bin. \n6. Go to step 1 and repeat until convergence. \n\nTable 1: IRWLS-SVC algorithm. \n\nThe IRWLS-SVC procedure has to be slightly modified in order to be used inside a chunking scheme such as the one proposed in [8, 6], so that it can be directly applied to the one proposed in [1]. A chunking scheme is needed to solve the SVC whenever H is too large to fit into memory. In those cases, several SVCs with a reduced set of training samples are iteratively solved until the solution for the whole set is found. The samples are divided into a working set, S_w, which is solved as a full SVC problem, and an inactive set, S_in. If there are support vectors in the inactive set, as there may well be, the inactive set modifies the IRWLS-SVC procedure, adding a contribution to the independent term of the linear equation system (3). 
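Leaving aside the chunking machinery and the S1/S2/S3 bookkeeping, the core iteration of Table 1 can be sketched as follows. This is our simplification, not the paper's exact algorithm: every sample stays in the solved system, and the weights a_i are merely clipped to a positive range so that D_a remains invertible.

```python
import numpy as np

def irwls_svc(K, y, C, n_iter=50, floor=1e-6):
    """Sketch of the IRWLS iteration: solve the weighted LS problem for
    fixed a (step 1), recompute the errors e (step 2) and the weights a
    from eq. (8) (step 3), and repeat until beta stabilises."""
    n = len(y)
    H = (y[:, None] * y[None, :]) * K            # (H)_ij = y_i y_j K(x_i, x_j)
    a = np.ones(n)
    beta, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        # Step 1: solve the bordered system (3) with the current weights.
        M = np.zeros((n + 1, n + 1))
        M[:n, :n] = H + np.diag(1.0 / a)
        M[:n, n] = M[n, :n] = y
        sol = np.linalg.solve(M, np.concatenate([np.ones(n), [0.0]]))
        beta_new, b = sol[:n], sol[n]
        # Step 2: errors e_i = y_i - (sum_j beta_j y_j K(x_i, x_j) + b).
        e = y - y * (H @ beta_new) - b
        # Step 3: recompute the weights from eq. (8), clipped for stability.
        ey = e * y
        a = np.where(ey < 0, 0.0, 2.0 * C / np.maximum(np.abs(ey), 1e-12))
        a = np.clip(a, floor, 1.0 / floor)
        if np.max(np.abs(beta_new - beta)) < 1e-6:
            beta = beta_new
            break
        beta = beta_new
    return beta, b
```

In the paper's actual implementation, only the S1 samples enter the linear solve, which is what keeps each IRWLS call cheap inside the chunking scheme.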
Those support vectors in S_in can be seen as anchored samples in S3, because their β_i is not zero and cannot be modified by the IRWLS procedure. Their contribution (G_in and G_bin) is therefore calculated in the same way as G13 and G_b3 (Table 1, step 5), before calling the IRWLS-SVC algorithm. We have already modified the IRWLS-SVC in Table 1 to consider G_in and G_bin, which must be set to zero if the Hessian matrix, H, fits into memory for the whole set of training samples. \n\nThe resolution of the SVC for large training data sets, employing the IRWLS procedure as minimization engine, is summarized in the following steps: \n\n1. Select the samples that will form the working set. \n2. Construct G_in = (H)_{S_w,S_in}(β)_{S_in} and G_bin = −(y)ᵀ_{S_in}(β)_{S_in}. \n3. Solve the IRWLS-SVC procedure, following the steps in Table 1. \n4. Compute the error of every training sample. \n5. If the stopping conditions \n\ny_i e_i < ε   ∀i | β_i = 0    (9) \ne_i y_i > −ε   ∀i | β_i = C    (10) \n|e_i y_i| < ε   ∀i | 0 < β_i < C    (11) \n\nare fulfilled, the SVC solution has been reached. \n\nThe stopping conditions are the ones proposed in [6], and ε must be a small value, around 10⁻³; a full discussion concerning this topic can be found in [6]. \n\n3 SAMPLE SELECTION STRATEGY \n\nThe selection of the training samples that will constitute the working set in each iteration is the most critical decision in any chunking scheme, because this decision directly determines the number of IRWLS-SVC (or QP-SVC) procedures to be called and the number of reproducing kernel evaluations to be made, which are, by far, the two most time-consuming operations in any chunking scheme. \n\nIn order to solve the SVC efficiently, we first need to define a candidate set of training samples to form the working set in each iteration. 
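The stopping conditions (9)-(11) amount to a handful of elementwise comparisons. A sketch (ours; `eps` is the tolerance ε, around 10⁻³):

```python
import numpy as np

def stopping_conditions_met(beta, e, y, C, eps=1e-3):
    """Check conditions (9)-(11): y_i e_i < eps where beta_i = 0,
    e_i y_i > -eps where beta_i = C, and |e_i y_i| < eps elsewhere."""
    ey = e * y
    at_zero = beta <= 0.0        # condition (9) applies
    at_C = beta >= C             # condition (10) applies
    inside = ~at_zero & ~at_C    # condition (11) applies
    return (np.all(ey[at_zero] < eps)
            and np.all(ey[at_C] > -eps)
            and np.all(np.abs(ey[inside]) < eps))
```

Samples for which the relevant condition fails are exactly the ones that feed the candidate set described next.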
The candidate set will be made up, necessarily, of all the training samples that violate the stopping conditions (9)-(11); we will also add all those training samples that satisfy condition (11) but for which a small variation in their error would make them violate it. \n\nThe strategies to select the working set are as numerous as the problems to be solved, but one can think of three simple strategies: \n\n• Select those samples which do not fulfil the stopping criteria and present the largest |e_i| values. \n• Select those samples which do not fulfil the stopping criteria and present the smallest |e_i| values. \n• Select them randomly from the ones that do not fulfil the stopping conditions. \n\nThe first strategy seems the most natural one, and it was proposed in [6]. If the samples with largest |e_i| are selected, we guarantee that the attained solution gives the greatest step towards the solution of (1). But if the step is too large, which usually happens, it will cause the solution in each iteration, and the β_i values, to oscillate around their optimal values. The magnitude of this effect is directly proportional to the values of C and q (the size of the working set), so in the case of small C (C < 10) and low q (q < 20) it is less noticeable. \n\nThe second one is the most conservative strategy, because we will be moving towards the solution of (1) with small steps. Its drawback is readily discerned if the starting point is inappropriate, needing too many iterations to reach the SVC solution. \n\nThe last strategy, which has been implemented together with the IRWLS-SVC procedure, is a mid-point between the other two, but if the number of samples with 0 < β_i < C grows beyond q there might be some iterations in which we make no progress (the working set would only be made up of training samples that fulfil the stopping condition (11)). 
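In its most naive form, the random strategy reduces to a uniform draw over the current violators. A sketch (ours; `violators` stands for a boolean mask produced by checking the stopping conditions on every sample):

```python
import numpy as np

def random_working_set(violators, q, rng):
    """Pick up to q samples uniformly at random among the indices that
    do not fulfil the stopping conditions (boolean mask `violators`)."""
    idx = np.flatnonzero(violators)
    if len(idx) <= q:
        return idx
    return rng.choice(idx, size=q, replace=False)
```

Note that when the candidate pool is dominated by samples with 0 < β_i < C that already satisfy condition (11), this naive picker can stall.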
This situation is easily avoided by introducing, per class, one sample that violates each of the stopping conditions. Finally, if the cardinality of the candidate set is less than q, the working set is completed with those samples that fulfil the stopping conditions and present the smallest |e_i|. \n\nIn summary, the sample selection strategy proposed is²: \n\n1. Construct the candidate set S_c with those samples that do not fulfil the stopping conditions (9) and (10), and those samples whose β obeys 0 < β_i < C. \n2. If |S_c| < q, go to 5. \n3. Choose one sample per class that violates each of the stopping conditions and move them from S_c to the working set S_w. \n4. Choose randomly q − |S_w| samples from S_c and move them to S_w. Go to step 6. \n5. Move every sample from S_c to S_w, and complete S_w with the q − |S_w| samples that fulfil the stopping conditions (9) and (10) and present the lowest |e_i| values. \n6. Go on, obtaining G_in and G_bin. \n\n4 BENCHMARK FOR THE IRWLS-SVC \n\nWe have prepared two different experiments to test both the IRWLS procedure and the sample selection strategy for solving the SVC. The first one compares the IRWLS against QP, and the second one compares the sample selection strategy, together with the IRWLS, against a complete SVC solving procedure, SVMlight. \n\nIn the first trial, we have replaced the LOQO interior point optimizer used by SVMlight version 3.02 [5] with the IRWLS-SVC procedure in Table 1, in order to compare both optimizing engines under an equal sample selection strategy. The comparison has been made on a Pentium III 450 MHz with 128 MB running Windows 98, and the programs have been compiled using Microsoft Developer 6.0. 
In Table 2, we show the results for two data sets: the first one, Adult, containing 4781 training samples, needs most of its CPU resources to compute the RKHS, and the second one, Splice, containing 2175 training samples, uses most of its CPU resources to solve the SVC for each S_w, where q indicates the size of the working set. \n\n            Adult 4781                      Splice 2175 \n      CPU time      Optimize time     CPU time      Optimize time \nq     LOQO   IRWLS  LOQO   IRWLS      LOQO   IRWLS  LOQO   IRWLS \n20    21.25  20.70  0.61   0.39       46.19  21.94  30.76  4.77 \n40    20.60  19.22  1.01   0.17       71.34  24.93  46.26  8.07 \n70    21.15  18.72  2.30   0.46       53.77  20.32  34.24  7.72 \n\nTable 2: CPU time indicates the time consumed, in seconds, by the whole procedure. Optimize time indicates the time consumed, in seconds, by the LOQO or IRWLS procedure. \n\nThe value of C has been set to 1 and 1000, respectively, and a Radial Basis Function (RBF) RKHS [2] has been employed, where its parameter σ has been set, respectively, to 10 and 70. \n\n²In what follows, |·| represents absolute value for numbers and cardinality for sets. \n\nAs can be seen, SVMlight with IRWLS is significantly faster than with the LOQO procedure in all cases. The kernel cache size has been set to 64 MB for both data sets and for both procedures. The results in Table 2 validate the IRWLS procedure as the fastest SVC solver. \n\nFor the second trial, we have compiled a computer program that uses the IRWLS-SVC procedure and the working set selection in Section 3; we will refer to it as svcradit from now on. We have borrowed the chunking and shrinking ideas from SVMlight [6] for our computer program. To test these two programs, several data sets have been used. The Adult and Web data sets have been obtained from J. 
Platt's web page http://research.microsoft.com/~jplatt/smo.html; the Gauss-M data set is a two-dimensional classification problem proposed in [3] to test neural networks, which comprises a Gaussian random variable for each class, the two classes highly overlapping. The Banana, Diabetes and Splice data sets have been obtained from Gunnar Rätsch's web page http://svm.first.gmd.de/~raetsch/. The selection of C and the RKHS has been done as indicated in [11] for the Adult and Web data sets, and as in http://svm.first.gmd.de/~raetsch/ for the Banana, Diabetes and Splice data sets. In Table 3, we show the runtime complexity for each data set, where the value of q has been selected as the one that minimizes the runtime. \n\nDatabase   Dim   N. Sampl.   C       σ    SV      q radit   q light   CPU radit   CPU light \nAdult6     123   11221       1       10   4477    150       40        118.20      124.46 \nAdult9     123   32562       1       10   12181   130       70        1093.29     1097.09 \nAdult1     123   1605        1000    10   630     100       10        25.98       113.54 \nWeb1       300   2477        5       10   224     100       10        2.42        2.36 \nWeb7       300   24693       5       10   1444    150       10        158.13      124.57 \nGauss-M    2     4000        1       1    1736    70        10        12.69       48.28 \nGauss-M    2     4000        100     1    1516    100       10        61.68       3053.20 \nBanana     2     400         316.2   1    80      40        70        0.33        0.77 \nBanana     2     4900        316.2   1    1084    70        40        22.46       1786.56 \nDiabetes   8     768         10      2    409     40        10        2.41        6.04 \nSplice     69    2175        1000    70   525     150       20        14.06       49.19 \n\nTable 3: Runtime complexity for several data sets, when solved with svcradit (radit for short) and SVMlight (light for short). \n\nOne can appreciate that svcradit is faster than SVMlight for most data sets. For the Web data set, the only one on which SVMlight is slightly faster, the value of C is low and most training samples end up as support vectors with β_i < C. 
In such cases the best strategy is to take the largest possible step towards the solution in every iteration, as SVMlight does [6], because most training samples' β_i will not be affected by the other samples' β_j values. But as the value of C increases, the svcradit sample selection strategy becomes much more appropriate than the one used in SVMlight. \n\n5 CONCLUSIONS \n\nIn this communication a new algorithm for solving the SVC for large training data sets has been presented. Its two major contributions deal with the optimizing engine and the sample selection strategy. An IRWLS procedure is used to solve the SVC in each step, which is much faster than the usual QP procedure, and simpler to implement, because its most difficult step is the solution of a linear equation system, which can be easily obtained by means of LU decomposition [12]. The random working set selection from the samples not fulfilling the KKT conditions is the best option if the working set is large, because it reduces the number of chunks to be solved. This strategy benefits from the IRWLS procedure, which allows working with large training data sets. All these modifications have been brought together in the svcradit solving procedure, publicly available at http://svm.tsc.uc3m.es/. \n\n6 ACKNOWLEDGEMENTS \n\nWe are sincerely grateful to Thorsten Joachims, who has allowed and encouraged us to use his SVMlight to test our IRWLS procedure; the comparisons could not have been properly done otherwise. \n\nReferences \n\n[1] B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In 5th Annual Workshop on Computational Learning Theory, Pittsburgh, U.S.A., 1992. \n[2] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. \n[3] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice-Hall, 1994. \n[4] P. W. 
Holland and R. E. Welsch. Robust regression using iteratively reweighted least-squares. Communications in Statistics - Theory and Methods, A6(9):813-827, 1977. \n[5] T. Joachims. http://www-ai.informatik.uni-dortmund.de/forschung/verfahren/svmlight/svmlight.eng.html. Technical report, University of Dortmund, Informatik, AI-Unit Collaborative Research Center on 'Complexity Reduction in Multivariate Data', 1998. \n[6] T. Joachims. Making Large Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning, editors Schölkopf, B., Burges, C. J. C. and Smola, A. J., pages 169-184. M.I.T. Press, 1999. \n[7] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Proc. of the 1997 IEEE Workshop on Neural Networks for Signal Processing, pages 276-285, Amelia Island, U.S.A., 1997. \n[8] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines. In ICPR'98, Brisbane, Australia, August 1998. \n[9] F. Pérez-Cruz, A. Navia-Vázquez, P. L. Alarcón-Diana, and A. Artés-Rodríguez. An IRWLS procedure for SVR. In Proceedings of EUSIPCO'00, Tampere, Finland, September 2000. \n[10] F. Pérez-Cruz, A. Navia-Vázquez, J. L. Rojo-Álvarez, and A. Artés-Rodríguez. A new training algorithm for support vector machines. In Proceedings of the Fifth Bayona Workshop on Emerging Technologies in Telecommunications, volume 1, pages 116-120, Baiona, Spain, September 1999. \n[11] J. C. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. In Advances in Kernel Methods - Support Vector Learning, editors Schölkopf, B., Burges, C. J. C. and Smola, A. J., pages 185-208. M.I.T. Press, 1999. \n[12] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. Cambridge University Press, Cambridge, UK, 2nd edition, 1994. \n[13] V. N. Vapnik. 
The Nature of Statistical Learning Theory. Springer-Verlag, 1995. \n[14] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998. \n", "award": [], "sourceid": 1855, "authors": [{"given_name": "Fernando", "family_name": "P\u00e9rez-Cruz", "institution": null}, {"given_name": "Pedro", "family_name": "Alarc\u00f3n-Diana", "institution": null}, {"given_name": "Angel", "family_name": "Navia-V\u00e1zquez", "institution": null}, {"given_name": "Antonio", "family_name": "Art\u00e9s-Rodr\u00edguez", "institution": null}]}