Breaking SVM Complexity with Cross-Training

Gökhan H. Bakır (Max Planck Institute for Biological Cybernetics, Tübingen, Germany; gb@tuebingen.mpg.de)
Léon Bottou (NEC Labs America, Princeton NJ, USA; leon@bottou.org)
Jason Weston (NEC Labs America, Princeton NJ, USA; jasonw@nec-labs.com)

Advances in Neural Information Processing Systems, pages 81-88.

Abstract

We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages.

1 Introduction

The number of Support Vectors (SVs) has a dramatic impact on the efficiency of Support Vector Machines (Vapnik, 1995) during both the learning and prediction stages. Recent results (Steinwart, 2004) indicate that the number k of SVs increases linearly with the number n of training examples. More specifically,

    k / n  -->  2 B_K                                                (1)

where n is the number of training examples and B_K is the smallest classification error achievable with the SVM kernel K. When using a universal kernel such as the Radial Basis Function kernel, B_K is the Bayes risk B, i.e. the smallest classification error achievable with any decision function.

The computational requirements of modern SVM training algorithms (Joachims, 1999; Chang and Lin, 2001) are very largely determined by the amount of memory required to store the active segment of the kernel matrix.
When this amount exceeds the available memory, the training time increases quickly because some kernel matrix coefficients must be recomputed multiple times. During the final phase of the training process, the active segment always contains all the k(k+1)/2 dot products between SVs. Steinwart's result (1) then suggests that the critical amount of memory scales at least like B^2 n^2. For instance, with n = 100000 examples and B = 0.1, equation (1) predicts about k = 20000 SVs, whose k(k+1)/2 kernel values alone occupy roughly 1.6 GB in double precision. This can be practically prohibitive for problems with either big training sets or large Bayes risk (noisy problems). Large numbers of SVs also penalize SVMs during the prediction stage, as the computation of the decision function requires a time proportional to the number of SVs.

When the problem is separable, i.e. B = 0, equation (1) suggests1 that the number k of SVs increases less than linearly with the number n of examples. This improves the scaling laws for the SVM computational requirements.

1 See also (Steinwart, 2004, remark 3.8).

In this paper, we propose to selectively remove examples from the training set using probabilistic estimates inspired by training set editing algorithms (Devijver and Kittler, 1982). The removal procedure aims at creating a separable set of training examples without modifying the location of the decision boundary. Making the problem separable breaks the linear dependency between the number of SVs and the number of training examples.

2 Related work

2.1 Salient facts about SVMs

We focus now on the C-SVM applied to the two-class pattern recognition problem. See (Burges, 1998) for a concise reference.
Given n training patterns x_i and their associated classes y_i = +-1, the SVM decision function is:

    f(x) = sum_{i=1}^{n} alpha*_i y_i K(x_i, x) + b*                 (2)

The coefficients alpha*_i in (2) are obtained by solving a quadratic programming problem:

    alpha* = arg max_alpha  sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)
    subject to  for all i, 0 <= alpha_i <= C  and  sum_i alpha_i y_i = 0        (3)

This optimization yields three categories of training examples depending on alpha*_i. Within each category, the possible values of the margins y_i f(x_i) are prescribed by the Karush-Kuhn-Tucker optimality conditions.

- Examples such that alpha*_i = C are called bouncing SVs or margin errors and satisfy y_i f(x_i) < 1. The set of bouncing SVs includes all training examples misclassified by the SVM, i.e. those which have a negative margin y_i f(x_i) < 0.

- Examples such that 0 < alpha*_i < C are called ordinary SVs and satisfy y_i f(x_i) = 1.

- Examples such that alpha*_i = 0 satisfy the relation y_i f(x_i) > 1. These examples play no role in the SVM decision function (2). Retraining after discarding these examples would still yield the same SVM decision function (2).

These facts provide some insight into Steinwart's result (1). The SVM decision function, like any other decision rule, must asymptotically misclassify at least Bn examples, where B is the Bayes risk. All these examples must therefore become bouncing SVs.

To illustrate the dependence on the Bayes risk, we perform a linear classification task in two dimensions under varying amounts of class overlap. The class distributions were uniform on a unit square with centers c1 and c2. Varying the distance between c1 and c2 allows us to control the Bayes risk.
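The two-dimensional illustration above, together with the three KKT categories, can be sketched in code. The following is a hypothetical scikit-learn reconstruction, not the authors' implementation; the center distance d and C = 1 are our own arbitrary choices:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 1000, 0.6          # d = distance between the unit-square centers

# Two classes, uniform on overlapping unit squares; shrinking d raises the Bayes risk.
X = np.vstack([rng.uniform(0, 1, (n // 2, 2)),
               rng.uniform(0, 1, (n // 2, 2)) + [d, 0.0]])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

svm = SVC(kernel="linear", C=1.0).fit(X, y)

alpha = np.abs(svm.dual_coef_[0])            # alpha*_i of the support vectors
margins = y * svm.decision_function(X)       # margins y_i f(x_i)

bouncing = np.isclose(alpha, svm.C)          # alpha*_i = C: margin errors
ordinary = ~bouncing                         # 0 < alpha*_i < C: y_i f(x_i) = 1
print("bouncing SVs:", int(bouncing.sum()),
      "ordinary SVs:", int(ordinary.sum()),
      "non-SVs:", n - len(svm.support_))
```

Because the class squares overlap, a sizable fraction of examples fall on the wrong side of the Bayes boundary and end up as bouncing SVs, as the KKT discussion predicts.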
The results are shown in figure 1.

2.2 A posteriori reduction of the number of SVs

Several techniques aim to reduce the prediction complexity of SVMs by expressing the SVM solution (2) with a smaller kernel expansion. Since one must compute the SVM solution before applying these post-processing techniques, they are not suitable for reducing the complexity of the training stage.

Reduced Set Construction. Burges (Burges, 1996) proposes to construct new patterns z_j in order to define a compact approximation of the decision function (2). Reduced set construction usually involves solving a non-convex optimization problem and is not applicable on arbitrary inputs such as graphs or strings.

[Figure 1: number of SVs (log scale), distinguishing SVs with alpha = C from the other SVs.]

[...] P(y = -1 | x) > 0. A small number of training examples of class y = -1 can still appear in such a region. We say that they are located on the wrong side of the Bayes decision boundary. Asymptotically, all such training examples belong to the condensed training set in order to ensure that they are properly recognized as members of class y = -1.

Removing noise examples. The Edited Nearest Neighbor rule (Wilson, 1972) suggests to first discard all training examples that are misclassified when applying the 1-NN rule using all n - 1 remaining examples as the training set. It was shown that removing these examples improves the asymptotic performance of the nearest neighbor rule.
Whereas the 1-NN risk is asymptotically bounded by 2B, the Edited 1-NN risk is asymptotically bounded by 1.2 B, where B is the Bayes risk.

The MULTIEDIT algorithm (Devijver and Kittler, 1982, section 3.11) asymptotically discards all the training examples located on the wrong side of the Bayes decision boundary. The asymptotic risk of the multi-edited nearest neighbor rule is the Bayes risk B.

Algorithm 2 (MULTIEDIT).
1 Divide randomly the training data into s splits S_1, ..., S_s. Let us call f_i the 1-NN classifier that uses S_i as the training set.
2 Classify all examples in S_i using the classifier f_{(i+1) mod s} and discard all misclassified examples.
3 Gather all the remaining examples and return to step 1 if any example has been discarded during the last T iterations.
4 The remaining examples constitute the multiedited training set.

By discarding examples located on the wrong side of the Bayes decision boundary, algorithm MULTIEDIT constructs a new training set whose apparent distribution has the same Bayes decision boundary as the original problem, but with a Bayes risk equal to 0. Devijver and Kittler claim that MULTIEDIT produces an ideal training set for CONDENSE.

Algorithm MULTIEDIT also discards some proportion of training examples located on the correct side of the Bayes decision boundary. Asymptotically this does not matter. However, this is often a problem in practice...

2.4 Editing algorithms and SVMs

Training examples recognized with high confidence usually do not appear in the SVM solution (2) because they do not become support vectors. On the other hand, outliers always become support vectors. Intuitively, SVMs display the properties of CONDENSE but lack the properties of the MULTIEDIT algorithm.

The mathematical proofs for the asymptotic properties of MULTIEDIT depend on the specific nature of the 1-NN classification rule.
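Algorithm 2 (MULTIEDIT) is short enough to sketch directly. The code below is a minimal, hypothetical rendering using scikit-learn's 1-NN classifier; the split count s, the patience T and the toy data are our own choices, not values from the paper:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def multiedit(X, y, s=3, T=5, seed=0):
    """Classify each split with the 1-NN classifier built on the next split,
    discard the mistakes, and repeat until no example has been discarded
    for T consecutive iterations (a sketch of MULTIEDIT)."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    idle = 0
    while idle < T and len(X) >= 2 * s:
        splits = np.array_split(rng.permutation(len(X)), s)
        keep = []
        for i in range(s):
            train = splits[(i + 1) % s]                    # classifier f_{(i+1) mod s}
            f = KNeighborsClassifier(n_neighbors=1).fit(X[train], y[train])
            ok = f.predict(X[splits[i]]) == y[splits[i]]   # keep only the correctly
            keep.append(splits[i][ok])                     # classified examples
        keep = np.concatenate(keep)
        idle = idle + 1 if len(keep) == len(X) else 0
        X, y = X[keep], y[keep]
    return X, y

# Toy usage: two heavily overlapping 1-D Gaussians.
g = np.random.default_rng(1)
X0 = np.vstack([g.normal(-1, 1.5, (200, 1)), g.normal(1, 1.5, (200, 1))])
y0 = np.hstack([np.zeros(200), np.ones(200)])
Xe, ye = multiedit(X0, y0)
print(len(X0), "->", len(Xe), "examples after multiediting")
```

On the overlapping toy data, the retained set is visibly smaller: examples near and beyond the Bayes boundary are the ones most often misclassified across splits and hence discarded.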
The MULTIEDIT algorithm itself could be identically defined for any classifier. This suggests (but does not prove) that these properties might remain valid for SVM classifiers3.

This contribution is an empirical attempt to endow Support Vector Machines with the properties of the MULTIEDIT algorithm.

Editing SVM training sets implicitly modifies the SVM loss function in a way that relates to robust statistics. Editing alters the apparent distribution of training examples such that the class distributions P(x | y = 1) and P(x | y = -1) no longer overlap. If the class distributions were known, this could be done by trimming the tails of the class distributions. A similar effect could be obtained by altering the SVM loss function (the hinge loss) into a non-convex loss function that gives less weight to outliers.

3 Cross-Training

Cross-Training is a representative algorithm of such combinations of SVMs and editing algorithms. It begins with creating s subsets of the training set with r examples each. Independent SVMs are then trained on each subset. The decision functions of these SVMs are then used to discard two types of training examples: those which are confidently recognized, as in CONDENSE, and those which are misclassified, as in MULTIEDIT.
A final SVM is then trained using the remaining examples.

Algorithm 3 (CROSSTRAINING).
1 Create s subsets of size r by randomly drawing r/2 examples of each class.
2 Train s independent SVMs f_1, ..., f_s using each of the subsets as the training set.
3 For each training example (x_i, y_i) estimate the margin average m_i and variance v_i:

    m_i = (1/s) sum_{r=1}^{s} y_i f_r(x_i)        v_i = (1/s) sum_{r=1}^{s} (m_i - y_i f_r(x_i))^2

4 Discard all training examples for which m_i + v_i < 0.
5 Discard all training examples for which m_i - v_i > 1.
6 Train a final SVM on the remaining training examples.

The apparent simplicity of this algorithm hides a lot of hyperparameters. The value of the C parameter for the SVMs at steps [2] and [6] has a considerable effect on the overall performance of the algorithm. For the first stage SVMs, we choose the C parameter which yields the best performance on training sets of size r. For the second stage SVM, we choose the C parameter which yields the best overall performance measured on a separate validation set.

Furthermore, we discovered that the discarding steps tend to produce a final set of training examples with very different numbers of examples for each class.
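Steps 1-6 of Algorithm 3 can be sketched compactly. This is hypothetical scikit-learn code, not the authors' implementation; the linear kernel, the single shared C and the toy data are simplifying assumptions of ours:

```python
import numpy as np
from sklearn.svm import SVC

def cross_training(X, y, s=5, r=200, C=1.0, seed=0):
    """Sketch of CROSSTRAINING for labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y > 0), np.flatnonzero(y < 0)
    # Steps 1-2: train s independent SVMs on class-balanced subsets of size r.
    fs = []
    for _ in range(s):
        idx = np.concatenate([rng.choice(pos, r // 2, replace=False),
                              rng.choice(neg, r // 2, replace=False)])
        fs.append(SVC(kernel="linear", C=C).fit(X[idx], y[idx]))
    # Step 3: margin mean m_i and variance v_i across the s first-stage SVMs.
    M = np.stack([y * f.decision_function(X) for f in fs])   # shape (s, n)
    m, v = M.mean(axis=0), M.var(axis=0)
    # Steps 4-5: drop confident mistakes (m+v < 0) and confident successes (m-v > 1).
    keep = (m + v >= 0) & (m - v <= 1)
    # Step 6: final SVM on the remaining examples.
    return SVC(kernel="linear", C=C).fit(X[keep], y[keep]), keep

# Toy usage: two overlapping Gaussian clouds in 5 dimensions.
g = np.random.default_rng(2)
Xc = np.vstack([g.normal(-1, 2.0, (500, 5)), g.normal(1, 2.0, (500, 5))])
yc = np.hstack([-np.ones(500), np.ones(500)])
final, keep = cross_training(Xc, yc, s=5, r=100)
print("kept", int(keep.sum()), "of", len(yc), "examples;",
      len(final.support_), "SVs in the final SVM")
```

Only the examples whose averaged margin lies in the uncertain band survive the two discarding steps, so the final SVM is trained on a much smaller, nearly separable set.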
Specific measures to alleviate this problem are discussed in section 4.3.

3 Further comfort comes from the knowledge that a SVM with the RBF kernel and without threshold term b implements the 1-NN rule when the RBF radius tends to zero.

Figure 3: Comparing LIBSVM and Cross-Training on a toy problem of two Gaussian clouds for increasing number of training points. Cross-Training gives an almost constant number of support vectors (left figure) for increasing training set size, whereas in LIBSVM the number of support vectors increases linearly. The error rates behave similarly (middle figure), and Cross-Training gives an improved training time (right figure). See section 4.1.

4 Experiments

4.1 Artificial Data

We first constructed artificial data by generating two classes from two Gaussian clouds in 10 dimensions with means (1, 1, 1, 1, 1, 0, 0, 0, 0, 0) and (-1, -1, -1, -1, -1, 0, 0, 0, 0, 0) and standard deviation 4. We trained a linear SVM for differing amounts of training points, selecting C via cross validation. We compare the performance of LIBSVM4 with Cross-Training using LIBSVM with s = 5, averaging over 10 splits.
The results given in figure 3 show a reduction in SVs and computation time using Cross-Training, with no loss in accuracy.

4.2 Artificial Noise

Our second experiment involves the discrimination of digits 3 and 8 in the MNIST5 database. Artificial noise was introduced by swapping the labels of 0%, 5%, 10% and 15% of the examples. There are 11982 training examples and 1984 testing examples. All experiments were carried out using LIBSVM's nu-SVM (Chang and Lin, 2001) with the RBF kernel (gamma = 0.005). Cross-Training was carried out by splitting the 11982 training examples into 5 subsets. Figure 4 reports our results for various amounts of label noise. The number of SVs (left figure) increases linearly for the standard SVM and stays constant for the Cross-Training SVM. The test errors (middle figure) seem similar. Since our label noise is artificial, we can also measure the misclassification rate on the unmodified testing set (right figure). This measurement shows a slight loss of accuracy without statistical significance.

4.3 Benchmark Data

Finally, the cross-training algorithm was applied to real data sets from both the ANU repository6 and from the UCI repository7. Experimental results were quite disappointing until we realized that the discarding steps tend to produce training sets with very different numbers of examples for each class.
To alleviate this problem, after training each SVM, we choose the value of the threshold b* in (2) which achieves the best validation performance. We also attempt to balance the final training set by re-inserting examples discarded during step [5] of the cross-training algorithm.

4 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
5 http://yann.lecun.com/exdb/mnist
6 http://mlg.anu.edu.au/~raetsch/data/index.html
7 ftp://ftp.ics.uci.edu/pub/machine-learning-databases

Figure 4: Number of SVs (left figure) and test error (middle figure) for varying amounts of label noise on the MNIST 3-8 discrimination task. The x-axis in all graphs shows the amount of label noise; white squares correspond to LIBSVM; black circles to Cross-Training; dashed lines to bagging the first stage Cross-Training SVMs. The last graph (right figure) shows the test error measured without label noise. See section 4.2.

Experiments were carried out using RBF kernels with the kernel width reported in the literature. In the SVM experiments, the value of parameter C was determined by cross-validation and then used for training a SVM on the full dataset. In the cross-training experiments, we make a validation set by taking r/3 examples from the training set.
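The per-SVM threshold adjustment described above can be sketched as follows. The helper and its candidate grid are our own hypothetical construction (the paper does not specify how b* is searched); labels are assumed to be in {-1, +1}:

```python
import numpy as np
from sklearn.svm import SVC

def tune_threshold(svm, Xval, yval):
    """Scan candidate shifts of the SVM threshold and keep the one with the
    best validation accuracy. Candidates are the midpoints between
    consecutive sorted validation scores, plus 0 (the untouched threshold)."""
    scores = svm.decision_function(Xval)
    g = np.sort(scores)
    cands = np.concatenate([[0.0], (g[:-1] + g[1:]) / 2])
    accs = [np.mean(np.where(scores > t, 1.0, -1.0) == yval) for t in cands]
    return cands[int(np.argmax(accs))]

# Toy usage: a class-imbalanced training set biases the learned threshold;
# re-tuning it on a balanced validation set repairs that bias.
g = np.random.default_rng(3)
Xtr = np.vstack([g.normal(-1, 1, (30, 2)), g.normal(1, 1, (300, 2))])
ytr = np.hstack([-np.ones(30), np.ones(300)])
Xva = np.vstack([g.normal(-1, 1, (100, 2)), g.normal(1, 1, (100, 2))])
yva = np.hstack([-np.ones(100), np.ones(100)])

clf = SVC(kernel="linear", C=1.0).fit(Xtr, ytr)
t = tune_threshold(clf, Xva, yva)
acc0 = np.mean(np.where(clf.decision_function(Xva) > 0, 1.0, -1.0) == yva)
acc1 = np.mean(np.where(clf.decision_function(Xva) > t, 1.0, -1.0) == yva)
print(f"validation accuracy {acc0:.3f} -> {acc1:.3f} after re-tuning the threshold")
```

Since the zero shift is always among the candidates, the tuned threshold can never do worse than the original one on the validation set.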
These examples are only used for choosing the values of C and for adjusting the SVM thresholds. Details and source code are available8.

Dataset    Train Size  Test Size  SVM Perf.[%]  SVM #SV  XTrain Subsets  XTrain Perf.[%]  XTrain #SV
Banana     400         4900       89.0          111      5x200           88.2             51
Waveform   400         4600       90.2          172      5x200           88.7             87
Splice     1000        2175       90.0          601      5x300           89.9             522
Adult      3185        16280      84.2          1207     5x700           84.2             606
Adult      32560       16280      85.1          11325    5x6000          84.8             1194
Forest     50000       58100      90.3          12476    5x10000         89.2             7967
Forest     90000       58100      91.6          18983    5x18000         90.7             13023
Forest     200000      58100      n/a           n/a      8x30000         92.1             19526

Table 1: Comparison of SVM and Cross-Training results on standard benchmark data sets.

The columns in table 1 contain the dataset name, the size of the training set used for the experiment, the size of the test set, the SVM accuracy and number of SVs, the Cross-Training subset configuration, accuracy, and final number of SVs. Bold typeface indicates which differences were statistically significant according to a paired test. These numbers should be considered carefully because they are impacted by the discrete nature of the grid search for parameter C. The general trend still indicates that Cross-Training causes a slight loss of accuracy but requires many fewer SVs.

Our largest training set contains 200000 examples. Training a standard SVM on such a set takes about one week of computation. We do not report this result because it was not practical to determine a good value of C for this experiment. Cross-Training with specified hyperparameters runs overnight.
Cross-Training with hyperparameter grid searches runs in two days.

We do not report detailed timing results because much of the actual time can be attributed to the search for the proper hyperparameters. Timing results would then depend on loosely controlled details of the hyperparameter grid search algorithms.

8 http://www.kyb.tuebingen.mpg.de/bs/people/gb/xtraining

5 Discussion

We have suggested combining SVMs and training set editing techniques to break the linear relationship between the number of support vectors and the number of examples. Such combinations raise interesting theoretical questions regarding the relative value of each of the training examples.

Experiments with a representative algorithm, namely Cross-Training, confirm that both the training and the recognition time are sharply reduced. On the other hand, Cross-Training causes a minor loss of accuracy, comparable to that of reduced set methods (Burges, 1996), and seems to be more sensitive than SVMs in terms of parameter tuning.

Despite these drawbacks, Cross-Training provides a practical means to construct kernel classifiers with significantly larger training sets.

6 Acknowledgements

We thank Hans Peter Graf, Eric Cosatto and Vladimir Vapnik for their advice and support. Part of this work was funded by NSF grant CCR-0325463.

References

Burges, C. J. C. (1996). Simplified Support Vector Decision Rules. In Saitta, L., editor, Proceedings of the 13th International Conference on Machine Learning, pages 71-77, San Mateo, CA. Morgan Kaufmann.

Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121-167.

Chang, C.-C. and Lin, C.-J. (2001). Training nu-Support Vector Classifiers: Theory and Algorithms. Neural Computation, 13(9):2119-2147.

Devijver, P. and Kittler, J. (1982). Pattern Recognition: A Statistical Approach.
Prentice Hall, Englewood Cliffs.

Downs, T., Gates, K. E., and Masters, A. (2001). Exact Simplification of Support Vector Solutions. Journal of Machine Learning Research, 2:293-297.

Hart, P. (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14:515-516.

Joachims, T. (1999). Making Large-Scale SVM Learning Practical. In Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors, Advances in Kernel Methods - Support Vector Learning, pages 169-184, Cambridge, MA. MIT Press.

Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press, Cambridge, MA.

Steinwart, I. (2004). Sparseness of Support Vector Machines - Some Asymptotically Sharp Bounds. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York.

Wilson, D. L. (1972). Asymptotic properties of the nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 2:408-420.