{"title": "Incremental and Decremental Support Vector Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 409, "page_last": 415, "abstract": null, "full_text": "Incremental and Decremental Support Vector \n\nMachine Learning \n\nGert Cauwenberghs* \n\nCLSP, ECE Dept. \n\nJohns Hopkins University \n\nBaltimore, MD 21218 \n\ngert@jhu.edu \n\nTomaso Poggio \nCBCL, BCS Dept. \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02142 \n\ntp@ai.mit.edu \n\nAbstract \n\nAn on-line recursive algorithm for training support vector machines, one \nvector at a time, is presented. Adiabatic increments retain the Kuhn(cid:173)\nTucker conditions on all previously seen training data, in a number \nof steps each computed analytically. The incremental procedure is re(cid:173)\nversible, and decremental \"unlearning\" offers an efficient method to ex(cid:173)\nactly evaluate leave-one-out generalization performance. Interpretation \nof decremental unlearning in feature space sheds light on the relationship \nbetween generalization and geometry of the data. \n\n1 \n\nIntroduction \n\nTraining a support vector machine (SVM) requires solving a quadratic programming (QP) \nproblem in a number of coefficients equal to the number of training examples. For very \nlarge datasets, standard numeric techniques for QP become infeasible. Practical techniques \ndecompose the problem into manageable subproblems over part of the data [7, 5] or, in the \nlimit, perform iterative pairwise [8] or component-wise [3] optimization. A disadvantage \nof these techniques is that they may give an approximate solution, and may require many \npasses through the dataset to reach a reasonable level of convergence. An on-line alterna(cid:173)\ntive, that formulates the (exact) solution for \u00a3 + 1 training data in terms of that for \u00a3 data and \none new data point, is presented here. 
The incremental procedure is reversible, and decremental "unlearning" of each training sample produces an exact leave-one-out estimate of generalization performance on the training set.

2 Incremental SVM Learning

Training an SVM "incrementally" on new data by discarding all previous data except their support vectors gives only approximate results [11]. In what follows we consider incremental learning as an exact on-line method to construct the solution recursively, one point at a time. The key is to retain the Kuhn-Tucker (KT) conditions on all previously seen data, while "adiabatically" adding a new data point to the solution.

2.1 Kuhn-Tucker conditions

In SVM classification, the optimal separating function reduces to a linear combination of kernels on the training data, f(x) = Σ_j α_j y_j K(x_j, x) + b, with training vectors x_i and corresponding labels y_i = ±1. In the dual formulation of the training problem, the coefficients α_i are obtained by minimizing a convex quadratic objective function under constraints [12]:

    min_{0 ≤ α_i ≤ C} : W = ½ Σ_{i,j} α_i Q_{ij} α_j − Σ_i α_i + b Σ_i y_i α_i,

with Lagrange multiplier (and offset) b, and with symmetric positive-definite kernel matrix Q_{ij} = y_i y_j K(x_i, x_j). The first-order conditions on W reduce to the Kuhn-Tucker conditions

    g_i = ∂W/∂α_i = Σ_j Q_{ij} α_j + y_i b − 1   { > 0 if α_i = 0;  = 0 if 0 < α_i < C;  < 0 if α_i = C },
    ∂W/∂b = Σ_j y_j α_j = 0,

which partition the training data D into three categories, as illustrated in Figure 1: margin support vectors S (g_i = 0), error support vectors E (g_i < 0, α_i = C), and the remaining vectors R (g_i > 0, α_i = 0).

*On sabbatical leave at CBCL in MIT while this work was performed.

Figure 1: Soft-margin classification SVM training.

[...]

Algorithm 1 (Incremental Learning, ℓ → ℓ + 1)

1. Initialize α_c to zero;
2. If g_c > 0, terminate (c is not a margin or error vector);
3.
If g_c ≤ 0, apply the largest possible increment α_c so that (the first) one of the following conditions occurs:
   (a) g_c = 0: Add c to margin set S, update R accordingly, and terminate;
   (b) α_c = C: Add c to error set E, and terminate;
   (c) Elements of D^ℓ migrate across S, E, and R ("bookkeeping," section 2.3): Update membership of elements and, if S changes, update R accordingly;
and repeat as necessary.

The incremental procedure is illustrated in Figure 2. Old vectors, from previously seen training data, may change status along the way, but the process of adding the training data c to the solution converges in a finite number of steps.

2.6 Practical considerations

The trajectory of an example incremental training session is shown in Figure 3. The algorithm yields results identical to those at convergence using other QP approaches [7], with comparable speeds on various datasets ranging up to several thousands of training points¹. A practical on-line variant for larger datasets is obtained by keeping track only of a limited set of "reserve" vectors, R = {i ∈ D | 0 < g_i < t}, and discarding all data for which g_i ≥ t. For small t, this implies a small overhead in memory over S and E. The larger t, the smaller the probability of missing a future margin or error vector in previous data. The resulting storage requirements are dominated by those for the inverse Jacobian R, which scale as ℓ_S², where ℓ_S is the number of margin support vectors, #S.

3 Decremental "Unlearning"

Leave-one-out (LOO) is a standard procedure for predicting the generalization power of a trained classifier, from both a theoretical and an empirical perspective [12]. It is naturally implemented by decremental unlearning, adiabatic reversal of incremental learning, on each of the training data from the full trained solution. Similar (but different) bookkeeping of elements migrating across S, E and R applies as in the incremental case.
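The sets S, E, and R that the bookkeeping steps above maintain are determined entirely by the margin values g_i. The following is a minimal numerical sketch (illustrative Python, not the authors' Matlab implementation) that recovers this partition from a given trained soft-margin solution {α_i, b}, assuming a Gaussian kernel; the function names `gaussian_kernel` and `kt_partition` are invented for this example.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kt_partition(alpha, b, X, y, C, sigma=1.0, tol=1e-8):
    """Partition training indices by the Kuhn-Tucker conditions:
        g_i = 0, 0 < alpha_i < C  -> margin support vector (set S)
        g_i < 0, alpha_i = C      -> error support vector (set E)
        g_i > 0, alpha_i = 0      -> remaining vector (set R)
    where g_i = sum_j Q_ij alpha_j + y_i b - 1 and Q_ij = y_i y_j K(x_i, x_j).
    """
    Q = (y[:, None] * y[None, :]) * gaussian_kernel(X, X, sigma)
    g = Q @ alpha + y * b - 1.0
    S = np.where(np.abs(g) <= tol)[0]                 # on the margin
    E = np.where((g < -tol) & (alpha >= C - tol))[0]  # exceeding the margin
    R = np.where((g > tol) & (alpha <= tol))[0]       # safely inside
    return g, S, E, R
```

For two symmetric points x = (±1, 0) with opposite labels, b = 0 and α = 1/(1 − e⁻²) for both puts both points exactly on the margin, and the sketch classifies both as margin support vectors.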
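The memory-limited on-line variant of section 2.6 amounts to a simple filter on the margin values. A sketch, assuming the g_i have already been computed and using the threshold name t from the text (the function name `prune_reserve` is invented here):

```python
import numpy as np

def prune_reserve(g, t):
    """Keep only 'reserve' vectors with 0 < g_i < t; discard data with
    g_i >= t, which are unlikely to become margin or error vectors later.
    Margin vectors (g_i = 0) and error vectors (g_i < 0) are not affected:
    they remain tracked in S and E. Returns (kept, discarded) index arrays."""
    g = np.asarray(g, dtype=float)
    keep = np.where((g > 0.0) & (g < t))[0]
    drop = np.where(g >= t)[0]
    return keep, drop
```

Smaller t keeps the memory overhead beyond S and E small; larger t reduces the chance of discarding a point that would later have entered S or E.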
¹ Matlab code and data are available at http://bach.ece.jhu.edu/pub/gert/svm/incremental.

Figure 3: Trajectory of coefficients α_i as a function of iteration step during training, for ℓ = 100 non-separable points in two dimensions, with C = 10, and using a Gaussian kernel with σ = 1. The data sequence is shown on the left.

Figure 4: Leave-one-out (LOO) decremental unlearning (α_c → 0) for estimating generalization performance, directly on the training data. g_c^{\c} < −1 reveals a LOO classification error.

3.1 Leave-one-out procedure

Let ℓ → ℓ − 1 by removing point c (margin or error vector) from D: D^{\c} = D \ {c}. The solution {α_i^{\c}, b^{\c}} is expressed in terms of {α_i, b}, R and the removed point x_c, y_c. The solution yields g_c^{\c}, which determines whether leaving c out of the training set generates a classification error (g_c^{\c} < −1). Starting from the full ℓ-point solution:

Algorithm 2 (Decremental Unlearning, ℓ → ℓ − 1, and LOO Classification)

1. If c is not a margin or error vector: Terminate, "correct" (c is already left out, and correctly classified);
2. If c is a margin or error vector with g_c < −1: Terminate, "incorrect" (by default as a training error);
3. If c is a margin or error vector with g_c ≥ −1, apply the largest possible decrement α_c so that (the first) one of the following conditions occurs:
   (a) g_c < −1: Terminate, "incorrect";
   (b) α_c = 0: Terminate, "correct";
   (c) Elements of D^{\c} migrate across S, E, and R: Update membership of elements and, if S changes, update R accordingly;
and repeat as necessary.

The leave-one-out procedure is illustrated in Figure 4.
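The decision logic of Algorithm 2 can be sketched as follows. This is an illustrative Python fragment, not the authors' implementation: it assumes a decremental solver elsewhere supplies the unlearned margin value g_c^{\c} when the decrement actually has to be carried out, and the function name `loo_outcome` is invented here.

```python
def loo_outcome(in_solution, g_c, g_c_loo=None):
    """Leave-one-out verdict for training point c (Algorithm 2).

    in_solution : True if c is a margin or error vector of the full solution.
    g_c         : margin value of c in the full l-point solution.
    g_c_loo     : margin value g_c^{\\c} after decrementing alpha_c to zero
                  (needed only when the decrement must be carried out).
    Returns 'correct' or 'incorrect' as the LOO prediction for c.
    """
    if not in_solution:
        # Step 1: c is already left out of the solution, and correctly classified.
        return "correct"
    if g_c < -1.0:
        # Step 2: c is a training error by default.
        return "incorrect"
    # Step 3: unlearn c (alpha_c -> 0) and inspect the resulting margin.
    return "incorrect" if g_c_loo < -1.0 else "correct"
```

Summing the "incorrect" verdicts over all training points yields the exact LOO error count without retraining ℓ separate machines.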