{"title": "Incremental and Decremental Support Vector Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 409, "page_last": 415, "abstract": null, "full_text": "
Incremental and Decremental Support Vector Machine Learning

Gert Cauwenberghs*
CLSP, ECE Dept.
Johns Hopkins University
Baltimore, MD 21218
gert@jhu.edu

Tomaso Poggio
CBCL, BCS Dept.
Massachusetts Institute of Technology
Cambridge, MA 02142
tp@ai.mit.edu

* On sabbatical leave at CBCL in MIT while this work was performed.

Abstract

An on-line recursive algorithm for training support vector machines, one vector at a time, is presented. Adiabatic increments retain the Kuhn-Tucker conditions on all previously seen training data, in a number of steps each computed analytically. The incremental procedure is reversible, and decremental "unlearning" offers an efficient method to exactly evaluate leave-one-out generalization performance. Interpretation of decremental unlearning in feature space sheds light on the relationship between generalization and the geometry of the data.

1 Introduction

Training a support vector machine (SVM) requires solving a quadratic programming (QP) problem in a number of coefficients equal to the number of training examples. For very large datasets, standard numeric techniques for QP become infeasible. Practical techniques decompose the problem into manageable subproblems over part of the data [7, 5] or, in the limit, perform iterative pairwise [8] or component-wise [3] optimization. A disadvantage of these techniques is that they may give an approximate solution, and may require many passes through the dataset to reach a reasonable level of convergence. An on-line alternative, which formulates the (exact) solution for ℓ + 1 training data in terms of that for ℓ data and one new data point, is presented here. The incremental procedure is reversible, and decremental "unlearning" of each training sample produces an exact leave-one-out estimate of generalization performance on the training set.

2 Incremental SVM Learning

Training an SVM "incrementally" on new data by discarding all previous data except their support vectors gives only approximate results [11]. In what follows we consider incremental learning as an exact on-line method to construct the solution recursively, one point at a time. The key is to retain the Kuhn-Tucker (KT) conditions on all previously seen data, while "adiabatically" adding a new data point to the solution.

2.1 Kuhn-Tucker conditions

In SVM classification, the optimal separating function reduces to a linear combination of kernels on the training data, f(x) = Σ_j α_j y_j K(x_j, x) + b, with training vectors x_i and corresponding labels y_i = ±1.
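As a concrete illustration of this separating function, the following is a minimal Python sketch (not from the paper) that evaluates f(x) for a given set of training vectors, labels, and coefficients. A Gaussian kernel is assumed; the names X, y, alpha, and rbf_kernel are illustrative.

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    """Gaussian kernel K(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

def decision_function(x, X, y, alpha, b, kernel=rbf_kernel):
    """SVM separating function f(x) = sum_j alpha_j y_j K(x_j, x) + b."""
    return sum(a_j * y_j * kernel(x_j, x)
               for a_j, y_j, x_j in zip(alpha, y, X)) + b
```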
In the dual formulation of the training problem, the coefficients α_i are obtained by minimizing a convex quadratic objective function under constraints [12]:

    \min_{0 \le \alpha_i \le C}: \quad W = \frac{1}{2}\sum_{i,j} \alpha_i Q_{ij} \alpha_j - \sum_i \alpha_i + b \sum_i y_i \alpha_i    (1)

with Lagrange multiplier (and offset) b, and with symmetric positive-definite kernel matrix Q_ij = y_i y_j K(x_i, x_j). The first-order conditions on W reduce to the Kuhn-Tucker (KT) conditions:

    g_i \equiv \frac{\partial W}{\partial \alpha_i} = \sum_j Q_{ij}\alpha_j + y_i b - 1 = y_i f(x_i) - 1 \;\; \begin{cases} \ge 0, & \alpha_i = 0 \\ = 0, & 0 < \alpha_i < C \\ \le 0, & \alpha_i = C \end{cases}    (2)

    \frac{\partial W}{\partial b} = \sum_j y_j \alpha_j = 0    (3)

which partition the training data D and the corresponding coefficients {α_i, b}, i = 1, ..., ℓ, into three categories as illustrated in Figure 1 [9]: the set S of margin support vectors strictly on the margin (y_i f(x_i) = 1), the set E of error support vectors exceeding the margin (not necessarily misclassified), and the remaining set R of (ignored) vectors within the margin.

[Figure 1: Soft-margin classification SVM training, illustrating the three categories: margin support vectors (g_i = 0, 0 < α_i < C), error vectors (g_i < 0, α_i = C), and ignored vectors (g_i > 0, α_i = 0).]

2.2 Adiabatic increments

The margin vector coefficients change value during each incremental step to keep all elements in D in equilibrium, i.e., to keep their KT conditions satisfied. In particular, the KT conditions are expressed differentially as:

    \Delta g_i = Q_{ic}\,\Delta\alpha_c + \sum_{j \in S} Q_{ij}\,\Delta\alpha_j + y_i\,\Delta b, \quad \forall i \in D \cup \{c\}    (4)

    0 = y_c\,\Delta\alpha_c + \sum_{j \in S} y_j\,\Delta\alpha_j    (5)

where α_c is the coefficient being incremented, initially zero, of a "candidate" vector outside D. Since g_i ≡ 0 for the margin vector working set S = {s_1, ..., s_{ℓ_S}}, the changes in coefficients must satisfy

    Q \cdot \begin{bmatrix} \Delta b \\ \Delta\alpha_{s_1} \\ \vdots \\ \Delta\alpha_{s_{\ell_S}} \end{bmatrix} = - \begin{bmatrix} y_c \\ Q_{s_1 c} \\ \vdots \\ Q_{s_{\ell_S} c} \end{bmatrix} \Delta\alpha_c    (6)

with symmetric but not positive-definite Jacobian Q:

    Q = \begin{bmatrix} 0 & y_{s_1} & \cdots & y_{s_{\ell_S}} \\ y_{s_1} & Q_{s_1 s_1} & \cdots & Q_{s_1 s_{\ell_S}} \\ \vdots & \vdots & \ddots & \vdots \\ y_{s_{\ell_S}} & Q_{s_{\ell_S} s_1} & \cdots & Q_{s_{\ell_S} s_{\ell_S}} \end{bmatrix}    (7)

Thus, in equilibrium

    \Delta b = \beta\,\Delta\alpha_c    (8)

    \Delta\alpha_j = \beta_j\,\Delta\alpha_c, \quad \forall j \in D    (9)

with coefficient sensitivities given by

    \begin{bmatrix} \beta \\ \beta_{s_1} \\ \vdots \\ \beta_{s_{\ell_S}} \end{bmatrix} = -R \begin{bmatrix} y_c \\ Q_{s_1 c} \\ \vdots \\ Q_{s_{\ell_S} c} \end{bmatrix}    (10)

where R = Q^{-1}, and β_j ≡ 0 for all j outside S. Substituted in (4), the margins change according to:

    \Delta g_i = \gamma_i\,\Delta\alpha_c, \quad \forall i \in D \cup \{c\}    (11)

with margin sensitivities

    \gamma_i = Q_{ic} + \sum_{j \in S} Q_{ij}\beta_j + y_i\beta,    (12)

and γ_i ≡ 0 for all i in S.

2.3 Bookkeeping: upper limit on increment Δα_c

It has been tacitly assumed above that Δα_c is small enough so that no element of D moves across S, E and/or R in the process. Since the α_j and g_i change with α_c through (9) and (11), some bookkeeping is required to check each of the following conditions, and to determine the largest possible increment Δα_c accordingly:

1. g_c ≤ 0, with equality when c joins S;
2. α_c ≤ C, with equality when c joins E;
3. 0 ≤ α_j ≤ C, ∀j ∈ S, with equality 0 when j transfers from S to R, and equality C when j transfers from S to E;
4. g_i ≤ 0, ∀i ∈ E, with equality when i transfers from E to S;
5. g_i ≥ 0, ∀i ∈ R, with equality when i transfers from R to S.

2.4 Recursive magic: R updates

To add candidate c to the working margin vector set S, R is expanded as:

    R \leftarrow \begin{bmatrix} & & & 0 \\ & R & & \vdots \\ & & & 0 \\ 0 & \cdots & 0 & 0 \end{bmatrix} + \frac{1}{\gamma_c} \begin{bmatrix} \beta \\ \beta_{s_1} \\ \vdots \\ \beta_{s_{\ell_S}} \\ 1 \end{bmatrix} \begin{bmatrix} \beta & \beta_{s_1} & \cdots & \beta_{s_{\ell_S}} & 1 \end{bmatrix}    (13)

The same formula applies to add any vector (not necessarily the candidate) c to S, with parameters β, β_j and γ_c calculated as in (10) and (12).

The expansion of R, as incremental learning itself, is reversible.
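The sensitivity computations (10), (12) and the expansion (13) map directly onto small matrix operations. The following is a minimal numpy sketch (not the paper's released code): R is the inverse Jacobian over {b} ∪ S with index 0 for the b-term, Q_Sc holds the kernel-matrix column [Q_{s_1 c}, ..., Q_{s_ℓS c}], and all names are illustrative.

```python
import numpy as np

def sensitivities(R, y_c, Q_Sc, Q_cc):
    """Coefficient and margin sensitivities for a candidate c.

    Returns beta = [beta, beta_{s_1}, ..., beta_{s_lS}] from eq. (10)
    and the candidate's own margin sensitivity gamma_c from eq. (12).
    """
    v = np.concatenate(([y_c], Q_Sc))        # [y_c, Q_{s1 c}, ..., Q_{slS c}]
    beta = -R @ v                            # eq. (10)
    gamma_c = Q_cc + Q_Sc @ beta[1:] + y_c * beta[0]   # eq. (12) at i = c
    return beta, gamma_c

def expand_R(R, beta, gamma_c):
    """Rank-one expansion of R when the candidate joins S, eq. (13)."""
    m = R.shape[0]
    R_new = np.zeros((m + 1, m + 1))
    R_new[:m, :m] = R                        # pad R with a zero row and column
    u = np.append(beta, 1.0)                 # [beta, beta_{s1}, ..., 1]
    return R_new + np.outer(u, u) / gamma_c
```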
To remove a margin vector k from S, R is contracted as:

    R_{ij} \leftarrow R_{ij} - R_{kk}^{-1} R_{ik} R_{kj}, \quad \forall i, j \in S \cup \{0\};\; i, j \neq k    (14)

where index 0 refers to the b-term.

The R update rules (13) and (14) are similar to on-line recursive estimation of the covariance of (sparsified) Gaussian processes [2].

[Figure 2: Incremental learning. A new vector, initially classified with negative margin g_c < 0 at α_c = 0, becomes a new margin or error vector.]

2.5 Incremental procedure

Let ℓ → ℓ + 1, by adding point c (candidate margin or error vector) to D: D^{ℓ+1} = D^ℓ ∪ {c}. Then the new solution {α_i^{ℓ+1}, b^{ℓ+1}}, i = 1, ..., ℓ + 1, is expressed in terms of the present solution {α_i^ℓ, b^ℓ}, the present Jacobian inverse R, and the candidate x_c, y_c, as:

Algorithm 1 (Incremental Learning, ℓ → ℓ + 1)

1. Initialize α_c to zero;
2. If g_c > 0, terminate (c is not a margin or error vector);
3. If g_c ≤ 0, apply the largest possible increment α_c so that (the first) one of the following conditions occurs:
   (a) g_c = 0: Add c to margin set S, update R accordingly, and terminate;
   (b) α_c = C: Add c to error set E, and terminate;
   (c) Elements of D^ℓ migrate across S, E, and R ("bookkeeping," section 2.3): Update membership of elements and, if S changes, update R accordingly;
   and repeat as necessary.

The incremental procedure is illustrated in Figure 2. Old vectors, from previously seen training data, may change status along the way, but the process of adding the training data c to the solution converges in a finite number of steps.
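The largest admissible increment in step 3 is determined by the bookkeeping conditions of section 2.3: each condition yields a candidate limit on Δα_c through the sensitivities (9) and (11), and the smallest positive limit is taken. The following is a simplified Python sketch of this computation (not the paper's code; all argument names are illustrative, and degenerate cases are ignored).

```python
def max_increment(alpha_c, g_c, gamma_c, C,
                  alpha_S, beta_S, g_E, gamma_E, g_R, gamma_R):
    """Largest admissible Delta alpha_c before a bookkeeping condition
    of section 2.3 becomes active.  Quantities change linearly in
    Delta alpha_c: alpha_j at rate beta_j (eq. 9), g_i at rate gamma_i (eq. 11).
    """
    limits = []
    # 1. g_c rises to 0: c joins S
    if gamma_c > 0:
        limits.append(-g_c / gamma_c)
    # 2. alpha_c reaches C: c joins E
    limits.append(C - alpha_c)
    # 3. margin coefficients alpha_j hit C (j -> E) or 0 (j -> R)
    for a_j, b_j in zip(alpha_S, beta_S):
        if b_j > 0:
            limits.append((C - a_j) / b_j)
        elif b_j < 0:
            limits.append(-a_j / b_j)
    # 4. error vectors: g_i rises to 0 (i moves from E to S)
    for g_i, c_i in zip(g_E, gamma_E):
        if c_i > 0:
            limits.append(-g_i / c_i)
    # 5. ignored vectors: g_i falls to 0 (i moves from R to S)
    for g_i, c_i in zip(g_R, gamma_R):
        if c_i < 0:
            limits.append(-g_i / c_i)
    feasible = [l for l in limits if l > 0]
    return min(feasible) if feasible else 0.0
```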
2.6 Practical considerations

The trajectory of an example incremental training session is shown in Figure 3. The algorithm yields results identical to those at convergence using other QP approaches [7], with comparable speeds on various datasets ranging up to several thousands of training points.¹ A practical on-line variant for larger datasets is obtained by keeping track only of a limited set of "reserve" vectors: R = {i ∈ D | 0 < g_i < t}, and discarding all data for which g_i ≥ t. For small t, this implies a small overhead in memory over S and E. The larger t, the smaller the probability of missing a future margin or error vector in previous data. The resulting storage requirements are dominated by that for the inverse Jacobian R, which scales as (ℓ_S)^2, where ℓ_S is the number of margin support vectors, #S.

¹ Matlab code and data are available at http://bach.ece.jhu.edu/pub/gert/svm/incremental.

[Figure 3: Trajectory of coefficients α_i as a function of iteration step during training, for ℓ = 100 non-separable points in two dimensions, with C = 10, and using a Gaussian kernel with σ = 1. The data sequence is shown on the left.]

3 Decremental "Unlearning"

Leave-one-out (LOO) is a standard procedure for predicting the generalization power of a trained classifier, both from a theoretical and an empirical perspective [12]. It is naturally implemented by decremental unlearning, adiabatic reversal of incremental learning, on each of the training data from the full trained solution. Similar (but different) bookkeeping of elements migrating across S, E and R applies as in the incremental case.

[Figure 4: Leave-one-out (LOO) decremental unlearning (α_c → 0) for estimating generalization performance, directly on the training data. g_c^{\c} < -1 reveals a LOO classification error.]

3.1 Leave-one-out procedure

Let ℓ → ℓ - 1, by removing point c (margin or error vector) from D: D^{\c} = D \ {c}. The solution {α_i^{\c}, b^{\c}} is expressed in terms of {α_i, b}, R, and the removed point x_c, y_c. The solution yields g_c^{\c}, which determines whether leaving c out of the training set generates a classification error (g_c^{\c} < -1). Starting from the full ℓ-point solution:

Algorithm 2 (Decremental Unlearning, ℓ → ℓ - 1, and LOO Classification)

1. If c is not a margin or error vector: Terminate, "correct" (c is already left out, and correctly classified);
2. If c is a margin or error vector with g_c < -1: Terminate, "incorrect" (by default as a training error);
3. If c is a margin or error vector with g_c ≥ -1, apply the largest possible decrement α_c so that (the first) one of the following conditions occurs:
   (a) g_c < -1: Terminate, "incorrect";
   (b) α_c = 0: Terminate, "correct";
   (c) Elements of D^ℓ migrate across S, E, and R: Update membership of elements and, if S changes, update R accordingly;
   and repeat as necessary.

The leave-one-out procedure is illustrated in Figure 4.

[Figure 5: Trajectory of the LOO margin g_c as a function of the leave-one-out coefficient α_c. The data and parameters are as in Figure 3.]

3.2 Leave-one-out considerations

If an exact LOO estimate is requested, two passes through the data are required. The LOO pass has run-time complexity and memory requirements similar to those of the incremental learning procedure. This is significantly better than the conventional approach to empirical LOO evaluation, which requires ℓ (partial, but possibly still extensive) training sessions.

There is a clear correspondence between generalization performance and the LOO margin sensitivity γ_c. As shown in Figure 4, the value of the LOO margin g_c^{\c} is obtained from the sequence of g_c vs. α_c segments for each of the decrement steps, and is thus determined by their slopes γ_c. Incidentally, the LOO approximation using linear response theory in [6] corresponds to the first segment of the LOO procedure, effectively extrapolating the value of g_c^{\c} from the initial value of γ_c. This simple LOO approximation gives satisfactory results in most (though not all) cases, as illustrated in the example LOO session of Figure 5.

Recent work in statistical learning theory has sought improved generalization performance by considering non-uniformity of distributions in feature space [13] or non-uniformity in the kernel matrix eigenspectrum [10]. A geometrical interpretation of decremental unlearning, presented next, sheds further light on the dependence of generalization performance, through γ_c, on the geometry of the data.
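Algorithm 2 can be wrapped into a direct LOO error count over the training set. The sketch below is illustrative only: svm is a hypothetical state object (not an API from the paper) assumed to expose the margin g(c), the coefficient alpha(c), and a decrement_step(c) method that applies the largest admissible decrement with the associated bookkeeping.

```python
def is_loo_error(c, svm):
    """Decremental-unlearning LOO check for training point c (Algorithm 2)."""
    if not svm.is_margin_or_error_vector(c):
        return False                      # c already plays no role: "correct"
    if svm.g(c) < -1.0:
        return True                       # counted as an error by default
    alpha_c, g_c = svm.alpha(c), svm.g(c)
    while alpha_c > 0.0 and g_c >= -1.0:
        alpha_c, g_c = svm.decrement_step(c)   # largest admissible decrement
    return g_c < -1.0                     # g_c^{\c} < -1 signals a LOO error

# Exact LOO error estimate over an l-point training set:
# loo_errors = sum(is_loo_error(c, svm) for c in range(l))
```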
4 Geometric Interpretation in Feature Space

The differential Kuhn-Tucker conditions (4) and (5) translate directly in terms of the sensitivities γ_i and β_j as

    \gamma_i = Q_{ic} + \sum_{j \in S} Q_{ij}\beta_j + y_i\beta, \quad \forall i \in D \cup \{c\}    (15)

    0 = y_c + \sum_{j \in S} y_j\beta_j.    (16)

Through the nonlinear map X_i = y_i φ(x_i) into feature space, the kernel matrix elements reduce to linear inner products:

    Q_{ij} = y_i y_j K(x_i, x_j) = X_i \cdot X_j, \quad \forall i, j    (17)

and the KT sensitivity conditions (15) and (16) in feature space become

    \gamma_i = X_i \cdot \Big(X_c + \sum_{j \in S} X_j\beta_j\Big) + y_i\beta, \quad \forall i \in D \cup \{c\}    (18)

    0 = y_c + \sum_{j \in S} y_j\beta_j.    (19)

Since γ_i = 0, ∀i ∈ S, (18) and (19) are equivalent to minimizing a functional:

    \min_{\beta}: \quad W_c = \frac{1}{2}\Big(X_c + \sum_{j \in S} X_j\beta_j\Big)^2,    (20)

subject to the equality constraint (19) with Lagrange parameter β. Furthermore, the optimal value of W_c immediately yields the sensitivity γ_c, from (18):

    \gamma_c = 2W_c = \Big(X_c + \sum_{j \in S} X_j\beta_j\Big)^2 \ge 0.    (21)

In other words, the distance in feature space between sample c and its projection on S along (16) determines, through (21), the extent to which leaving out c affects the classification of c. Note that only margin support vectors are relevant in (21), and not the error vectors, which otherwise contribute to the decision boundary.
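Because of (17), the squared feature-space distance in (21) can be evaluated with kernel entries alone, without an explicit map φ. The following is a small numpy sketch (illustrative names, not from the paper) comparing γ_c computed from (12)/(15) with the kernel expansion of (21); the two agree when β and β_{s_j} are taken from (10).

```python
import numpy as np

def gamma_c_kernel(Q_cc, Q_Sc, beta0, beta_S, y_c):
    """gamma_c from eq. (12)/(15): Q_cc + sum_j Q_{c s_j} beta_{s_j} + y_c beta."""
    return Q_cc + np.dot(Q_Sc, beta_S) + y_c * beta0

def gamma_c_geometric(Q_cc, Q_Sc, Q_SS, beta_S):
    """gamma_c from eq. (21), expanded with the inner products (17):
    (X_c + sum_j beta_j X_j)^2 = Q_cc + 2 beta.Q_Sc + beta.Q_SS.beta."""
    return Q_cc + 2.0 * np.dot(beta_S, Q_Sc) + np.dot(beta_S, np.dot(Q_SS, beta_S))
```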
5 Concluding Remarks

Incremental learning and, in particular, decremental unlearning offer a simple and computationally efficient scheme for on-line SVM training and exact leave-one-out evaluation of the generalization performance on the training data. The procedures can be directly extended to a broader class of kernel learning machines with a convex quadratic cost functional under linear constraints, including SV regression. The algorithm is intrinsically on-line and extends to query-based learning methods [1]. The geometric interpretation of decremental unlearning in feature space elucidates a connection, similar to [13], between generalization performance and the distance of the data from the subspace spanned by the margin vectors.

References

[1] C. Campbell, N. Cristianini and A. Smola, "Query Learning with Large Margin Classifiers," in Proc. 17th Int. Conf. Machine Learning (ICML 2000), Morgan Kaufmann, 2000.
[2] L. Csato and M. Opper, "Sparse Representation for Gaussian Process Models," in Adv. Neural Information Processing Systems (NIPS 2000), vol. 13, 2001.
[3] T.-T. Frieß, N. Cristianini and C. Campbell, "The Kernel Adatron Algorithm: A Fast and Simple Learning Procedure for Support Vector Machines," in Proc. 15th Int. Conf. Machine Learning, Morgan Kaufmann, 1998.
[4] T.S. Jaakkola and D. Haussler, "Probabilistic Kernel Methods," Proc. 7th Int. Workshop on Artificial Intelligence and Statistics, 1998.
[5] T. Joachims, "Making Large-Scale Support Vector Machine Learning Practical," in Schölkopf, Burges and Smola, Eds., Advances in Kernel Methods - Support Vector Learning, Cambridge MA: MIT Press, 1998, pp 169-184.
[6] M. Opper and O. Winther, "Gaussian Processes and SVM: Mean Field Results and Leave-One-Out," in Smola, Bartlett, Schölkopf and Schuurmans, Eds., Advances in Large Margin Classifiers, Cambridge MA: MIT Press, 2000, pp 43-56.
[7] E. Osuna, R. Freund and F. Girosi, "An Improved Training Algorithm for Support Vector Machines," Proc. 1997 IEEE Workshop on Neural Networks for Signal Processing, pp 276-285, 1997.
[8] J.C. Platt, "Fast Training of Support Vector Machines Using Sequential Minimal Optimization," in Schölkopf, Burges and Smola, Eds., Advances in Kernel Methods - Support Vector Learning, Cambridge MA: MIT Press, 1998, pp 185-208.
[9] M. Pontil and A. Verri, "Properties of Support Vector Machines," Neural Computation, vol. 10, pp 955-974, 1997.
[10] B. Schölkopf, J. Shawe-Taylor, A.J. Smola and R.C. Williamson, "Generalization Bounds via Eigenvalues of the Gram Matrix," NeuroCOLT Technical Report 99-035, 1999.
[11] N.A. Syed, H. Liu and K.K. Sung, "Incremental Learning with Support Vector Machines," in Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-99), 1999.
[12] V. Vapnik, The Nature of Statistical Learning Theory, New York: Springer-Verlag, 1995.
[13] V. Vapnik and O. Chapelle, "Bounds on Error Expectation for SVM," in Smola, Bartlett, Schölkopf and Schuurmans, Eds., Advances in Large Margin Classifiers, Cambridge MA: MIT Press, 2000.
", "award": [], "sourceid": 1814, "authors": [{"given_name": "Gert", "family_name": "Cauwenberghs", "institution": null}, {"given_name": "Tomaso", "family_name": "Poggio", "institution": null}]}