{"title": "Escaping the Convex Hull with Extrapolated Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 753, "page_last": 760, "abstract": null, "full_text": "Escaping the  Convex Hull  with \nExtrapolated Vector  Machines. \n\nPatrick  Haffner \n\nAT&T Labs-Research,  200  Laurel  Ave,  Middletown,  NJ 07748 \n\nhaffner@research.att.com \n\nAbstract \n\nMaximum  margin  classifiers  such  as  Support  Vector  Machines \n(SVMs)  critically  depends  upon  the  convex  hulls  of the  training \nsamples  of each  class,  as  they  implicitly  search for  the  minimum \ndistance  between the convex hulls.  We  propose Extrapolated Vec(cid:173)\ntor  Machines  (XVMs)  which  rely  on  extrapolations  outside  these \nconvex hulls.  XVMs improve SVM generalization very significantly \non  the  MNIST  [7]  OCR  data.  They  share  similarities  with  the \nFisher  discriminant:  maximize  the  inter-class  margin  while  mini(cid:173)\nmizing the intra-class disparity. \n\n1 \n\nIntroduction \n\nBoth intuition  and  theory  [9]  seem  to  support  that  the  best  linear  separation be(cid:173)\ntween  two  classes  is  the  one  that  maximizes  the  margin.  But  is  this  always  true? \nIn the example shown in Fig.(l), the maximum margin hyperplane is  Wo;  however, \nmost observers would say that the separating hyperplane WI  has better chances to \ngeneralize, as it takes into account the expected location of additional training sam-\n\n\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022 f\\J:- \u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022 \n\u2022\u2022  . . ,  --.Q. \n~\"-\n.. \n................ ~x~... \n\n_ \n\n~, \n\n\u2022\u2022 \n. '  \n\nW\n\n1 \n\n--------------- ~---------------\n\n\u00b7 K~ \u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7 \n. ..  . ....... (}  ............ . \n\n'''-,,- /0- 0  00  00 \n\n00 0'\\ \no- o::-o \n\n\"OW-0_ o __  o-\n\nFigure 1:  Example of separation where the large margin is  undesirable.  The convex \nhull and the separation that corresponds to the standard SVM  use plain lines while \nthe extrapolated convex hulls and XVMs  use dotted lines. \n\n\fpIes.  Traditionally, to take this into account, one would estimate the distribution of \nthe data.  In  this  paper,  we  just use  a  very elementary form  of extrapolation (\"the \npoor man variance\")  and show that it can be implemented into a  new  extension to \nSVMs that we  call Extrapolated Vector Machines  (XVMs). \n\n2  Adding Extrapolation to  Maximum  Margin  Constraints \n\nThis  section states extrapolation as  a  constrained optimization problem  and  com(cid:173)\nputes a  simpler dual form. \nTake  two  classes  C+  and C_  with  Y+  =  +1  and  Y_  =  -1  1  as  respective  targets. \nThe N  training samples {(Xi, Yi);  1 ::::;  i  ::::;  N} are separated with a margin p if there \nexists  a  set of weights  W  such that Ilwll  =  1 and \n\nVk  E  {+, -}, Vi  E Ck,  Yk(w,xi+b)  2:  p \n\n(1) \nSVMs  offer  techniques  to find  the  weights  W  which  maximize the margin p.  
Now, instead of imposing the margin constraint on each training point, suppose that for two points in the same class C_k, we require any possible extrapolation within a range factor η_k ≥ 0 to be larger than the margin:

    ∀i,j ∈ C_k, ∀λ ∈ [-η_k, 1+η_k]:  y_k (w·(λ x_i + (1-λ) x_j) + b) ≥ ρ        (2)

It is sufficient to enforce the constraints at the ends of the extrapolation segments:

    ∀i,j ∈ C_k:  y_k (w·((η_k+1) x_i - η_k x_j) + b) ≥ ρ        (3)

Keeping the constraint over each pair of points would result in N² Lagrange multipliers. But we can reduce it to a double constraint applied to each single point. Writing μ_k = max_{i∈C_k} y_k(w·x_i) and ν_k = min_{i∈C_k} y_k(w·x_i), which we consider as optimization variables, it follows from Eq.(3) that:

    (η_+ + 1) ν_+ - η_+ μ_+ + b ≥ ρ        (4)
    (η_- + 1) ν_- - η_- μ_- - b ≥ ρ        (5)

By adding Eq.(4) and (5), the margin becomes

    2ρ = Σ_k ((η_k+1) ν_k - η_k μ_k) = Σ_k (ν_k - η_k (μ_k - ν_k))        (6)

Our problem is to maximize the margin under the double constraint:

    ∀i ∈ C_k:  ν_k ≤ y_k(w·x_i) ≤ μ_k

In other words, the extrapolated margin maximization is equivalent to squeezing the points belonging to a given class between two hyperplanes. Eq.(6) shows that ρ is maximized when ν_k is maximized while μ_k - ν_k is minimized.

Maximizing the margin over μ_k, ν_k and w with Lagrangian techniques gives us the following dual problem:

    min_{β,ξ} || Σ_k y_k Σ_{i∈C_k} ((η_k+1) β_i - η_k ξ_i) x_i ||²   subject to   β_i, ξ_i ≥ 0,  Σ_{i∈C_k} β_i = Σ_{i∈C_k} ξ_i = 1        (7)

Footnote 1: In this paper, it is necessary to index the outputs y with the class k rather than the more traditional sample index i, as extrapolation constraints require two examples to belong to the same class. The resulting equations are more concise, but harder to read.

Compared to the standard SVM formulation, we have two sets of support vectors. Moreover, the Lagrange multipliers that we chose are normalized differently from the traditional SVM multipliers (note that this is one possible choice of notation; see Section 6 for an alternative choice). They sum to 1 and allow an interesting geometric interpretation, developed in the next section.

3 Geometric Interpretation and Iterative Algorithm

For each class k, we define the nearest point to the other class's convex hull along the direction of w: N_k = Σ_{i∈C_k} β_i x_i. N_k is a combination of the internal support vectors that belong to class k with β_i > 0. At the minimum of (7), because they correspond to non-zero Lagrange multipliers, they fall on the internal margin y_k(w·x_i) = ν_k; therefore, we obtain ν_k = y_k w·N_k.

Similarly, we define the furthest point F_k = Σ_{i∈C_k} ξ_i x_i. F_k is a combination of the external support vectors, and we have μ_k = y_k w·F_k.

The dual problem is equivalent to the distance minimization problem

    min_{N_k, F_k ∈ H_k} || Σ_k y_k ((η_k+1) N_k - η_k F_k) ||²

where H_k is the convex hull containing the examples of class k.

It is possible to solve this optimization problem using an iterative Extrapolated Convex Hull Distance Minimization (XCHDM) algorithm. It is an extension of the Nearest Point [5] or Maximal Margin Perceptron [6] algorithms. An interesting geometric interpretation is also offered in [3]. A small numerical sketch of the quantities involved is given below.
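The sketch below (Python/NumPy; illustrative only, not the paper's XCHDM implementation, with coefficient values made up rather than optimized) assembles the quantities just defined: the hull points N_k and F_k built from convex coefficients β and ξ, the extrapolated points X_k = (η_k+1) N_k - η_k F_k, and the squared dual objective ||X_+ - X_-||² that XCHDM minimizes.

    import numpy as np

    def extrapolated_point(Xk, beta, xi, eta):
        """N_k and F_k are convex combinations of the class-k examples (rows of Xk);
        the extrapolated point is X_k = (eta+1) N_k - eta F_k."""
        N_k = beta @ Xk      # internal point: beta >= 0, beta.sum() == 1
        F_k = xi @ Xk        # external point: xi >= 0,  xi.sum() == 1
        return (eta + 1.0) * N_k - eta * F_k

    def dual_objective(Xpos, Xneg, beta_pos, xi_pos, beta_neg, xi_neg, eta):
        """Squared dual margin ||X_+ - X_-||^2 for the given convex coefficients."""
        Xp = extrapolated_point(Xpos, beta_pos, xi_pos, eta)
        Xn = extrapolated_point(Xneg, beta_neg, xi_neg, eta)
        w = Xp - Xn          # weight direction, up to normalization
        return float(w @ w), w

    # Toy usage with hand-picked (not optimized) coefficients.
    Xpos = np.array([[2.0, 1.0], [3.0, 0.0], [2.5, 2.0]])
    Xneg = np.array([[-2.0, 0.0], [-3.0, 1.0]])
    obj, w = dual_objective(Xpos, Xneg,
                            np.array([0.5, 0.5, 0.0]), np.array([0.0, 0.0, 1.0]),
                            np.array([1.0, 0.0]), np.array([0.0, 1.0]), eta=1.0)
    print(obj, w)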
All the aforementioned algorithms search for the points in the convex hulls of each class that are nearest to each other (N_+ and N_- in Fig. 1); the maximal margin weight vector is then w = N_+ - N_-. XCHDM looks for the nearest points in the extrapolated convex hulls (X_+ and X_- in Fig. 1). The extrapolated nearest points are X_k = (η_k+1) N_k - η_k F_k. Note that they can be outside the convex hull because we allow a negative contribution from the external support vectors. Here again, the weight vector can be expressed as a difference between two points, w = X_+ - X_-. When the data is non-separable, the solution is trivial with w = 0. With the double set of Lagrange multipliers, the full description of the XCHDM algorithm is beyond the scope of this paper. XCHDM with η_k = 0 yields simple SVMs trained by the same algorithm as in [6].

An interesting way to follow the convergence of the XCHDM algorithm is the following. Define the extrapolated primal margin

    γ_1 = 2ρ = Σ_k ((η_k+1) ν_k - η_k μ_k)

and the dual margin

    γ_2 = ||X_+ - X_-||

Convergence consists in reducing the duality gap γ_2 - γ_1 down to zero. In the rest of the paper, we will measure convergence with the duality ratio r = γ_1 / γ_2.

Determining the threshold used to compute the classifier output class sign(w·x + b) leaves us with two choices. We can require the separation to happen at the center of the primal margin, with the primal threshold (subtract Eq.(5) from Eq.(4))

    b_1 = -(1/2) Σ_k y_k ((η_k+1) ν_k - η_k μ_k)

or at the center of the dual margin, with the dual threshold

    b_2 = -(1/2) w · Σ_k ((η_k+1) N_k - η_k F_k) = -(1/2) (||X_+||² - ||X_-||²)

Again, at the minimum, it is easy to verify that b_1 = b_2. When we did not let the XCHDM algorithm converge to the minimum, we found that b_1 gave better generalization results.

Our standard stopping heuristic is numerical: stop when the duality ratio gets over a fixed value (typically between 0.5 and 0.9).

The only other stopping heuristic we have tried so far is based on the following idea. Define the set of extrapolated pairs as {(η_k+1) x_i - η_k x_j; 1 ≤ i,j ≤ N}. Convergence means that we find extrapolated support pairs such that every extrapolated pair lies on the correct side of the margin. We can relax this constraint and stop when the extrapolated support pairs contain every vector (rather than every extrapolated pair). This means that γ_2 must be lower than the primal margin along w measured on the non-extrapolated data, ν_+ + ν_-. This causes the XCHDM algorithm to stop long before γ_2 reaches that value, and is called the hybrid stopping heuristic.

4 Beyond SVMs and Discriminant Approaches

Kernel machines consist of any classifier of the type f(x) = Σ_i α_i y_i K(x, x_i). SVMs offer one solution among many others, with the constraint α_i ≥ 0. XVMs look for solutions that no longer bear this constraint. While the algorithm described in Section 2 converges toward a solution where vectors act as supports of margins (internal and external), experiments show that the performance of XVMs can be significantly improved if we stop before full convergence. In this case, the vectors with α_i ≠ 0 do not line up onto any type of margin, and should not be called support vectors.
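To make the kernel-machine form above concrete, here is a minimal sketch (Python/NumPy; the kernel choice, coefficients and data are hypothetical, not a trained XVM) of the decision function f(x) = Σ_i α_i y_i K(x, x_i) + b, where an SVM would constrain α_i ≥ 0 while an XVM-style expansion may include negative α_i for the external vectors.

    import numpy as np

    def poly_kernel(x, z, degree=4):
        """Normalized polynomial kernel (x.z)^p, assuming unit-norm inputs."""
        return (x @ z) ** degree

    def kernel_machine(x, vectors, labels, alpha, b, kernel=poly_kernel):
        """Generic kernel machine f(x) = sum_i alpha_i y_i K(x, x_i) + b.
        SVMs require alpha_i >= 0; XVM-style solutions may contain negative alpha_i."""
        k = np.array([kernel(x, xi) for xi in vectors])
        return float(np.sum(alpha * labels * k) + b)

    # Toy usage with hypothetical coefficients (note the negative one).
    vectors = np.array([[0.6, 0.8], [0.8, 0.6], [-0.6, 0.8]])
    labels = np.array([+1, +1, -1])
    alpha = np.array([0.7, -0.2, 0.5])
    print(np.sign(kernel_machine(np.array([0.7, 0.7]), vectors, labels, alpha, b=0.0)))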
The extrapolated margin contains terms which are caused by the extrapolation and are proportional to the width of each class along the direction of w. We would observe the same phenomenon if we had trained the classifier using Maximum Likelihood Estimation (MLE) (replace class width with variance). In both MLE and XVMs, the examples which are furthest from the decision surface play an important role. XVMs suggest an explanation why.

Note also that, like the Fisher discriminant, XVMs look for the projection that maximizes the inter-class variance while minimizing the intra-class variances.

5 Experiments on MNIST

The MNIST OCR database contains 60,000 handwritten digits for training and 10,000 for testing (the testing data can be extended to 60,000 but we prefer to keep unseen test data for final testing and comparisons). This database has been extensively studied with a large variety of learning approaches [7]. It led to the first SVM "success story" [2], and results have been improved since then by using knowledge about the invariance of the data [4].

The input vector is a list of 28x28 pixels ranging from 0 to 255. Before computing the kernels, the input vectors are normalized to unit length: x ← x / ||x||.

Good polynomial kernels are easy to define as K_p(x,y) = (x·y)^p. We found these normalized kernels to significantly outperform the unnormalized kernels K'_p(x,y) = (a (x·y) + b)^p that have traditionally been used for the MNIST data. For instance, the baseline error rate with K_4 is below 1.2%, whereas it hovers around 1.5% for K'_4 (after choosing optimal values for a and b) (see footnote 2).

We also define normalized Gaussian kernels:

    K̄_p(x,y) = exp(-(p/2) ||x - y||²) = [exp(x·y - 1)]^p        (8)

Eq.(8) shows how they relate to normalized polynomial kernels: when x·y is close to 1, K_p and K̄_p have the same asymptotic behavior. We observed that on MNIST, the performance with K̄_p is very similar to what is obtained with unnormalized Gaussian kernels K_u(x,y) = exp(-||x - y||²/σ²). However, the normalized kernels are easier to analyze and compare to polynomial kernels. A short numerical check of Eq.(8) is sketched below, after the description of the experimental setup.

MNIST contains 1 class per digit, so the total number of classes is M = 10. To combine binary classifiers and perform multiclass classification, the two most common approaches were considered.

• In the one-vs-others case (1vsR), we have one classifier per class c, with the positive examples taken from class c and the negative examples from the other classes. Class c is recognized when the corresponding classifier yields the largest output.

• In the one-vs-one case (1vs1), each classifier only discriminates one class from another: we need a total of M(M-1)/2 = 45 classifiers.

Despite the effort we spent on optimizing the recombination of the classifiers [8] (see footnote 3), 1vsR SVMs (Table 1) perform significantly better than 1vs1 SVMs (Table 2) (see footnote 4).

For each trial, the number of errors over the 10,000 test samples (#err) and the total number of support vectors (#SV) are reported. As we only count SVs which are shared by different classes once, this predicts the test time. For instance, 12,000 support vectors mean that 20% of the 60,000 vectors are used as support.
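The relation in Eq.(8) is easy to check numerically. The sketch below (Python/NumPy; illustrative, with random vectors standing in for MNIST digits) normalizes two input vectors and compares the normalized Gaussian kernel with [exp(x·y - 1)]^p and with the normalized polynomial kernel.

    import numpy as np

    def normalize(x):
        """Scale a raw pixel vector to unit length before any kernel is computed."""
        return x / np.linalg.norm(x)

    def poly_kernel(x, z, p):
        """Normalized polynomial kernel K_p(x, z) = (x.z)^p."""
        return (x @ z) ** p

    def gauss_kernel(x, z, p):
        """Normalized Gaussian kernel of Eq.(8): exp(-(p/2) ||x - z||^2)."""
        d = x - z
        return np.exp(-0.5 * p * (d @ d))

    # On unit-norm inputs, exp(-(p/2)||x-z||^2) equals [exp(x.z - 1)]^p exactly,
    # and approaches (x.z)^p as x.z gets close to 1.
    rng = np.random.default_rng(0)
    x, z = normalize(rng.random(784)), normalize(rng.random(784))
    p = 4
    print(gauss_kernel(x, z, p), np.exp(x @ z - 1.0) ** p, poly_kernel(x, z, p))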
Preliminary experiments to choose the value of η_k with the hybrid criterion show that the results for η_k = 1 are better than η_k = 1.5 in a statistically significant way, and slightly better than η_k = 0.5. We did not consider configurations where η_+ ≠ η_-; however, this would make sense for the asymmetrical 1vsR classifiers.

The XVM gain in performance over SVMs for a given configuration ranges from 15% (1vsR in Table 3) to 25% (1vs1 in Table 2).

Footnote 2: This may partly explain a nagging mystery among researchers working on MNIST: how did Cortes and Vapnik [2] obtain 1.1% error with a degree 4 polynomial?

Footnote 3: We compared the Max Wins voting algorithm with the DAGSVM decision tree algorithm and found them to perform equally, and worse than 1vsR SVMs. This is surprising in the light of results published on other tasks [8], and would require further investigations beyond the scope of this paper.

Footnote 4: Slightly better performance was obtained with a new algorithm that uses the incremental properties of our training procedure (this is the performance reported in the tables). In a transductive inference framework, treat the test example as a training example: for each of the M possible labels, retrain the M-1 among the M(M-1)/2 classifiers that use examples with that label. The best label is the one that causes the smallest decrease in the multiclass margin ρ, which combines the classifier margins ρ_c in the following manner:

    1/ρ² = Σ_{c≤M} 1/ρ_c²

The fact that this margin predicts generalization is "justified" by Theorem 1 in [8].

                Duality ratio stop
                0.40              0.75              0.99
    Kernel      #err    #SV       #err    #SV       #err    #SV
    K_3         136     8367      136     11132     132     13762
    K_4         127     8331      117     11807     119     15746
    K_5         125     8834      119     12786     119     17868
    K_9         136     13002     137     18784     141     25953
    K̄_2         147     9014      128     11663     131     13918
    K̄_4         125     8668      119     12222     117     16604
    K̄_5         125     8944      125     12852     125     18085

Table 1: SVMs on MNIST with 10 1vsR classifiers.

                SVM (ratio 0.99)    XVM (hybrid)
    Kernel      #err    #SV         #err    #SV
    K_3         138     17020       117     11952
    K_4         135     16066       110     13526
    K_5         191     15775       114     13526

Table 2: SVM/XVM on MNIST with 45 1vs1 classifiers.

The 103 errors obtained with K_4 and r = 0.5 in Table 3 represent only about 1% error: this is the lowest error ever reported for any learning technique without a priori knowledge about the fact that the input data corresponds to a pixel map (the lowest reproducible error previously reported was 1.2%, with SVMs and polynomials of degree 9 [4]; it can be reduced to 0.6% by using invariance properties of the pixel map). The downside is that XVMs require 4 times as many support vectors as standard SVMs.

Table 3 compares stopping according to the duality ratio and the hybrid criterion. With the duality ratio, the best performance is most often reached with r = 0.50 (if this happens to be consistently true, validation data to decide when to stop could be spared).
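As a small illustration of the margin combination rule in footnote 4 above (a sketch only; the per-classifier margins below are made-up numbers, and the selection rule "keep the combined margin largest" is our reading of the smallest-decrease criterion):

    import numpy as np

    def combined_margin(classifier_margins):
        """Combine binary-classifier margins rho_c into the multiclass margin rho
        through 1/rho^2 = sum_c 1/rho_c^2 (footnote 4)."""
        rho_c = np.asarray(classifier_margins, dtype=float)
        return 1.0 / np.sqrt(np.sum(1.0 / rho_c ** 2))

    def transductive_label(margins_if_label):
        """Pick the candidate label whose tentative retraining leaves the combined
        multiclass margin largest, i.e. decreases it the least."""
        return max(margins_if_label, key=lambda lab: combined_margin(margins_if_label[lab]))

    # Toy usage: hypothetical margins of the affected classifiers for three candidate labels.
    margins_if_label = {3: [0.9, 1.1, 0.8], 5: [0.4, 1.0, 0.7], 8: [0.85, 0.9, 1.2]}
    print(transductive_label(margins_if_label))   # prints the retained label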
The hybrid criterion does not require validation data and yields errors that, while higher than the best XVM, are lower than those of SVMs, while only requiring a few more support vectors. It also takes fewer iterations to train than SVMs. One way to interpret this hybrid stopping criterion is that we stop when interpolation in some (but not all) directions accounts for all non-interpolated vectors. This suggests that interpolation is only desirable in a few directions.

The XVM gain is stronger in the 1vs1 case (Table 2). This suggests that extrapolating on a convex hull that contains several different classes (in the 1vsR case) may be undesirable.

                Duality ratio stop                                      Hybrid stop
                0.40              0.50              0.75
    Kernel      #err    #SV       #err    #SV       #err    #SV        #err    #SV
    K_3         118     46662     111     43819     116     50216      125     20604
    K_4         112     40274     103     43132     110     52861      107     18002
    K_5         109     36912     106     44226     110     49383      107     17322
    K_9         128     35809     126     39462     131     50233      125     19218
    K̄_2         114     43909     114     46905     114     53676      119     20152
    K̄_4         108     36980     111     40329     114     51088      108     16895

Table 3: XVMs on MNIST with 10 1vsR classifiers.

6 The Soft Margin Case

MNIST is characterized by the quasi-absence of outliers, so assuming that the data is fully separable does not impair performance at all. To extend XVMs to non-separable data, we first considered the traditional approach of adding slack variables that allow margin constraints to be violated. The most commonly used approach with SVMs adds linear slack variables to the unitary margin. Its application to the XVM requires giving up the weight normalization constraint, so that the usual unitary margin can be used in the constraints [9].

Compared to standard SVMs, a new issue to tackle is the fact that each constraint corresponds to a pair of vectors: ideally, we should handle N² slack variables ξ_ij. To have linear constraints that can be solved with KKT conditions, we need the decomposition ξ_ij = (η_k+1) ξ_i + η_k ξ'_j (the factors (η_k+1) and η_k are added here to ease later simplifications).

Similarly to Eq.(3), the constraint on the extrapolation from any pair of points is

    ∀i,j ∈ C_k:  y_k (w·((η_k+1) x_i - η_k x_j) + b) ≥ 1 - (η_k+1) ξ_i - η_k ξ'_j,   with ξ_i, ξ'_j ≥ 0        (9)

Introducing μ_k = max_{j∈C_k} (y_k(w·x_j + b) - ξ'_j) and ν_k = min_{i∈C_k} (y_k(w·x_i + b) + ξ_i), we obtain the simpler double constraint

    ∀i ∈ C_k:  ν_k - ξ_i ≤ y_k(w·x_i + b) ≤ μ_k + ξ'_i,   with ξ_i, ξ'_i ≥ 0        (10)

It follows from Eq.(9) that μ_k and ν_k are tied through (1+η_k) ν_k = 1 + η_k μ_k.

If we fix μ_k (and thus ν_k) instead of treating it as an optimization variable, the problem amounts to a standard SVM regression problem with {-1,+1} outputs, the width of the asymmetric ε-insensitive tube being μ_k - ν_k = (μ_k - 1)/(1+η_k).

This remark makes it possible for the reader to verify the results we reported on MNIST. Using the publicly available SVM software SVMTorch [1] with C = 10 and ε = 0.1 as the width of the ε-tube yields a 10-class error rate of 1.15%, while the best performance using SVMTorch in classification mode is 1.3% (in both cases, we use Gaussian kernels with parameter σ = 1650).
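The verification just described can be approximated with any ε-insensitive SVM regression package. The sketch below uses scikit-learn's SVR as a stand-in (an assumption on our part; the paper used SVMTorch [1], whose kernel parameterization differs, so the gamma value here is illustrative rather than the σ = 1650 above, and SVR's tube is symmetric rather than asymmetric). It fits a regressor on {-1,+1} targets and classifies with the sign of its output.

    import numpy as np
    from sklearn.svm import SVR

    def regression_classifier(X, y, C=10.0, epsilon=0.1, gamma=1.0):
        """Fixed-mu_k view of the soft-margin XVM: epsilon-insensitive regression
        on {-1,+1} targets, used as a classifier through sign(f(x))."""
        reg = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=gamma)
        reg.fit(X, y)
        return reg

    # Toy usage on synthetic 2-D data (MNIST itself is omitted here).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
    reg = regression_classifier(X, y)
    print("training accuracy:", float(np.mean(np.sign(reg.predict(X)) == y)))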
An explicit minimization over μ_k requires adding to the standard SVM regression problem a further constraint over the Lagrange multipliers, which sums them separately over the examples with y_i = 1 and with y_i = -1 (we use the same notation as in [9]). Note that we still have the standard regression constraint Σ_i α_i = Σ_i α*_i.

This has not been implemented yet, as we question the pertinence of the ξ'_i slack variables for XVMs. Experiments with SVMTorch on a variety of tasks where non-zero slacks are required to achieve optimal performance (Reuters, UCI/Forest, UCI/Breast cancer) have not shown significant improvement when using the regression mode while we vary the width of the ε-tube.

Many experiments on SVMs have reported that removing the outliers often gives efficient and sparse solutions. The early stopping heuristics that we have presented for XVMs suggest strategies to avoid learning (or to unlearn) the outliers, and this is the approach we are currently exploring.

7 Concluding Remarks

This paper shows that large margin classification on extrapolated data is equivalent to adding, to the standard SVM approach, the minimization of a second, external margin. The associated optimization problem is solved efficiently with convex hull distance minimization algorithms. A 1% error rate is obtained on the MNIST dataset: it is the lowest ever obtained without a priori knowledge about the data.

We are currently trying to identify what other types of datasets show similar gains over SVMs, and to determine how dependent XVM performance is on the data being separable or having invariance properties. We have only explored a few among the many variations that the XVM models and algorithms allow, and a justification of why and when they generalize would help model selection. Geometry-based algorithms that handle potential outliers are also under investigation.

Learning theory bounds that are a function of both the margin and some form of variance of the data would be necessary to predict XVM generalization and to allow us to also consider the extrapolation factor η as an optimization variable.

References

[1] R. Collobert and S. Bengio. Support vector machines for large-scale regression problems. Technical Report IDIAP-RR-00-17, IDIAP, 2000.

[2] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1-25, 1995.

[3] D. Crisp and C.J.C. Burges. A geometric interpretation of ν-SVM classifiers. In Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, K.-R. Müller, eds., Cambridge, MA, 2000. MIT Press.

[4] D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, special issue on Support Vector Machines and Methods, 2001.

[5] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, and K.R.K. Murthy. A fast iterative nearest point algorithm for support vector machine classifier design. IEEE Transactions on Neural Networks, 11(1):124-136, January 2000.

[6] A. Kowalczyk. Maximal margin perceptron. In Advances in Large Margin Classifiers, Smola, Bartlett, Schölkopf, and Schuurmans, editors, Cambridge, MA, 2000. MIT Press.

[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.

[8] J. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, K.-R. Müller, eds., Cambridge, MA, 2000. MIT Press.

[9] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.
", "award": [], "sourceid": 2037, "authors": [{"given_name": "Patrick", "family_name": "Haffner", "institution": null}]}