{"title": "Convergence of Large Margin Separable Linear Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 357, "page_last": 363, "abstract": null, "full_text": "Convergence of Large Margin Separable Linear \n\nClassification \n\nTong Zhang \n\nMathematical Sciences Department \nIBM TJ. Watson Research Center \n\nYorktown Heights, NY  10598 \n\ntzhang@watson.ibm.com \n\nAbstract \n\nLarge  margin  linear classification  methods  have  been  successfully  ap(cid:173)\nplied to many applications.  For a linearly separable problem, it is known \nthat under appropriate assumptions, the expected misclassification error \nof the computed \"optimal hyperplane\" approaches zero at a rate propor(cid:173)\ntional  to  the  inverse  training  sample  size.  This  rate  is  usually charac(cid:173)\nterized  by the margin and the maximum norm of the input data.  In  this \npaper,  we  argue  that another quantity,  namely  the robustness of the  in(cid:173)\nput data distribution,  also  plays  an  important role  in characterizing  the \nconvergence behavior of expected misclassification error.  Based on  this \nconcept of robustness,  we  show that for a large margin  separable linear \nclassification problem, the expected misclassification error may converge \nexponentially in the number of training sample size. \n\n1  Introduction \n\nWe  consider the binary classification problem:  to determine  a label  y  E  {-1, 1}  associ(cid:173)\nated  with an  input vector  x.  A useful method for solving this problem is  by using linear \ndiscriminant functions . Specifically, we seek a weight vector wand a threshold ()  such that \nwT  x  < ()  if its label y  =  -1 and wT  x  ~ ()  if its label y  =  1. \nIn this paper,  we are mainly interested in problems that are linearly separable by a positive \nmargin (although, as we shall see later, our analysis is suitable for non-separable problems). \nThat is, there exists a hyperplane that perfectly separates the in-class data from the out-of(cid:173)\nclass  data.  We  shall  also  assume  ()  =  0 throughout the  rest of the  paper for  simplicity. \nThis restriction usually does not cause problems in practice since one can always append a \nconstant feature to the input data x, which offset the effect of (). \n\nlinearly \n\nseparable  problems, \n\nFor \nlabeled  data \n(X1,yl), .. . ,(xn,yn),  Vapnik  recently  proposed  a  method  that  optimizes  a  hard \nmargin  bound which  he  calls  the  \"optimal hyperplane\"  method  (see  [11]).  The  optimal \nhyperplane Wn  is the solution to the following quadratic programming problem: \n\ngiven  a \n\ntraining  set  of  n \n\n.  1 \n2 \nmln-w \nw  2 \n\n(1) \n\n\fFor linearly  non-separable  problems,  a generalization of the optimal hyperplane method \nhas appeared in [2], where a slack variable f.i  is introduced for each data point (xi, yi) for \ni  =  1, ... ,n. We compute a hyperplane Wn  that solves \ns.t.  wTxiyi  2:  I-f.i, \n\nfori = 1, ... ,no \n\nf.i  2:  0 \n\nmin~wTw+CLf.i \nw,~  2 \n\n. , \n\n(2) \n\nWhere C > 0 is a given parameter (also see  [11]). \nIn this paper, we are interested in the quality of the computed weight Wn  for the purpose of \npredicting the label y of an unseen data point x. We study this predictive power of Wn in the \nstandard batch learning framework.  That is,  we assume that the training data (xi, yi)  for \ni  =  1, ... n  are independently drawn from the same underlying data distribution D  which \nis unknown. 
We organize the paper as follows. In Section 2, we briefly review a number of existing techniques for analyzing separable linear classification problems. We then derive an exponential convergence rate of the misclassification error in Section 3 for certain large margin linear classification problems. Section 4 compares the newly derived bound with known results from the traditional margin analysis. We explain that the exponential bound relies on a new quantity (the robustness of the distribution) which is not explored in a traditional margin bound. Note that for certain batch learning problems, exponential learning curves have already been observed [10]. It is thus not surprising that an exponential rate of convergence can be achieved by large margin linear classification.

2 Some known results on generalization analysis

There are a number of ways to obtain bounds on the generalization error of a linear classifier. A general framework is to use techniques from empirical processes (also known as VC analysis). Many such results related to large margin classification have been described in chapter 4 of [3].

The main advantage of this framework is its generality. The analysis does not require the estimated parameter to converge to the true parameter, which is ideal for combinatorial problems. However, for problems that are numerical in nature, the potential parameter space can be significantly reduced by using the first order condition of the optimal solution. In this case, the VC analysis may become suboptimal since it assumes a larger search space than what a typical numerical procedure uses. Generally speaking, for a problem that is linearly separable with a large margin, the expected classification error of the computed hyperplane resulting from this analysis is of the order O(log n / n).¹ Similar generalization bounds can also be obtained for non-separable problems.

¹ Bounds described in [3] would imply an expected classification error of O(log² n / n), which can be slightly improved (by a log n factor) if we adopt a slightly better covering number estimate such as the bounds in [12, 14].

In chapter 10 of [11], Vapnik described a leave-one-out cross-validation analysis for linearly separable problems. This analysis takes into account the first order KKT condition of the optimal hyperplane w_n. The expected generalization performance from this analysis is O(1/n), which is better than the corresponding bounds from the VC analysis. Unfortunately, this technique is only suitable for deriving an expected generalization bound (for example, it is not useful for obtaining a PAC style probability bound).

Another well-known technique for analyzing linearly separable problems is the mistake bound framework in online learning. It is possible to obtain an algorithm with a small generalization error in the batch learning setting from an algorithm with a small online mistake bound. The reader is referred to [6] and references therein for this type of analysis. The technique may lead to a bound with an expected generalization performance of O(1/n).
Besides the above mentioned approaches, generalization ability can also be studied in the statistical mechanical learning framework. It was shown that for linearly separable problems, an exponential decrease of the misclassification error is possible under this framework [1, 5, 7, 8]. Unfortunately, it is unclear how to relate the statistical mechanical learning framework to the batch learning framework considered in this paper. Their analysis, employing approximation techniques, does not seem to imply the small sample bounds in which we are interested.

The statistical mechanical learning result suggests that it may be possible to obtain a similar exponential decay of the misclassification error in the batch learning setting, which we prove in the next section. Furthermore, we show that the exponential rate depends on a quantity that is different from the traditional margin concept. Our analysis relies on a PAC style probability estimate of the convergence rate of the estimated parameter from (2) to the true parameter. Consequently, it is suitable for non-separable problems. A direct analysis of the convergence rate of the estimated parameter to the true parameter is important for problems that are numerical in nature, such as (2). However, a disadvantage of our analysis is that we are unable to deal directly with the linearly separable formulation (1).

3 Exponential convergence

We can rewrite the SVM formulation (2) by eliminating the slack variables ξ as

    w_n(λ) = argmin_w (1/n) Σ_i f(w^T x^i y^i - 1) + (λ/2) w^T w,        (3)

where λ = 1/(nC) and

    f(z) = -z  for z ≤ 0,    f(z) = 0  for z > 0.

Denote by D the true underlying data distribution of (x, y), and let w_*(λ) be the optimal solution with respect to the true distribution:

    w_*(λ) = arg inf_w E_D f(w^T x y - 1) + (λ/2) w^T w.        (4)

Let w_* be the solution to

    w_* = arg inf_w (1/2) w^T w    s.t.  E_D f(w^T x y - 1) = 0,        (5)

which is the infinite-sample version of the optimal hyperplane method.

Throughout this section, we assume ||w_*||_2 < ∞ and E_D ||x||_2 < ∞. The latter condition ensures that E_D f(w^T x y - 1) ≤ ||w||_2 E_D ||x||_2 + 1 exists for all w.

3.1 Continuity of the solution under regularization

In this section, we show that ||w_*(λ) - w_*||_2 → 0 as λ → 0. This continuity result allows us to approximate (5) by using (4) and (3) with a small positive regularization parameter λ. We only need to show that within any sequence of λ that converges to zero, there exists a subsequence λ_i → 0 such that w_*(λ_i) converges to w_* strongly.

We first consider the following inequality, which follows from the definition of w_*(λ):

    E_D f(w_*(λ)^T x y - 1) + (λ/2) w_*(λ)² ≤ (λ/2) w_*².        (6)

Therefore ||w_*(λ)||_2 ≤ ||w_*||_2.

It is well known that every bounded sequence in a Hilbert space contains a weakly convergent subsequence (cf. Proposition 66.4 in [4]). Therefore within any sequence of λ that converges to zero, there exists a subsequence λ_i → 0 such that w_*(λ_i) converges weakly. We denote the limit by ŵ.

Since f(w_*(λ_i)^T x y - 1) is dominated by ||w_*||_2 ||x||_2 + 1, which has a finite integral with respect to D, it follows from (6) and the Lebesgue dominated convergence theorem that

    0 = lim_i E_D f(w_*(λ_i)^T x y - 1) = E_D lim_i f(w_*(λ_i)^T x y - 1) = E_D f(ŵ^T x y - 1).        (7)

Also note that ||ŵ||_2 ≤ lim_i ||w_*(λ_i)||_2 ≤ ||w_*||_2; therefore, by the definition of w_*, we must have ŵ = w_*.

Since w_* is the weak limit of w_*(λ_i), we obtain ||w_*||_2 ≤ lim_i ||w_*(λ_i)||_2. Also, since ||w_*(λ_i)||_2 ≤ ||w_*||_2, it follows that lim_i ||w_*(λ_i)||_2 = ||w_*||_2. This equality implies that w_*(λ_i) converges to w_* strongly, since

    lim_i (w_*(λ_i) - w_*)² = lim_i w_*(λ_i)² + w_*² - 2 lim_i w_*(λ_i)^T w_* = 0.
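The continuity statement above can be checked numerically on a toy, finitely supported distribution. The following sketch is not from the paper; the support points, the probabilities, and the use of the CVXPY solver are illustrative assumptions. It computes w_*(λ) from (4) for decreasing λ and compares it with the minimum-norm separating solution w_* of (5); the printed distances should shrink toward zero.

import numpy as np
import cvxpy as cp

# toy distribution: support points (x^i, y^i) with probabilities p_i
X = np.array([[1.0, 0.2], [0.8, -0.1], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
p = np.array([0.45, 0.45, 0.10])
d = X.shape[1]

def w_star_lam(lam):
    # regularized population solution (4); f(z) = max(-z, 0), so
    # f(w^T x y - 1) is the usual hinge loss
    w = cp.Variable(d)
    hinge = cp.pos(1 - cp.multiply(y, X @ w))
    objective = cp.Minimize(cp.sum(cp.multiply(p, hinge)) + lam / 2 * cp.sum_squares(w))
    cp.Problem(objective).solve()
    return w.value

# infinite-sample optimal hyperplane (5): min ||w||^2 / 2 subject to
# margin >= 1 on every support point, i.e. E_D f(w^T x y - 1) = 0
w = cp.Variable(d)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), [cp.multiply(y, X @ w) >= 1]).solve()
w_star = w.value

for lam in (1.0, 0.1, 0.01, 0.001):
    print(lam, np.linalg.norm(w_star_lam(lam) - w_star))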
3.2 Accuracy of the estimated hyperplane with a non-zero regularization parameter

Our goal is to show that for the estimation method (3) with a non-zero regularization parameter λ > 0, the estimated parameter w_n(λ) converges to the true parameter w_*(λ) in probability as the sample size n → ∞. Furthermore, we give a large deviation bound on the rate of convergence.

From (4), we obtain the following first order condition:

    E_D β(λ, x, y) x y + λ w_*(λ) = 0,        (8)

where β(λ, x, y) = f'(w_*(λ)^T x y - 1) and f'(z) ∈ [-1, 0] denotes a member of the subgradient of f at z [9].² In the finite sample case, we can also interpret β(λ, x, y) in (8) as a scaled dual variable α: β = -α/C, where α appears in the dual (or kernel) formulation of an SVM (for example, see chapter 10 of [11]).

² For readers not familiar with the subgradient concept in convex analysis, our analysis requires little modification if we replace f with a smoother convex function such as f², which avoids the discontinuity in the first order derivative.

The convexity of f implies that f(z_1) + (z_2 - z_1) f'(z_1) ≤ f(z_2) for any subgradient f' of f. This implies the following inequality:

    (1/n) Σ_i f(w_*(λ)^T x^i y^i - 1) + (w_n(λ) - w_*(λ))^T (1/n) Σ_i β(λ, x^i, y^i) x^i y^i ≤ (1/n) Σ_i f(w_n(λ)^T x^i y^i - 1),

which is equivalent to:

    (1/n) Σ_i f(w_*(λ)^T x^i y^i - 1) + (λ/2) w_*(λ)² + (w_n(λ) - w_*(λ))^T [(1/n) Σ_i β(λ, x^i, y^i) x^i y^i + λ w_*(λ)] + (λ/2)(w_*(λ) - w_n(λ))² ≤ (1/n) Σ_i f(w_n(λ)^T x^i y^i - 1) + (λ/2) w_n(λ)².

Also note that by the definition of w_n(λ), we have:

    (1/n) Σ_i f(w_n(λ)^T x^i y^i - 1) + (λ/2) w_n(λ)² ≤ (1/n) Σ_i f(w_*(λ)^T x^i y^i - 1) + (λ/2) w_*(λ)².

Therefore, by comparing the above two inequalities, we obtain:

    (λ/2)(w_*(λ) - w_n(λ))² ≤ (w_*(λ) - w_n(λ))^T [(1/n) Σ_i β(λ, x^i, y^i) x^i y^i + λ w_*(λ)]
                            ≤ ||w_*(λ) - w_n(λ)||_2 ||(1/n) Σ_i β(λ, x^i, y^i) x^i y^i + λ w_*(λ)||_2.

Therefore we have

    ||w_*(λ) - w_n(λ)||_2 ≤ (2/λ) ||(1/n) Σ_i β(λ, x^i, y^i) x^i y^i + λ w_*(λ)||_2
                          = (2/λ) ||(1/n) Σ_i β(λ, x^i, y^i) x^i y^i - E_D β(λ, x, y) x y||_2.        (9)

Note that in (9), we have already bounded the convergence of w_n(λ) to w_*(λ) in terms of the convergence of the empirical expectation of the random vector β(λ, x, y) x y to its mean. In order to obtain a large deviation bound on the convergence rate, we need the following result, which can be found in [13], page 95:

Theorem 3.1 Let u_1, ..., u_n be zero-mean independent random vectors in a Hilbert space. If there exists M > 0 such that for all natural numbers l ≥ 2: Σ_{i=1}^n E ||u_i||_2^l ≤ (n b / 2) l! M^l, then for all δ > 0: P(||(1/n) Σ_i u_i||_2 ≥ δ) ≤ 2 exp(-(n/2) δ² / (b M² + δ M)).

Using the fact that β(λ, x, y) ∈ [-1, 0], it is easy to verify the following corollary by using Theorem 3.1 and (9), where we also bound the l-th moment of the right hand side of (9) using the following form of Jensen's inequality: |a + b|^l ≤ 2^{l-1} (|a|^l + |b|^l) for l ≥ 2.
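For completeness, here is a brief sketch of that verification; the two displays below spell out the step and are not in the original text. Let u_i = β(λ, x^i, y^i) x^i y^i - E_D β(λ, x, y) x y, so that the u_i are zero-mean independent vectors and (9) reads ||w_*(λ) - w_n(λ)||_2 ≤ (2/λ) ||(1/n) Σ_i u_i||_2. Since |β| ≤ 1 and |y| = 1, the stated form of Jensen's inequality gives

    E ||u_i||_2^l ≤ 2^{l-1} (E_D ||x||_2^l + ||E_D β(λ, x, y) x y||_2^l) ≤ 2^l E_D ||x||_2^l ≤ (b/2) l! (2M)^l,

so Theorem 3.1 applies with the constants (b, 2M). Applying it at deviation level λδ/2 yields

    P(||w_*(λ) - w_n(λ)||_2 ≥ δ) ≤ P(||(1/n) Σ_i u_i||_2 ≥ λδ/2) ≤ 2 exp(-(n/8) λ² δ² / (4 b M² + λ δ M)),

which is the bound stated in the corollary below.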
Corollary 3.1 If there exists M > 0 such that for all natural numbers l ≥ 2: E_D ||x||_2^l ≤ (b/2) l! M^l, then for all δ > 0:

    P(||w_*(λ) - w_n(λ)||_2 ≥ δ) ≤ 2 exp(-(n/8) λ² δ² / (4 b M² + λ δ M)).

Let P_D(·) denote the probability with respect to the distribution D. The following bound on the expected misclassification error of the computed hyperplane w_n(λ) is a straightforward consequence of Corollary 3.1:

Corollary 3.2 Under the assumptions of Corollary 3.1, for any non-random values λ, γ, K > 0, we have:

    E_X P_D(w_n(λ)^T x y ≤ 0) ≤ P_D(w_*(λ)^T x y ≤ γ) + P_D(||x||_2 ≥ K) + 2 exp(-(n/8) λ² γ² / (4 b K² M² + λ γ K M)),

where the expectation E_X is taken over n random samples from D, with w_n(λ) estimated from the n samples.

We now consider linearly separable classification problems where the solution w_* of (5) is finite. Throughout the rest of this section, we impose the additional assumption that the distribution D is finitely supported: ||x||_2 ≤ M almost everywhere with respect to the measure D.

From Section 3.1, we know that for any sufficiently small positive number λ, ||w_* - w_*(λ)||_2 < 1/M, which means that w_*(λ) also separates the in-class data from the out-of-class data with a margin of at least 2(1 - M ||w_* - w_*(λ)||_2). Therefore for sufficiently small λ, we can define

    γ(λ) = sup{b : P_D(w_*(λ)^T x y ≤ b) = 0} ≥ 1 - M ||w_* - w_*(λ)||_2 > 0.

By Corollary 3.2, we obtain the following upper bound on the misclassification error if we compute a linear separator from (3) with a small non-zero regularization parameter λ:

    E_X P_D(w_n(λ)^T x y ≤ 0) ≤ 2 exp(-(n/8) λ² γ(λ)² / (4 M⁴ + λ γ(λ) M²)).

This indicates that the expected misclassification error of an appropriately computed hyperplane for a linearly separable problem decays exponentially in n. However, the rate of convergence depends on λ γ(λ) / M². This quantity is different from the margin concept that has been widely used in the literature to characterize the generalization behavior of a linear classification problem. The new quantity measures the convergence rate of w_*(λ) to w_* as λ → 0. The faster the convergence, the more \"robust\" the linear classification problem is, and hence the faster the exponential decay of the misclassification error. As we shall see in the next section, this \"robustness\" is related to the degree of outliers in the problem.

4 Example

We give an example to illustrate the \"robustness\" concept that characterizes the exponential decay of the misclassification error. It is known from Vapnik's cross-validation bound in [11] (Theorem 10.7) that by using the large margin idea alone, one can derive an expected misclassification error bound that is of the order O(1/n), where the constant is margin dependent. We show that this bound is tight by using the following example.
Example 4.1 Consider a two-dimensional problem. Assume that with probability 1 - γ, we observe a data point x with label y such that x y = [1, 0]; and with probability γ, we observe a data point x with label y such that x y = [-1, 1]. This problem is obviously linearly separable with a large margin that is independent of γ.

Now, for n random training data, with probability at most γ^n + (1 - γ)^n, we observe either x^i y^i = [1, 0] for all i = 1, ..., n, or x^i y^i = [-1, 1] for all i = 1, ..., n. In all other cases, the computed optimal hyperplane is w_n = w_*. This means that the misclassification error is γ(1 - γ)(γ^{n-1} + (1 - γ)^{n-1}). This error converges to zero exponentially as n → ∞. However, the convergence rate depends on the fraction of outliers in the distribution, characterized by γ.

In particular, for any n, if we let γ = 1/n, then we have an expected misclassification error that is at least (1/n)(1 - 1/n)^n ≈ 1/(e n). □

The above tightness construction for the linear decay rate of the expected generalization error (using the margin concept alone) requires a scenario in which a small fraction (on the order of the inverse sample size) of the data are very different from the other data. This small portion of the data can be considered as outliers, which can be measured by the \"robustness\" of the distribution. In general, w_*(λ) converges to w_* slowly when there exists such a small portion of data (outliers) that cannot be correctly classified from the observation of the remaining data. It can be seen that the optimal hyperplane in (1) is quite sensitive to even a single outlier. Intuitively, this instability is quite undesirable. However, previous large margin learning bounds seem to have dismissed this concern. This paper indicates that such a concern is still valid. In the worst case, even if the problem is separable by a large margin, outliers can still slow down the exponential convergence rate.
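The arithmetic of Example 4.1 can be confirmed with a small Monte Carlo sketch (illustrative only, not part of the paper): whenever both support points appear in the training sample, the computed hyperplane separates the entire support and the true error is zero; otherwise the unseen point type, carrying mass γ or 1 - γ, is misclassified. The simulated expected error should match the closed form γ(1 - γ)(γ^{n-1} + (1 - γ)^{n-1}) and, for γ = 1/n, stay near 1/(e n).

import numpy as np

def expected_error(n, gamma, trials=200000, seed=0):
    # number of outlier points x y = [-1, 1] in each simulated training set
    k = np.random.default_rng(seed).binomial(n, gamma, size=trials)
    err = np.where(k == 0, gamma,          # only [1, 0] observed: the outliers are misclassified
          np.where(k == n, 1.0 - gamma,    # only [-1, 1] observed: the bulk is misclassified
                   0.0))                   # both observed: w_n = w_*, zero error
    return err.mean()

for n in (10, 20, 40):
    gamma = 1.0 / n                        # the adversarial choice used in the example
    closed_form = gamma * (1 - gamma) * (gamma ** (n - 1) + (1 - gamma) ** (n - 1))
    print(n, expected_error(n, gamma), closed_form, 1.0 / (np.e * n))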
5 Conclusion

In this paper, we derived new generalization bounds for large margin linearly separable classification. Even though we have only discussed the consequences of this analysis for separable problems, the technique can easily be applied to non-separable problems (see Corollary 3.2). For large margin separable problems, we show that an exponential decay of the generalization error may be achieved with an appropriately chosen regularization parameter. However, the bound depends on a quantity that characterizes the robustness of the distribution. An important difference between the robustness concept and the margin concept is that outliers may not be observable with large probability from the data, while the margin generally will be. This implies that, without any prior knowledge, it could be difficult to directly apply our bound using only the observed data.

References

[1] J.K. Anlauf and M. Biehl. The AdaTron: an adaptive perceptron algorithm. Europhys. Lett., 10(7):687-692, 1989.

[2] C. Cortes and V.N. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[3] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[4] Harro G. Heuser. Functional Analysis. John Wiley & Sons Ltd., Chichester, 1982. Translated from the German by John Horvath; a Wiley-Interscience publication.

[5] W. Kinzel. Statistical mechanics of the perceptron with maximal stability. In Lecture Notes in Physics, volume 368, pages 175-188. Springer-Verlag, 1990.

[6] J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132:1-64, 1997.

[7] M. Opper. Learning times of neural networks: Exact solution for a perceptron algorithm. Phys. Rev. A, 38(7):3824-3826, 1988.

[8] M. Opper. Learning in neural networks: Solvable dynamics. Europhysics Letters, 8(4):389-392, 1989.

[9] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.

[10] Dale Schuurmans. Characterizing rational versus exponential learning curves. J. Comput. Syst. Sci., 55:140-160, 1997.

[11] V.N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.

[12] Robert C. Williamson, Alexander J. Smola, and Bernhard Scholkopf. Entropy numbers of linear function classes. In COLT'00, pages 309-319, 2000.

[13] Vadim Yurinsky. Sums and Gaussian Vectors. Springer-Verlag, Berlin, 1995.

[14] Tong Zhang. Analysis of regularized linear functions for classification problems. Technical Report RC-21572, IBM, 1999. Abstract in NIPS'99, pp. 370-376.
", "award": [], "sourceid": 1891, "authors": [{"given_name": "Tong", "family_name": "Zhang", "institution": null}]}