{"title": "Semiparametric Support Vector and Linear Programming Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 585, "page_last": 591, "abstract": null, "full_text": "Semiparametric Support Vector and Linear Programming Machines\n\nAlex J. Smola, Thilo T. Frie\u00df, and Bernhard Sch\u00f6lkopf\n\nGMD FIRST, Rudower Chaussee 5, 12489 Berlin\n\n{smola, friess, bs}@first.gmd.de\n\nAbstract\n\nSemiparametric models are useful tools when domain knowledge exists about the function to be estimated, or when emphasis is put on the understandability of the model. We extend two learning algorithms, Support Vector machines and Linear Programming machines, to this case and give experimental results for SV machines.\n\n1 Introduction\n\nOne of the strengths of Support Vector (SV) machines is that they are nonparametric techniques, where one does not have to specify, e.g., the number of basis functions beforehand. In fact, for many of the kernels used (not the polynomial kernels), such as Gaussian rbf-kernels, it can be shown [6] that SV machines are universal approximators.\n\nWhile this is advantageous in general, parametric models are useful techniques in their own right. Especially if one happens to have additional knowledge about the problem, it would be unwise not to take advantage of it. For instance, it might be the case that the major properties of the data are described by a combination of a small set of linearly independent basis functions $\\{\\phi_1(\\cdot), \\ldots, \\phi_n(\\cdot)\\}$. Or one may want to correct the data for some (e.g. linear) trends. Secondly, it may also be the case that the user wants to have an understandable model, without sacrificing accuracy. For instance, many people in the life sciences tend to have a preference for linear models. 
This may be some motivation to construct semiparametric models, which are both easy to understand (due to the parametric part) and perform well (often due to the nonparametric term). For more advocacy on semiparametric models see [1].\n\nA common approach is to fit the data with the parametric model and then train the nonparametric add-on on the errors of the parametric part. We show in Sec. 4 that this is useful only in a very restricted situation. In general, it is impossible to find the best model amongst a given class for different cost functions by doing so. The better way is to solve a convex optimization problem as in standard SV machines, however with a different set of admissible functions\n\n$f(x) = \\langle w, \\psi(x) \\rangle + \\sum_{i=1}^{n} \\beta_i \\phi_i(x).$ (1)\n\nNote that this is not so different from the classical SV [10] setting, where one uses functions of the type\n\n$f(x) = \\langle w, \\psi(x) \\rangle + b.$ (2)\n\n2 Semiparametric Support Vector Machines\n\nLet us now treat this setting more formally. For the sake of simplicity in the exposition we will restrict ourselves to the case of SV regression and only deal with the $\\varepsilon$-insensitive loss function $|\\xi|_\\varepsilon = \\max\\{0, |\\xi| - \\varepsilon\\}$. Extensions of this setting are straightforward and follow the lines of [7].\n\nGiven a training set of size $\\ell$, $X := \\{(x_1, y_1), \\ldots, (x_\\ell, y_\\ell)\\}$, one tries to find a function $f$ that minimizes the functional of the expected risk$^1$\n\n$R[f] = \\int c(f(x) - y)\\, p(x, y)\\, dx\\, dy.$ (3)\n\nHere $c(\\xi)$ denotes a cost function, i.e. how much deviations between prediction and actual training data should be penalized. Unless stated otherwise we will use $c(\\xi) = |\\xi|_\\varepsilon$. 
As we do not know $p(x, y)$ we can only compute the empirical risk $R_{\\mathrm{emp}}[f]$ (i.e. the training error). Yet, minimizing the latter is not a good idea if the model class is sufficiently rich, as it will lead to overfitting. Hence one adds a regularization term $T[f]$ and minimizes the regularized risk functional\n\n$R_{\\mathrm{reg}}[f] = \\sum_{i=1}^{\\ell} c(f(x_i) - y_i) + \\lambda T[f]$ with $\\lambda > 0$. (4)\n\nThe standard choice in SV regression is to set $T[f] = \\frac{1}{2}\\|w\\|^2$.\n\nThis is the point of departure from the standard SV approach. While in the latter $f$ is described by (2), we will expand $f$ in terms of (1). Effectively this means that there exist functions $\\phi_1(\\cdot), \\ldots, \\phi_n(\\cdot)$ whose contribution is not regularized at all. If $n$ is sufficiently smaller than $\\ell$ this need not be a major concern, as the VC-dimension of this additional class of linear models is $n$, hence the overall capacity control will still work, provided the nonparametric part is restricted sufficiently. Figure 1 explains the effect of choosing a different structure in detail.\n\n$^1$ More general definitions, mainly in terms of the cost function, do exist but for the sake of clarity in the exposition we ignored these cases. See [10] or [7] for further details on alternative definitions of risk functionals.\n\nFigure 1: Two different nested subsets (solid and dotted lines) of hypotheses and the optimal model (+) in the realizable case. Observe that the optimal model is already contained in a much smaller subset of the structure with solid lines than in the structure denoted by the dotted lines (in this diagram size corresponds to the capacity of a subset). Hence prior knowledge in choosing the structure can have a large effect on generalization bounds and performance.\n\nSolving the optimization equations for this particular choice of a regularization term, with expansion (1), the $\\varepsilon$-insensitive loss function and introducing kernels following [2], we arrive at the following primal optimization problem:\n\nminimize $\\frac{\\lambda}{2}\\|w\\|^2 + \\sum_{i=1}^{\\ell} (\\xi_i + \\xi_i^*)$\n\nsubject to\n$\\langle w, \\psi(x_i) \\rangle + \\sum_{j=1}^{n} \\beta_j \\phi_j(x_i) - y_i \\le \\varepsilon + \\xi_i$\n$y_i - \\langle w, \\psi(x_i) \\rangle - \\sum_{j=1}^{n} \\beta_j \\phi_j(x_i) \\le \\varepsilon + \\xi_i^*$\n$\\xi_i, \\xi_i^* \\ge 0$ (5)\n\nHere $k(x, x')$ has been written as $\\langle \\psi(x), \\psi(x') \\rangle$. Solving (5) for its Wolfe dual yields\n\nmaximize $-\\frac{1}{2} \\sum_{i,j=1}^{\\ell} (\\alpha_i - \\alpha_i^*)(\\alpha_j - \\alpha_j^*) k(x_i, x_j) - \\varepsilon \\sum_{i=1}^{\\ell} (\\alpha_i + \\alpha_i^*) + \\sum_{i=1}^{\\ell} y_i (\\alpha_i - \\alpha_i^*)$\n\nsubject to\n$\\sum_{i=1}^{\\ell} (\\alpha_i - \\alpha_i^*) \\phi_j(x_i) = 0$ for all $1 \\le j \\le n$\n$\\alpha_i, \\alpha_i^* \\in [0, 1/\\lambda]$ (6)\n\nNote the similarity to the standard SV regression model. The objective function and the box constraints on the Lagrange multipliers $\\alpha_i, \\alpha_i^*$ remain unchanged. The only modification comes from the additional unregularized basis functions. Whereas in the standard SV case we only had a single (constant) function $b \\cdot 1$, we now have an expansion in the basis $\\beta_i \\phi_i(\\cdot)$. This gives rise to $n$ constraints instead of one. Finally $f$ can be found as\n\n$f(x) = \\sum_{i=1}^{\\ell} (\\alpha_i - \\alpha_i^*) k(x_i, x) + \\sum_{i=1}^{n} \\beta_i \\phi_i(x)$ since $w = \\sum_{i=1}^{\\ell} (\\alpha_i - \\alpha_i^*) \\psi(x_i)$. (7)\n\nThe only difficulty remaining is how to determine $\\beta_i$. 
This can be done by exploiting the Karush-Kuhn-Tucker optimality conditions, or much more easily, by using an interior point optimization code [9]. In the latter case the variables $\\beta_i$ can be obtained as the dual variables of the dual (dual dual = primal) optimization problem (6) as a by-product of the optimization process. This is also how these variables have been obtained in the experiments in the current paper.\n\n3 Semiparametric Linear Programming Machines\n\nEquation (4) gives rise to the question whether completely different choices of regularization functionals would not also lead to good algorithms. Again we will allow functions as described in (7). Possible choices are\n\n$T[f] = \\frac{1}{2}\\|w\\|^2 + \\sum_{i=1}^{n} |\\beta_i|$ (8)\n\nor\n\n$T[f] = \\sum_{i=1}^{\\ell} |\\alpha_i - \\alpha_i^*|$ (9)\n\nor\n\n$T[f] = \\sum_{i=1}^{\\ell} |\\alpha_i - \\alpha_i^*| + \\frac{1}{2} \\sum_{i,j=1}^{n} \\beta_i \\beta_j M_{ij}$ (10)\n\nfor some positive semidefinite matrix $M$. This is a simple extension of existing methods like Basis Pursuit [3] or Linear Programming Machines for classification (see e.g. [4]). The basic idea in all these approaches is to have two different sets of basis functions that are regularized differently, or where a subset may not be regularized at all. This is an efficient way of encoding prior knowledge or the preference of the user, as the emphasis will obviously be put mainly on the functions with little or no regularization at all. Eq. (8) is essentially the SV estimation model where an additional linear regularization term has been added for the parametric part. 
In this case the constraints of the optimization problem (6) change into\n\n$-1 \\le \\sum_{i=1}^{\\ell} (\\alpha_i - \\alpha_i^*) \\phi_j(x_i) \\le 1$ for all $1 \\le j \\le n$\n$\\alpha_i, \\alpha_i^* \\in [0, 1/\\lambda]$ (11)\n\nIt makes little sense (from a technical viewpoint) to compute Wolfe's dual objective function in (10), as the problem does not get significantly easier by doing so. The best approach is to solve the corresponding optimization problem directly by some linear or quadratic programming code, e.g. [9]. Finally (10) can be reduced to the case of (8) by renaming variables accordingly and a proper choice of $M$.\n\n4 Why Backfitting is not sufficient\n\nOne might think that the approach presented above is quite unnecessary and overly complicated for semiparametric modelling. In fact, one could try to fit the data to the parametric model first, and then fit the nonparametric part to the residuals. In most cases, however, this does not lead to finding the minimum of (4). We will show this with a simple example.\n\nTake an SV machine with linear kernel (i.e. $k(x, x') = \\langle x, x' \\rangle$) in one dimension and a constant term as parametric part (i.e. $f(x) = wx + \\beta$). This is one of the simplest semiparametric SV machines possible. Now suppose the data was generated by\n\n$y_i = x_i$ where $x_i \\ge 1$ (12)\n\nwithout noise. Clearly then also $y_i \\ge 1$ for all $i$. By construction the best overall fit of the pair $(\\beta, w)$ will be arbitrarily close to $(0, 1)$ if the regularization parameter $\\lambda$ is chosen sufficiently small.\n\nFor backfitting one first carries out the parametric fit to find a constant $\\beta$ minimizing the term $\\sum_{i=1}^{\\ell} c(y_i - \\beta)$. Depending on the chosen cost function $c(\\cdot)$, $\\beta$ will be the mean ($L_2$-error), the median ($L_1$-error), etc., of the set $\\{y_1, \\ldots, y_\\ell\\}$. As all $y_i \\ge 1$, also $\\beta \\ge 1$, which is surely not the optimal solution of the overall problem, as there $\\beta$ would be close to 0 as seen above. Hence not even in the simplest of all settings does backfitting minimize the regularized risk functional, thus one cannot expect the latter to happen in the more complex case either. There exists only one case in which backfitting would suffice, namely if the function spaces spanned by the kernel expansion $\\{k(x_i, \\cdot)\\}$ and $\\{\\phi_i(\\cdot)\\}$ were orthogonal. Consequently, in general one has to jointly solve for both the parametric and the nonparametric part.\n\nFigure 2: Left: Basis functions used in the toy example. Note the different length scales of $\\sin x$ and $\\mathrm{sinc}\\, 2\\pi x$. For convenience the functions were shifted by an offset of 2 and 4 respectively. Right: Training data denoted by '+', nonparametric (dash-dotted line), semiparametric (solid line), and parametric regression (dots). The regularization constant was set to $\\lambda = 2$. Observe that the semiparametric model picks up the characteristic wiggles of the original function.\n\n5 Experiments\n\nThe main goal of the experiments shown is a proof of concept and to display the properties of the new algorithm. We study a modification of the Mexican hat function, namely\n\n$f(x) = \\sin x + \\mathrm{sinc}(2\\pi(x - 5)).$ (13)\n\nData is generated by an additive noise process, i.e. $y_i = f(x_i) + \\xi_i$, where $\\xi_i$ is additive noise. For the experiments we choose Gaussian rbf-kernels with width $\\sigma = 1/4$, normalized to maximum output 1. The noise is uniform with 0.2 standard deviation, and the cost function is the $\\varepsilon$-insensitive loss $|\\cdot|_\\varepsilon$ with $\\varepsilon = 0.05$. Unless stated otherwise averaging is done over 100 datasets with 50 samples each. The $x_i$ are drawn uniformly from the interval $[0, 10]$. 
$L_1$ and $L_2$ errors are computed on the interval $[0, 10]$ with uniform measure. Figure 2 shows the function and typical predictions in the nonparametric, semiparametric, and parametric setting. One can observe that the semiparametric model including $\\sin x$, $\\cos x$ and the constant function as basis functions generalizes better than the standard SV machine. Fig. 3 shows that the generalization performance is better in the semiparametric case. The length of the weight vector of the kernel expansion $\\|w\\|$ is displayed in Fig. 4. It is smaller in the semiparametric case for practical values of the regularization strength. To make a more realistic comparison, model selection (how to determine $1/\\lambda$) was carried out by 10-fold cross validation for both algorithms independently for all 100 datasets. Table 1 shows generalization performance for a nonparametric model, a correctly chosen and an incorrectly chosen semiparametric model. The experiments indicate that cases in which prior knowledge exists on the type of functions to be used will benefit from semiparametric modelling. Future experiments will show how much can be gained in real world examples.\n\nFigure 3: $L_1$ error (left) and $L_2$ error (right) of the nonparametric / semiparametric regression computed on the interval $[0, 10]$ vs. the regularization strength $1/\\lambda$. The dotted lines (although hardly visible) denote the variance of the estimate. 
Note that in both error measures the semiparametric model consistently outperforms the nonparametric one.\n\nFigure 4: Length of the weight vector $w$ in feature space, $\\left(\\sum_{i,j} (\\alpha_i - \\alpha_i^*)(\\alpha_j - \\alpha_j^*) k(x_i, x_j)\\right)^{1/2}$, vs. regularization strength. Note that $\\|w\\|$, controlling the capacity of that part of the function belonging to the kernel expansion, is smaller (for practical choices of the regularization term) in the semiparametric than in the nonparametric model. If this difference is sufficiently large, the overall capacity of the resulting model is smaller in the semiparametric approach. As before, dotted lines indicate the variance.\n\nFigure 5: Estimate of the parameters for $\\sin x$ (top picture) and $\\cos x$ (bottom picture) in the semiparametric model vs. regularization strength $1/\\lambda$. The dotted lines above and below show the variation of the estimate given by its variance. Training set size was $\\ell = 50$. Note the small variation of the estimate. Also note that even in the parametric case $1/\\lambda \\to 0$ neither does the coefficient for $\\sin x$ converge to 1, nor does the corresponding term for $\\cos x$ converge to 0. This is due to the additional frequency contributions of $\\mathrm{sinc}\\, 2\\pi x$.\n\n| | Nonparam. | Semiparam. ($\\sin x, \\cos x, 1$) | Semiparam. ($\\sin 2x, \\cos 2x, 1$) |\n| $L_1$ error | 0.1263 ± 0.0064 (12) | 0.0887 ± 0.0018 (82) | 0.1267 ± 0.0064 (6) |\n| $L_2$ error | 0.1760 ± 0.0097 (12) | 0.1197 ± 0.0046 (82) | 0.1864 ± 0.0124 (6) |\n\nTable 1: $L_1$ and $L_2$ error for model selection by 10-fold cross validation. The correct semiparametric model ($\\sin x, \\cos x, 1$) outperforms the nonparametric model by at least 30%, and has significantly smaller variance. The wrongly chosen semiparametric model ($\\sin 2x, \\cos 2x, 1$), on the other hand, gives performance comparable to the nonparametric one; in fact, no significant performance degradation was noticeable. The number in parentheses denotes the number of trials in which the corresponding model was the best among the three models.\n\n6 Discussion and Outlook\n\nSimilar models have been proposed and explored in the context of smoothing splines. In fact, expansion (7) is a direct result of the representer theorem, however only in the case of regularization in feature space (aka Reproducing Kernel Hilbert Space, RKHS). One can show [5] that the expansion (7) is optimal in the space spanned by the RKHS and the additional set of basis functions.\n\nMoreover, the semiparametric setting arises naturally in the context of conditionally positive definite kernels of order $m$ (see [8]). There, in order to use a set of kernels which do not satisfy Mercer's condition, one has to exclude polynomials up to order $m - 1$. Hence, to cope with that one has to add polynomials back in 'manually', and our approach presents a way of doing that.\n\nAnother application of semiparametric models, besides the conventional approach of treating the nonparametric part as nuisance parameters [1], is the domain of hypothesis testing, e.g. to test whether a parametric model fits the data sufficiently well. This can be achieved in the framework of structural risk minimization [10]: given the different models (nonparametric vs. semiparametric vs. parametric) one can evaluate the bounds on the expected risk and then choose the model with the lowest error bound. Future work will tackle the problem of computing good error bounds of compound hypothesis classes. 
Moreover, it should be easily possible to apply the methods proposed in this paper to Gaussian processes.\n\nAcknowledgements This work was supported in part by grants of the DFG Ja 379/51 and ESPRIT Project Nr. 25387-STORM. The authors thank Peter Bartlett, Klaus-Robert M\u00fcller, Noboru Murata, Takashi Onoda, and Bob Williamson for helpful discussions and comments.\n\nReferences\n\n[1] P.J. Bickel, C.A.J. Klaassen, Y. Ritov, and J.A. Wellner. Efficient and adaptive estimation for semiparametric models. J. Hopkins Press, Baltimore, MD, 1994.\n\n[2] B.E. Boser, I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In COLT'92, pages 144-152, Pittsburgh, PA, 1992.\n\n[3] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University, 1995.\n\n[4] T.T. Frie\u00df and R.F. Harrison. Perceptrons in kernel feature spaces. TR RR-720, University of Sheffield, Sheffield, UK, 1998.\n\n[5] G.S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Statist., 2:495-502, 1971.\n\n[6] C.A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11-22, 1986.\n\n[7] A.J. Smola and B. Sch\u00f6lkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998.\n\n[8] A.J. Smola, B. Sch\u00f6lkopf, and K. M\u00fcller. The connection between regularization operators and support vector kernels. Neural Netw., 11:637-649, 1998.\n\n[9] R.J. Vanderbei. LOQO: An interior point code for quadratic programming. TR SOR-94-15, Statistics and Operations Research, Princeton Univ., NJ, 1994.\n
\n[10] V. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.\n", "award": [], "sourceid": 1575, "authors": [{"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Thilo-Thomas", "family_name": "Frie\u00df", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}