{"title": "The Kernel Trick for Distances", "book": "Advances in Neural Information Processing Systems", "page_first": 301, "page_last": 307, "abstract": null, "full_text": "The Kernel Trick for Distances\n\nBernhard Sch\u00f6lkopf\nMicrosoft Research\n1 Guildhall Street\nCambridge, UK\n\nbs@kyb.tuebingen.mpg.de\n\nAbstract\n\nA method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually distance-based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.\n\n1 Introduction\n\nOne of the crucial ingredients of SVMs is the so-called kernel trick for the computation of dot products in high-dimensional feature spaces using simple functions defined on pairs of input patterns. This trick allows the formulation of nonlinear variants of any algorithm that can be cast in terms of dot products, SVMs being but the most prominent example [13, 8]. Although the mathematical result underlying the kernel trick is almost a century old [6], it was only much later [1, 3, 13] that it was made fruitful for the machine learning community. Kernel methods have since led to interesting generalizations of learning algorithms and to successful real-world applications. The present paper attempts to extend the utility of the kernel trick by looking at the problem of which kernels can be used to compute distances in feature spaces. 
Again, the underlying mathematical results, mainly due to Schoenberg, have been known for a while [7]; some of them have already attracted interest in the kernel methods community in various contexts [11, 5, 15].\n\nLet us consider training data $(x_1, y_1), \\ldots, (x_m, y_m) \\in \\mathcal{X} \\times \\mathcal{Y}$. Here, $\\mathcal{Y}$ is the set of possible outputs (e.g., in pattern recognition, $\\{\\pm 1\\}$), and $\\mathcal{X}$ is some nonempty set (the domain) that the patterns are taken from. We are interested in predicting the outputs $y$ for previously unseen patterns $x$. This is only possible if we have some measure that tells us how $(x, y)$ is related to the training examples. For many problems, the following approach works: informally, we want similar inputs to lead to similar outputs. To formalize this, we have to state what we mean by similar. On the outputs, similarity is usually measured in terms of a loss function. For instance, in the case of pattern recognition, the situation is simple: two outputs can either be identical or different. On the inputs, the notion of similarity is more complex. It hinges on a representation of the patterns and a suitable similarity measure operating on that representation.\n\nOne particularly simple yet surprisingly useful notion of (dis)similarity (the one we will use in this paper) derives from embedding the data into a Euclidean space and utilizing geometrical concepts. For instance, in SVMs, similarity is measured by dot products (i.e. angles and lengths) in some high-dimensional feature space $F$. Formally, the patterns are first mapped into $F$ using $\\phi : \\mathcal{X} \\to F$, $x \\mapsto \\phi(x)$, and then compared using a dot product $\\langle \\phi(x), \\phi(x') \\rangle$. To avoid working in the potentially high-dimensional space $F$, one tries to pick a feature space in which the dot product can be evaluated directly using a nonlinear function in input space, i.e. 
by means of the kernel trick\n\n$k(x, x') = \\langle \\phi(x), \\phi(x') \\rangle$. (1)\n\nOften, one simply chooses a kernel $k$ with the property that there exists some $\\phi$ such that the above holds true, without necessarily worrying about the actual form of $\\phi$; already the existence of the linear space $F$ facilitates a number of algorithmic and theoretical issues. It is well established that (1) works out for Mercer kernels [3, 13], or, equivalently, positive definite kernels [2, 14]. Here and below, indices $i$ and $j$ by default run over $1, \\ldots, m$.\n\nDefinition 1 (Positive definite kernel) A symmetric function $k : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ which for all $m \\in \\mathbb{N}$, $x_i \\in \\mathcal{X}$ gives rise to a positive definite Gram matrix, i.e. for which for all $c_i \\in \\mathbb{R}$ we have\n\n$\\sum_{i,j=1}^{m} c_i c_j K_{ij} \\geq 0$, where $K_{ij} := k(x_i, x_j)$, (2)\n\nis called a positive definite (pd) kernel.\n\nOne particularly intuitive way to construct a feature map satisfying (1) for such a kernel $k$ proceeds, in a nutshell, as follows (for details, see [2]):\n\n1. Define a feature map\n\n$\\phi : \\mathcal{X} \\to \\mathbb{R}^{\\mathcal{X}}$, $x \\mapsto k(\\cdot, x)$. (3)\n\nHere, $\\mathbb{R}^{\\mathcal{X}}$ denotes the space of functions mapping $\\mathcal{X}$ into $\\mathbb{R}$.\n\n2. Turn it into a linear space by forming linear combinations\n\n$f(\\cdot) = \\sum_{i=1}^{m} \\alpha_i k(\\cdot, x_i)$, $g(\\cdot) = \\sum_{j=1}^{m'} \\beta_j k(\\cdot, x'_j)$. (4)\n\n3. Endow it with a dot product $\\langle f, g \\rangle := \\sum_{i=1}^{m} \\sum_{j=1}^{m'} \\alpha_i \\beta_j k(x_i, x'_j)$, and turn it into a Hilbert space $H_k$ by completing it in the corresponding norm.\n\nNote that in particular, by definition of the dot product, $\\langle k(\\cdot, x), k(\\cdot, x') \\rangle = k(x, x')$, hence, in view of (3), we have $k(x, x') = \\langle \\phi(x), \\phi(x') \\rangle$, the kernel trick. This shows that pd kernels can be thought of as (nonlinear) generalizations of one of the simplest similarity measures, the canonical dot product $\\langle x, x' \\rangle$, $x, x' \\in \\mathbb{R}^N$. The question arises as to whether there also exist generalizations of the simplest dissimilarity measure, the distance $\\|x - x'\\|^2$. 
\nClearly, the distance $\\|\\phi(x) - \\phi(x')\\|^2$ in the feature space associated with a pd kernel $k$ can be computed using the kernel trick (1) as $k(x, x) + k(x', x') - 2k(x, x')$. Positive definite kernels are, however, not the full story: there exists a larger class of kernels that can be used as generalized distances, and the following section will describe why.\n\n2 Kernels as Generalized Distance Measures\n\nLet us start by considering how a dot product and the corresponding distance measure are affected by a translation of the data, $x \\mapsto x - x_0$. Clearly, $\\|x - x'\\|^2$ is translation invariant while $\\langle x, x' \\rangle$ is not. A short calculation shows that the effect of the translation can be expressed in terms of $\\|\\cdot - \\cdot\\|^2$ as\n\n$\\langle (x - x_0), (x' - x_0) \\rangle = \\frac{1}{2} \\left( -\\|x - x'\\|^2 + \\|x - x_0\\|^2 + \\|x_0 - x'\\|^2 \\right)$. (5)\n\nNote that this is, just like $\\langle x, x' \\rangle$, still a pd kernel: $\\sum_{i,j} c_i c_j \\langle (x_i - x_0), (x_j - x_0) \\rangle = \\|\\sum_i c_i (x_i - x_0)\\|^2 \\geq 0$. For any choice of $x_0 \\in \\mathcal{X}$, we thus get a similarity measure (5) associated with the dissimilarity measure $\\|x - x'\\|$.\n\nThis naturally leads to the question whether (5) might suggest a connection that holds true also in more general cases: what kind of nonlinear dissimilarity measure do we have to substitute instead of $\\|\\cdot - \\cdot\\|^2$ on the right hand side of (5) to ensure that the left hand side becomes positive definite? The answer is given by a known result. To state it, we first need to define the appropriate class of kernels.\n\nDefinition 2 (Conditionally positive definite kernel) A symmetric function $k : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ which satisfies (2) for all $m \\in \\mathbb{N}$, $x_i \\in \\mathcal{X}$ and for all $c_i \\in \\mathbb{R}$ with\n\n$\\sum_{i=1}^{m} c_i = 0$, (6)\n\nis called a conditionally positive definite (cpd) kernel.\n\nProposition 3 (Connection pd - cpd [2]) Let $x_0 \\in \\mathcal{X}$, and let $k$ be a symmetric kernel on $\\mathcal{X} \\times \\mathcal{X}$. Then\n\n$\\tilde{k}(x, x') := \\frac{1}{2} \\left( k(x, x') - k(x, x_0) - k(x_0, x') + k(x_0, x_0) \\right)$ (7)\n\nis positive definite if and only if $k$ is conditionally positive definite.\n\nThe proof follows directly from the definitions and can be found in [2].\n\nThis result does generalize (5): the negative squared distance kernel is indeed cpd, for $\\sum_i c_i = 0$ implies $-\\sum_{i,j} c_i c_j \\|x_i - x_j\\|^2 = -\\sum_i c_i \\sum_j c_j \\|x_j\\|^2 - \\sum_j c_j \\sum_i c_i \\|x_i\\|^2 + 2 \\sum_{i,j} c_i c_j \\langle x_i, x_j \\rangle = 2 \\sum_{i,j} c_i c_j \\langle x_i, x_j \\rangle = 2 \\|\\sum_i c_i x_i\\|^2 \\geq 0$. In fact, this implies that all kernels of the form\n\n$k(x, x') = -\\|x - x'\\|^{\\beta}$, $0 < \\beta \\leq 2$ (8)\n\nare cpd (they are not pd), by application of the following result:\n\nProposition 4 ([2]) If $k : \\mathcal{X} \\times \\mathcal{X} \\to \\, ]-\\infty, 0]$ is cpd, then so are $-(-k)^{\\alpha}$ ($0 < \\alpha < 1$) and $-\\log(1 - k)$.\n\nTo state another class of cpd kernels that are not pd, note first that as trivial consequences of Definition 2, we know that (i) sums of cpd kernels are cpd, and (ii) any constant $b \\in \\mathbb{R}$ is cpd. Therefore, any kernel of the form $k + b$, where $k$ is cpd and $b \\in \\mathbb{R}$, is also cpd. In particular, since pd kernels are cpd, we can take any pd kernel and offset it by $b$ and it will still be at least cpd. For further examples of cpd kernels, cf. [2, 14, 4, 11].\n\nWe now return to the main flow of the argument. Proposition 3 allows us to construct the feature map for $k$ from that of the pd kernel $\\tilde{k}$. To this end, fix $x_0 \\in \\mathcal{X}$ and define $\\tilde{k}$ according to (7). Due to Proposition 3, $\\tilde{k}$ is positive definite. Therefore, we may employ the Hilbert space representation $\\phi : \\mathcal{X} \\to H$ of $\\tilde{k}$ (cf. (1)), satisfying $\\langle \\phi(x), \\phi(x') \\rangle = \\tilde{k}(x, x')$, hence\n\n$\\|\\phi(x) - \\phi(x')\\|^2 = \\langle \\phi(x) - \\phi(x'), \\phi(x) - \\phi(x') \\rangle = \\tilde{k}(x, x) + \\tilde{k}(x', x') - 2\\tilde{k}(x, x')$. (9)\n\nSubstituting (7) yields\n\n$\\|\\phi(x) - \\phi(x')\\|^2 = -k(x, x') + \\frac{1}{2} \\left( k(x, x) + k(x', x') \\right)$. (10)\n\nWe thus have proven the following result.\n\nProposition 5 (Hilbert space representation of cpd kernels [7, 2]) Let $k$ be a real-valued conditionally positive definite kernel on $\\mathcal{X}$, satisfying $k(x, x) = 0$ for all $x \\in \\mathcal{X}$. Then there exists a Hilbert space $H$ of real-valued functions on $\\mathcal{X}$, and a mapping $\\phi : \\mathcal{X} \\to H$, such that\n\n$\\|\\phi(x) - \\phi(x')\\|^2 = -k(x, x')$. (11)\n\nIf we drop the assumption $k(x, x) = 0$, the Hilbert space representation reads\n\n$\\|\\phi(x) - \\phi(x')\\|^2 = -k(x, x') + \\frac{1}{2} \\left( k(x, x) + k(x', x') \\right)$. (12)\n\nIt can be shown that if $k(x, x) = 0$ for all $x \\in \\mathcal{X}$, then $d(x, x') := \\sqrt{-k(x, x')} = \\|\\phi(x) - \\phi(x')\\|$ is a semi-metric; it is a metric if $k(x, x') \\neq 0$ for $x \\neq x'$ [2].\n\nWe next show how to represent general symmetric kernels (thus in particular cpd kernels) as symmetric bilinear forms $Q$ in feature spaces. This generalization of the previously known feature space representation for pd kernels comes at a cost: $Q$ will no longer be a dot product. For our purposes, we can get away with this. The result will give us an intuitive understanding of Proposition 3: we can then write $\\tilde{k}$ as $\\tilde{k}(x, x') := Q(\\phi(x) - \\phi(x_0), \\phi(x') - \\phi(x_0))$. Proposition 3 thus essentially adds an origin in feature space which corresponds to the image $\\phi(x_0)$ of one point $x_0$ under the feature map. For translation invariant algorithms, we are always allowed to do this, and thus turn a cpd kernel into a pd one; in this sense, cpd kernels are \"as good as\" pd kernels.\n\nProposition 6 (Vector space representation of symmetric kernels) Let $k$ be a real-valued symmetric kernel on $\\mathcal{X}$. Then there exists a linear space $H$ of real-valued functions on $\\mathcal{X}$, endowed with a symmetric bilinear form $Q(\\cdot, \\cdot)$, and a mapping $\\phi : \\mathcal{X} \\to H$, such that\n\n$k(x, x') = Q(\\phi(x), \\phi(x'))$. 
\n\n(13)\n\nProof The proof is a direct modification of the pd case. We use the map (3) and linearly complete the image as in (4). Define $Q(f, g) := \\sum_{i=1}^{m} \\sum_{j=1}^{m'} \\alpha_i \\beta_j k(x_i, x'_j)$. To see that it is well-defined, although it explicitly contains the expansion coefficients (which need not be unique), note that $Q(f, g) = \\sum_{j=1}^{m'} \\beta_j f(x'_j)$, independent of the $\\alpha_i$. Similarly, for $g$, note that $Q(f, g) = \\sum_i \\alpha_i g(x_i)$, hence it is independent of the $\\beta_j$. The last two equations also show that $Q$ is bilinear; clearly, it is symmetric. \u2022\n\nNote, moreover, that by definition of $Q$, $k$ is a reproducing kernel for the feature space (which is not a Hilbert space): for all functions $f$ of the form (4), we have $Q(k(\\cdot, x), f) = f(x)$; in particular, $Q(k(\\cdot, x), k(\\cdot, x')) = k(x, x')$.\n\nRewriting $\\tilde{k}$ as $\\tilde{k}(x, x') := Q(\\phi(x) - \\phi(x_0), \\phi(x') - \\phi(x_0))$ suggests an immediate generalization of Proposition 3: in practice, we might want to choose other points as origins in feature space, points that do not have a preimage $x_0$ in input space, such as (usually) the mean of a set of points (cf. [12]). This will be useful when considering kernel PCA. Crucial is only that our reference point's behaviour under translations is identical to that of individual points. This is taken care of by the constraint on the sum of the $c_i$ in the following proposition. The asterisk denotes the complex conjugated transpose.\n\nProposition 7 (Exercise 2.23, [2]) Let $K$ be a symmetric matrix, $e \\in \\mathbb{R}^m$ be the vector of all ones, $J$ the $m \\times m$ identity matrix, and let $c \\in \\mathbb{C}^m$ satisfy $e^* c = 1$. Then\n\n$\\tilde{K} := (J - e c^*) K (J - c e^*)$ (14)\n\nis positive definite if and only if $K$ is conditionally positive definite.\n\nProof \"$\\Rightarrow$\": suppose $\\tilde{K}$ is positive definite, i.e. for any $a \\in \\mathbb{C}^m$, we have\n\n$0 \\leq a^* \\tilde{K} a = a^* K a + a^* e c^* K c e^* a - a^* K c e^* a - a^* e c^* K a$. (15)\n\nIn the case $a^* e = e^* a = 0$ (cf. 
(6)), the last three terms vanish, i.e. $0 \\leq a^* K a$, proving that $K$ is conditionally positive definite.\n\n\"$\\Leftarrow$\": suppose $K$ is conditionally positive definite. The map $(J - c e^*)$ has its range in the orthogonal complement of $e$, which can be seen by computing, for any $a \\in \\mathbb{C}^m$,\n\n$e^* (J - c e^*) a = e^* a - e^* c e^* a = 0$. (16)\n\nMoreover, being symmetric and satisfying $(J - c e^*)^2 = (J - c e^*)$, the map $(J - c e^*)$ is a projection. Thus $\\tilde{K}$ is the restriction of $K$ to the orthogonal complement of $e$, and by definition of conditional positive definiteness, that is precisely the space where $K$ is positive definite. \u2022\n\nThis result directly implies a corresponding generalization of Proposition 3:\n\nProposition 8 (Adding a general origin) Let $k$ be a symmetric kernel, $x_1, \\ldots, x_m \\in \\mathcal{X}$, and let $c_i \\in \\mathbb{C}$ satisfy $\\sum_{i=1}^{m} c_i = 1$. Then\n\n$\\tilde{k}(x, x') := \\frac{1}{2} \\left( k(x, x') - \\sum_{i=1}^{m} c_i k(x, x_i) - \\sum_{i=1}^{m} c_i k(x_i, x') + \\sum_{i,j=1}^{m} c_i c_j k(x_i, x_j) \\right)$ (17)\n\nis positive definite if and only if $k$ is conditionally positive definite.\n\nProof Consider a set of points $x'_1, \\ldots, x'_{m'}$, $m' \\in \\mathbb{N}$, $x'_i \\in \\mathcal{X}$, and let $K$ be the $(m + m') \\times (m + m')$ Gram matrix based on $x_1, \\ldots, x_m, x'_1, \\ldots, x'_{m'}$. Apply Proposition 7 using $c_{m+1} = \\ldots = c_{m+m'} = 0$. \u2022\n\nExample 9 (SVMs and kernel PCA) (i) The above results show that conditionally positive definite kernels are a natural choice whenever we are dealing with a translation invariant problem, such as the SVM: maximization of the margin of separation between two classes of data is independent of the origin's position. Seen in this light, it is not surprising that the structure of the dual optimization problem (cf. [13]) allows cpd kernels: as noticed in [11, 10], the constraint $\\sum_{i=1}^{m} \\alpha_i y_i = 0$ projects out the same subspace as (6) in the definition of conditionally positive definite kernels. 
\n\n(ii) Another example of a kernel algorithm that works with conditionally positive definite kernels is kernel PCA [9], where the data is centered, thus removing the dependence on the origin in feature space. Formally, this follows from Proposition 7 for $c_i = 1/m$.\n\nExample 10 (Parzen windows) One of the simplest distance-based classification algorithms conceivable proceeds as follows. Given $m_+$ points labelled with $+1$, $m_-$ points labelled with $-1$, and a test point $\\phi(x)$, we compute the mean squared distances between the latter and the two classes, and assign it to the one where this mean is smaller,\n\n$y = \\mathrm{sgn} \\left( \\frac{1}{m_-} \\sum_{y_i = -1} \\|\\phi(x) - \\phi(x_i)\\|^2 - \\frac{1}{m_+} \\sum_{y_i = 1} \\|\\phi(x) - \\phi(x_i)\\|^2 \\right)$. (18)\n\nWe use the distance kernel trick (Proposition 5) to express the decision function as a kernel expansion in input space: a short calculation shows that\n\n$y = \\mathrm{sgn} \\left( \\frac{1}{m_+} \\sum_{y_i = 1} k(x, x_i) - \\frac{1}{m_-} \\sum_{y_i = -1} k(x, x_i) + c \\right)$, (19)\n\nwith the constant offset $c = \\frac{1}{2 m_-} \\sum_{y_i = -1} k(x_i, x_i) - \\frac{1}{2 m_+} \\sum_{y_i = 1} k(x_i, x_i)$. Note that for some cpd kernels, such as (8), $k(x_i, x_i)$ is always $0$, thus $c = 0$. For others, such as the commonly used Gaussian kernel, $k(x_i, x_i)$ is a nonzero constant, in which case $c$ also vanishes.\n\nFor normalized Gaussians and other kernels that are valid density models, the resulting decision boundary can be interpreted as the Bayes decision based on two Parzen windows density estimates of the classes; for general cpd kernels, the analogy is a mere formal one.\n\nExample 11 (Toy experiment) In Fig. 1, we illustrate the finding that kernel PCA can be carried out using cpd kernels. We use the kernel (8). Due to the centering that is built into kernel PCA (cf. Example 9, (ii), and (5)), the case $\\beta = 2$ actually is equivalent to linear PCA. As we decrease $\\beta$, we obtain increasingly nonlinear feature extractors. 
\n\nNote, moreover, that as the kernel parameter $\\beta$ gets smaller, less weight is put on large distances, and we get more localized feature extractors (in the sense that the regions where they have large gradients, i.e. dense sets of contour lines in the plot, get more localized).\n\nFigure 1: Kernel PCA on a toy dataset using the cpd kernel (8); contour plots of the feature extractors corresponding to projections onto the first two principal axes in feature space. From left to right: $\\beta = 2, 1.5, 1, 0.5$. Notice how smaller values of $\\beta$ make the feature extractors increasingly nonlinear, which allows the identification of the cluster structure.\n\n3 Conclusion\n\nWe have described a kernel trick for distances in feature spaces. It can be used to generalize all distance based algorithms to a feature space setting by substituting a suitable kernel function for the squared distance. The class of kernels that can be used is larger than those commonly used in kernel methods (known as positive definite kernels). We have argued that this reflects the translation invariance of distance based algorithms, as opposed to genuinely dot product based algorithms. SVMs and kernel PCA are translation invariant in feature space, hence they are really both distance rather than dot product based. We thus argued that they can both use conditionally positive definite kernels. In the case of the SVM, this drops out of the optimization problem automatically [11], in the case of kernel PCA, it corresponds to the introduction of a reference point in feature space. The contribution of the present work is that it identifies translation invariance as the underlying reason, thus enabling us to use cpd kernels in a much larger class of kernel algorithms, and that it draws the learning community's attention to the kernel trick for distances. 
\n\nAcknowledgments. Part of the work was done while the author was visiting the Australian National University. Thanks to Nello Cristianini, Ralf Herbrich, Sebastian Mika, Klaus M\u00fcller, John Shawe-Taylor, Alex Smola, Mike Tipping, Chris Watkins, Bob Williamson, Chris Williams and a conscientious anonymous reviewer for valuable input.\n\nReferences\n\n[1] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Autom. and Remote Contr., 25:821-837, 1964.\n\n[2] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.\n\n[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 1992. ACM Press.\n\n[4] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.\n\n[5] D. Haussler. Convolutional kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999.\n\n[6] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446, 1909.\n\n[7] I. J. Schoenberg. Metric spaces and positive definite functions. Trans. Amer. Math. Soc., 44:522-536, 1938.\n\n[8] B. Sch\u00f6lkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1999.\n\n[9] B. Sch\u00f6lkopf, A. Smola, and K.-R. M\u00fcller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.\n\n[10] A. 
Smola, T. Frie\u00df, and B. Sch\u00f6lkopf. Semiparametric support vector and linear programming machines. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 585-591, Cambridge, MA, 1999. MIT Press.\n\n[11] A. Smola, B. Sch\u00f6lkopf, and K.-R. M\u00fcller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998.\n\n[12] W. S. Torgerson. Theory and Methods of Scaling. Wiley, New York, 1958.\n\n[13] V. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.\n\n[14] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.\n\n[15] C. Watkins, 2000. personal communication.\n", "award": [], "sourceid": 1862, "authors": [{"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}