{"title": "Learning with Transformation Invariant Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1561, "page_last": 1568, "abstract": null, "full_text": "Learning with Transformation Invariant Kernels\n\nChristian Walder\n\nchristian.walder@tuebingen.mpg.de\n\nMax Planck Institute for Biological Cybernetics\n\n72076 T\u00a8ubingen, Germany\n\nOlivier Chapelle\nYahoo! Research\nSanta Clara, CA\n\nchap@yahoo-inc.com\n\nAbstract\n\nThis paper considers kernels invariant to translation, rotation and dilation. We\nshow that no non-trivial positive de\ufb01nite (p.d.) kernels exist which are radial and\ndilation invariant, only conditionally positive de\ufb01nite (c.p.d.) ones. Accordingly,\nwe discuss the c.p.d. case and provide some novel analysis, including an elemen-\ntary derivation of a c.p.d. representer theorem. On the practical side, we give a\nsupport vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thin-\nplate kernel this leads to a classi\ufb01er with only one parameter (the amount of regu-\nlarisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian\nkernel, even though the Gaussian involves a second parameter (the length scale).\n\n1 Introduction\n\nRecent years have seen widespread application of reproducing kernel Hilbert space (r.k.h.s.) based\nmethods to machine learning problems (Sch\u00a8olkopf & Smola, 2002). As a result, kernel methods\nhave been analysed to considerable depth. In spite of this, the aspects which we presently investigate\nseem to have received insuf\ufb01cient attention, at least within the machine learning community.\nThe \ufb01rst is transformation invariance of the kernel, a topic touched on in (Fleuret & Sahbi, 2003).\nNote we do not mean by this the local invariance (or insensitivity) of an algorithm to application\nspeci\ufb01c transformations which should not affect the class label, such as one pixel image translations\n(see e.g. 
(Chapelle & Schölkopf, 2001)). Rather we are referring to global invariance to transformations, in the way that radial kernels (i.e. those of the form k(x, y) = φ(‖x − y‖)) are invariant to translations. In Sections 2 and 3 we introduce the more general concept of transformation scaledness, focusing on translation, dilation and orthonormal transformations. An interesting result is that there exist no non-trivial p.d. kernel functions which are radial and dilation scaled.
There do exist non-trivial c.p.d. kernels with the stated invariances however. Motivated by this, we analyse the c.p.d. case in Section 4, giving novel elementary derivations of some key results, most notably a c.p.d. representer theorem. We then give in Section 6.1 an algorithm for applying the s.v.m. with arbitrary c.p.d. kernel functions. It turns out that this is rather useful in practice, for the following reason. Due to its invariances, the c.p.d. thin-plate kernel which we discuss in Section 5 is not only richly non-linear, but enjoys a duality between the length-scale parameter and the regularisation parameter of Tikhonov regularised solutions such as the s.v.m. In Section 7 we compare the resulting classifier (which has only a regularisation parameter) to that of the s.v.m. with Gaussian kernel (which has an additional length scale parameter). The results show that the two algorithms perform roughly as well as one another on a wide range of standard machine learning problems, notwithstanding the new method's advantage in having only one free parameter. In Section 8 we make some concluding remarks.

2 Transformation Scaled Spaces and Tikhonov Regularisation

Definition 2.1. Let T be a bijection on X and F a Hilbert space of functions on some non-empty set X such that f ↦ f ∘ T is a bijection on F. 
F is T-scaled if

⟨f, g⟩_F = g_T(F) ⟨f ∘ T, g ∘ T⟩_F    (1)

for all f, g ∈ F, where g_T(F) ∈ R+ is the norm scaling function associated with the operation of T on F. If g_T(F) = 1 we say that F is T-invariant.
The following clarifies the behaviour of Tikhonov regularised solutions in such spaces.
Lemma 2.2. For any Θ : F → R and T such that f ↦ f ∘ T is a bijection of F, if the left hand side is unique then

argmin_{f ∈ F} Θ(f) = ( argmin_{f_T ∈ F} Θ(f_T ∘ T) ) ∘ T.

Proof. Let f* = argmin_{f ∈ F} Θ(f) and f*_T = argmin_{f_T ∈ F} Θ(f_T ∘ T). By definition we have that ∀g ∈ F, Θ(f*_T ∘ T) ≤ Θ(g ∘ T). But since f ↦ f ∘ T is a bijection on F, we also have ∀g ∈ F, Θ(f*_T ∘ T) ≤ Θ(g). Hence, given the uniqueness, this implies f* = f*_T ∘ T.

The following Corollary follows immediately from Lemma 2.2 and Definition 2.1.
Corollary 2.3. Let the L_i be any loss functions. If F is T-scaled and the left hand side is unique then

argmin_{f ∈ F} ( ‖f‖²_F + Σ_i L_i(f(x_i)) ) = ( argmin_{f ∈ F} ( ‖f‖²_F / g_T(F) + Σ_i L_i(f(T x_i)) ) ) ∘ T.

Corollary 2.3 includes various learning algorithms for various choices of L_i; for example the s.v.m. with linear hinge loss for L_i(t) = max(0, 1 − y_i t), and kernel ridge regression for L_i(t) = (y_i − t)². Let us now introduce the specific transformations we will be considering.
Definition 2.4. Let W_s, T_a and O_A be the dilation, translation and orthonormal transformations R^d → R^d defined for s ∈ R \ {0}, a ∈ R^d and orthonormal A : R^d → R^d by W_s x = sx, T_a x = x + a and O_A x = Ax respectively.
Hence, for an r.k.h.s. 
which is W_s-scaled for arbitrary s ≠ 0, training an s.v.m. and dilating the resultant decision function by some amount is equivalent to training the s.v.m. on similarly dilated input patterns but with a regularisation parameter adjusted according to Corollary 2.3.
While (Fleuret & Sahbi, 2003) demonstrated this phenomenon for the s.v.m. with a particular kernel, as we have just seen it is easy to demonstrate for the more general Tikhonov regularisation setting with any function norm satisfying our definition of transformation scaledness.

3 Transformation Scaled Reproducing Kernel Hilbert Spaces

We now derive the necessary and sufficient conditions for a reproducing kernel (r.k.) to correspond to an r.k.h.s. which is T-scaled. The relationship between T-scaled r.k.h.s.'s and their r.k.'s is easy to derive given the uniqueness of the r.k. (Wendland, 2004). It is given by the following novel
Lemma 3.1 (Transformation scaled r.k.h.s.). The r.k.h.s. H with r.k. k : X × X → R, i.e. with k satisfying

⟨k(·, x), f(·)⟩_H = f(x),    (2)

is T-scaled iff

k(x, y) = g_T(H) k(T x, T y).    (3)

We prove this in the accompanying technical report (Walder & Chapelle, 2007). It is now easy to see that, for example, the homogeneous polynomial kernel k(x, y) = ⟨x, y⟩^p corresponds to a W_s-scaled r.k.h.s. H with g_{W_s}(H) = ⟨x, y⟩^p / ⟨sx, sy⟩^p = s^{−2p}. Hence when the homogeneous polynomial kernel is used with the hard-margin s.v.m. algorithm, the result is invariant to multiplicative scaling of the training and test data. If the soft-margin s.v.m. is used however, then the invariance holds only under appropriate scaling (as per Corollary 2.3) of the margin softness parameter (i.e. λ of equation (14) below).
We can now show that there exist no non-trivial r.k.h.s.'s with radial kernels that are also W_s-scaled for all s ≠ 0. 
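The scaling relation (3), and the value g_{W_s}(H) = s^{−2p} for the homogeneous polynomial kernel noted above, can be checked numerically. A minimal sketch (NumPy, with arbitrary test vectors of our choosing):

```python
import numpy as np

rng = np.random.default_rng(0)

def k_poly(x, y, p):
    # Homogeneous polynomial kernel k(x, y) = <x, y>^p.
    return np.dot(x, y) ** p

p, s = 3, 2.5
x, y = rng.standard_normal(4), rng.standard_normal(4)

# Norm scaling function g_{W_s}(H) = s^(-2p) under the dilation W_s x = s x.
g_ws = s ** (-2 * p)

# Relation (3): k(x, y) = g_{W_s}(H) k(W_s x, W_s y).
assert np.isclose(k_poly(x, y, p), g_ws * k_poly(s * x, s * y, p))
```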
First however we need the following standard result on homogeneous functions:
Lemma 3.2. If φ : [0, ∞) → R and g : (0, ∞) → R satisfy φ(r) = g(s)φ(rs) for all r ≥ 0 and s > 0, then φ(r) = aδ(r) + br^p and g(s) = s^{−p}, where a, b, p ∈ R, p ≠ 0, and δ is Dirac's function.
We prove this in the accompanying technical report (Walder & Chapelle, 2007). Now, suppose that H is an r.k.h.s. with r.k. k on R^d × R^d. If H is T_a-invariant for all a ∈ R^d then

k(x, y) = k(T_{−y} x, T_{−y} y) = k(x − y, 0) =: φ_T(x − y).

If in addition to this H is O_A-invariant for all orthogonal A, then by choosing A such that A(x − y) = ‖x − y‖ ê, where ê is an arbitrary unit vector in R^d, we have

k(x, y) = k(O_A x, O_A y) = φ_T(O_A(x − y)) = φ_T(‖x − y‖ ê) =: φ_OT(‖x − y‖),

i.e. k is radial. All of this is straightforward, and a similar analysis can be found in (Wendland, 2004). Indeed the widely used Gaussian kernel satisfies both of the above invariances. But if we now also assume that H is W_s-scaled for all s ≠ 0, this time with arbitrary g_{W_s}(H), then

k(x, y) = g_{W_s}(H) k(W_s x, W_s y) = g_{W_{|s|}}(H) φ_OT(|s| ‖x − y‖),

so that letting r = ‖x − y‖ we have that φ_OT(r) = g_{W_{|s|}}(H) φ_OT(|s| r), and hence by Lemma 3.2 that φ_OT(r) = aδ(r) + br^p where a, b, p ∈ R. This is positive semi-definite for the trivial case p = 0, but there are various ways of showing that it cannot be non-trivially positive semi-definite for p ≠ 0. One simple way is to consider two arbitrary vectors x₁ and x₂ such that ‖x₁ − x₂‖ = d > 0. For the corresponding Gram matrix

K := ( a      bd^p )
     ( bd^p   a    )

to be positive semi-definite we require 0 ≤ det(K) = a² − b²d^{2p}, but for arbitrary d > 0 and a < ∞, this implies b = 0. 
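The determinant argument is easy to verify numerically; a small sketch with hypothetical coefficient values a, b, p:

```python
import numpy as np

# Gram matrix of phi(r) = b r^p (b != 0, p != 0) on two points at distance d.
a, b, p = 1.0, 0.5, 1.0          # hypothetical coefficients with b != 0
dets = {}
for d in (1.0, 10.0):
    K = np.array([[a,          b * d ** p],
                  [b * d ** p, a         ]])
    dets[d] = np.linalg.det(K)   # det(K) = a^2 - b^2 d^(2p)

assert dets[1.0] > 0             # small separation: determinant still positive
assert dets[10.0] < 0            # large separation: K is indefinite
```

Since d can be made arbitrarily large, positive semi-definiteness for all point configurations forces b = 0, as claimed.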
This may seem disappointing, but fortunately there do exist c.p.d. kernel functions with the stated properties, such as the thin-plate kernel. We discuss this case in detail in Section 5, after the following particularly elementary and in part novel introduction to c.p.d. kernels.

4 Conditionally Positive Definite Kernels

In the last Section we alluded to c.p.d. kernel functions; these are given by the following
Definition 4.1. A continuous function φ : X × X → R is conditionally positive definite with respect to (w.r.t.) the linear space of functions P if, for all m ∈ N, all {x_i}_{i=1...m} ⊂ X, and all α ∈ R^m \ {0} satisfying Σ_{j=1}^m α_j p(x_j) = 0 for all p ∈ P, the following holds:

Σ_{j,k=1}^m α_j α_k φ(x_j, x_k) > 0.    (4)

Due to the positivity condition (4), as opposed to one of non-negativity, we are referring to c.p.d. rather than conditionally positive semi-definite kernels. The c.p.d. case is more technical than the p.d. case. We provide a minimalistic discussion here; for more details we recommend e.g. (Wendland, 2004). To avoid confusion, let us note in passing that while the above definition is quite standard (see e.g. (Wendland, 2004; Wahba, 1990)), many authors in the machine learning community use a definition of c.p.d. kernels which corresponds to our definition when P = {1} (e.g. (Schölkopf & Smola, 2002)) or when P is taken to be the space of polynomials of some fixed maximum degree (e.g. (Smola et al., 1998)). Let us now adopt the notation P⊥(x₁, ..., x_m) for the set

{α ∈ R^m : Σ_{i=1}^m α_i p(x_i) = 0 for all p ∈ P}.

The c.p.d. kernels of Definition 4.1 naturally define a Hilbert space of functions as per
Definition 4.2. Let φ : X × X → R be a c.p.d. kernel w.r.t. P. 
We define F_φ(X) to be the Hilbert space of functions which is the completion of the set

{ Σ_{j=1}^m α_j φ(·, x_j) : m ∈ N, x₁, ..., x_m ∈ X, α ∈ P⊥(x₁, ..., x_m) },

which due to the definition of φ we may endow with the inner product

⟨ Σ_{j=1}^m α_j φ(·, x_j), Σ_{k=1}^n β_k φ(·, y_k) ⟩_{F_φ(X)} = Σ_{j=1}^m Σ_{k=1}^n α_j β_k φ(x_j, y_k).    (5)

Note that φ is not the r.k. of F_φ(X); in general φ(x, ·) does not even lie in F_φ(X). For the remainder of this Section we develop a c.p.d. analog of the representer theorem. We begin with
Lemma 4.3. Let φ : X × X → R be a c.p.d. kernel w.r.t. P and p₁, ..., p_r a basis for P. For any {(x₁, y₁), ..., (x_m, y_m)} ⊂ X × R, there exists an s = s_{F_φ(X)} + s_P, where s_{F_φ(X)} = Σ_{j=1}^m α_j φ(·, x_j) ∈ F_φ(X) and s_P = Σ_{k=1}^r β_k p_k ∈ P, such that s(x_i) = y_i, i = 1 ... m.
A simple and elementary proof (which shows that (17) is solvable when λ = 0) is given in (Wendland, 2004) and reproduced in the accompanying technical report (Walder & Chapelle, 2007). Note that although such an interpolating function s always exists, it need not be unique. The distinguishing property of the interpolating function is that the norm of the part which lies in F_φ(X) is minimum.
Definition 4.4. Let φ : X × X → R be a c.p.d. kernel w.r.t. P. We use the notation P_φ(P) to denote the projection F_φ(X) ⊕ P → F_φ(X).
Note that F_φ(X) ⊕ P is indeed a direct sum, since p = Σ_{j=1}^m β_j φ(·, z_j) ∈ P ∩ F_φ(X) implies

‖p‖²_{F_φ(X)} = ⟨p, p⟩_{F_φ(X)} = Σ_{i,j=1}^m β_i β_j φ(z_i, z_j) = Σ_{j=1}^m β_j p(z_j) = 0.
Hence, returning to the main thread, we have the following lemma; our proof of it seems to be novel and particularly elementary.
Lemma 4.5. Denote by φ : X × X → R a c.p.d. kernel w.r.t. P and by p₁, ..., p_r a basis for P. Consider an arbitrary function s = s_{F_φ(X)} + s_P with s_{F_φ(X)} = Σ_{j=1}^m α_j φ(·, x_j) ∈ F_φ(X) and s_P = Σ_{k=1}^r β_k p_k ∈ P. Then ‖P_φ(P)s‖_{F_φ(X)} ≤ ‖P_φ(P)f‖_{F_φ(X)} holds for all f ∈ F_φ(X) ⊕ P satisfying

f(x_i) = s(x_i), i = 1 ... m.    (6)

Proof. Let f be an arbitrary element of F_φ(X) ⊕ P. We can always write f as

f = Σ_{j=1}^m (α_j + ᾱ_j) φ(·, x_j) + Σ_{l=1}^n b_l φ(·, z_l) + Σ_{k=1}^r c_k p_k.

If we define [P_x]_{i,j} = p_j(x_i), [P_z]_{i,j} = p_j(z_i), [Φ_xx]_{i,j} = φ(x_i, x_j), [Φ_xz]_{i,j} = φ(x_i, z_j) and [Φ_zx]_{i,j} = φ(z_i, x_j) (square brackets with subscripts denote matrix elements, and colons denote entire rows or columns), then the condition (6) can hence be written

P_x β = Φ_xx ᾱ + Φ_xz b + P_x c,    (7)

and the definition of F_φ(X) requires that the coefficient vectors of the F_φ(X) parts of s and f be orthogonal to P on the respective point sets, hence implying the constraints

P_x^⊤ α = 0 and P_x^⊤ (α + ᾱ) + P_z^⊤ b = 0.    (8)

Writing Φ for the block matrix [Φ_xx, Φ_xz; Φ_zx, Φ_zz], the inequality to be demonstrated is then L ≤ R, where

L := α^⊤ Φ_xx α

and

R := (α + ᾱ; b)^⊤ Φ (α + ᾱ; b) = α^⊤ Φ_xx α + 2 (α; 0)^⊤ Φ (ᾱ; b) + (ᾱ; b)^⊤ Φ (ᾱ; b) =: L + 2Δ₂ + Δ₁.    (9)

It follows from (8) that P_x^⊤ ᾱ + P_z^⊤ b = 0, and hence, since Φ is c.p.d. w.r.t. P evaluated at x₁, ..., x_m, z₁, ..., z_n, that Δ₁ ≥ 0. By expanding, (7) and (8) imply that Δ₂ = 0, since

Δ₂ = α^⊤ Φ_xx ᾱ + α^⊤ Φ_xz b = α^⊤ P_x (β − c) − α^⊤ Φ_xz b + α^⊤ Φ_xz b = 0.

Hence L ≤ R.

Using these results it is now easy to prove an analog of the representer theorem for the p.d. case.
Theorem 4.6 (Representer theorem for the c.p.d. case). Denote by φ : X × X → R a c.p.d. kernel w.r.t. P, by Ω a strictly monotonic increasing real-valued function on [0, ∞), and by c : R^m → R ∪ {∞} an arbitrary cost function. There exists a minimiser over F_φ(X) ⊕ P of

W(f) := c(f(x₁), ..., f(x_m)) + Ω( ‖P_φ(P)f‖²_{F_φ(X)} )    (10)

which admits the form Σ_{i=1}^m α_i φ(·, x_i) + p, where p ∈ P.
Proof. Let f be a minimiser of W. Let s = Σ_{i=1}^m α_i φ(·, x_i) + p satisfy s(x_i) = f(x_i), i = 1 ... m. By Lemma 4.3 we know that such an s exists. But by Lemma 4.5, ‖P_φ(P)s‖²_{F_φ(X)} ≤ ‖P_φ(P)f‖²_{F_φ(X)}. As a result, W(s) ≤ W(f) and s is a minimiser of W with the correct form.

5 Thin-Plate Regulariser

Definition 5.1. The m-th order thin-plate kernel φ_m : R^d × R^d → R is given by

φ_m(x, y) = (−1)^{m−(d−2)/2} ‖x − y‖^{2m−d} log(‖x − y‖)   if d ∈ 2N,
φ_m(x, y) = (−1)^{m−(d−1)/2} ‖x − y‖^{2m−d}                if d ∈ (2N − 1),    (11)

for x ≠ y, and zero otherwise. φ_m is c.p.d. with respect to π_{m−1}(R^d), the set of d-variate polynomials of degree at most m − 1. The kernel induces the following norm on the space F_{φ_m}(R^d) of Definition 4.2 (this is not obvious; see e.g. 
(Wendland, 2004; Wahba, 1990)):

⟨f, g⟩_{F_{φ_m}(R^d)} := ⟨ψf, ψg⟩_{L₂(R^d)} = Σ_{i₁=1}^d ... Σ_{i_m=1}^d ∫_{R^d} ( ∂^m f / ∂x_{i₁} ... ∂x_{i_m} ) ( ∂^m g / ∂x_{i₁} ... ∂x_{i_m} ) dx₁ ... dx_d,

where ψ : F_{φ_m}(R^d) → L₂(R^d) is a regularisation operator, implicitly defined above.
Clearly g_{O_A}(F_{φ_m}(R^d)) = g_{T_a}(F_{φ_m}(R^d)) = 1. Moreover, from the chain rule we have

( ∂^m / ∂x_{i₁} ... ∂x_{i_m} ) (f ∘ W_s) = s^m ( ( ∂^m / ∂x_{i₁} ... ∂x_{i_m} ) f ) ∘ W_s,    (12)

and therefore, since ⟨f, g⟩_{L₂(R^d)} = s^d ⟨f ∘ W_s, g ∘ W_s⟩_{L₂(R^d)}, we can immediately write

⟨ψ(f ∘ W_s), ψ(g ∘ W_s)⟩_{L₂(R^d)} = s^{2m} ⟨(ψf) ∘ W_s, (ψg) ∘ W_s⟩_{L₂(R^d)} = s^{2m−d} ⟨ψf, ψg⟩_{L₂(R^d)},    (13)

so that g_{W_s}(F_{φ_m}(R^d)) = s^{−(2m−d)}. Note that although it may appear that this can be shown more easily using (11) and an argument similar to Lemma 3.1, the process is actually more involved due to the log factor in the first case of (11), and it is necessary to use the fact that the kernel is c.p.d. w.r.t. π_{m−1}(R^d). Since this is redundant and not central to the paper we omit the details.

6 Conditionally Positive Definite s.v.m.

In Section 3 we showed that non-trivial kernels which are both radial and dilation scaled cannot be p.d. but rather only c.p.d. It is therefore somewhat surprising that the s.v.m., one of the most widely used kernel algorithms, has been applied only with p.d. kernels, or kernels which are c.p.d. with respect only to P = {1} (see e.g. (Boughorbel et al., 2005)). 
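The practical consequence of g_{W_s}(F_{φ_m}(R^d)) = s^{−(2m−d)}, combined with Corollary 2.3, can be illustrated numerically for the squared loss: dilating the inputs by s while multiplying λ by s^{2m−d} leaves the fitted function unchanged up to the dilation. The following sketch does this for the second-order thin-plate kernel on R² (m = d = 2, so λ is scaled by s²), solving a regularised interpolation system of the form (17); the code and its parameter choices are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

def tp2(X, Z):
    # Second-order thin-plate kernel in R^2: phi(x, z) = r^2 log r, r = |x - z|.
    r2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        K = 0.5 * r2 * np.log(r2)          # r^2 log r = 0.5 r^2 log r^2
    return np.nan_to_num(K)                # phi = 0 on the diagonal

def fit(X, y, lam):
    # Solve [[Phi + lam I, P], [P^T, 0]] [a; b] = [y; 0], with P spanning pi_1(R^2).
    n = len(X)
    P = np.hstack([np.ones((n, 1)), X])
    A = np.block([[tp2(X, X) + lam * np.eye(n), P],
                  [P.T, np.zeros((3, 3))]])
    sol = np.linalg.solve(A, np.concatenate([y, np.zeros(3)]))
    a, b = sol[:n], sol[n:]
    return lambda Xq: tp2(Xq, X) @ a + np.hstack([np.ones((len(Xq), 1)), Xq]) @ b

X = rng.standard_normal((30, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2
lam, s = 0.1, 3.0

f = fit(X, y, lam)                         # original data, lambda
g = fit(s * X, y, lam * s ** 2)            # dilated data, lambda * s^(2m - d)

Xq = rng.standard_normal((5, 2))
assert np.allclose(f(Xq), g(s * Xq), atol=1e-6)   # f = g o W_s, per Corollary 2.3
```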
After all, it seems interesting to construct a classifier independent not only of the absolute positions of the input data, but also of their absolute multiplicative scale.
Hence we propose using the thin-plate kernel with the s.v.m., by minimising the s.v.m. objective over the space F_φ(X) ⊕ P (or in some cases just over F_φ(X), as we shall see in Section 6.2). For this we require somewhat non-standard s.v.m. optimisation software. The method we propose seems simpler and more robust than previously mentioned solutions. For example, (Smola et al., 1998) mentions the numerical instabilities which may arise with the direct application of standard solvers.

Dataset    Gaussian         Thin-Plate       dim/n
banana     10.567 (0.547)   10.667 (0.586)   2/3000*
breast     26.574 (2.259)   28.026 (2.900)   9/263
diabetes   23.578 (0.989)   23.452 (1.215)   8/768
flare      36.143 (0.969)   38.190 (2.317)   9/144
german     24.700 (1.453)   24.800 (1.373)   20/1000
heart      17.407 (2.142)   17.037 (2.290)   13/270
image      3.210 (0.504)    1.867 (0.338)    18/2086
ringnm     1.533 (0.229)    1.833 (0.200)    20/3000*
splice     8.931 (0.640)    8.651 (0.433)    60/2844
thyroid    4.199 (1.087)    3.247 (1.211)    5/215
twonm      1.833 (0.194)    1.867 (0.254)    20/3000*
wavefm     8.333 (0.378)    8.233 (0.484)    21/3000

Table 1: Comparison of Gaussian and thin-plate kernel with the s.v.m. on the UCI data sets. Results are reported as "mean % classification error (standard error)". dim is the input dimension and n the total number of data points. A star in the n column means that more examples were available but we kept only a maximum of 2000 per class in order to reduce the computational burden of the extensive number of cross validation and model selection training runs (see Section 7). 
None of the data sets were linearly separable, so we always used the normal (β unconstrained) version of the optimisation described in Section 6.1.

6.1 Optimising an s.v.m. with c.p.d. Kernel

It is simple to implement an s.v.m. with a kernel φ which is c.p.d. w.r.t. an arbitrary finite dimensional space of functions P, by extending the primal optimisation approach of (Chapelle, 2007) to the c.p.d. case. The quadratic loss s.v.m. solution can be formulated as the argmin over f ∈ F_φ(X) ⊕ P of

λ ‖P_φ(P)f‖²_{F_φ(X)} + Σ_{i=1}^n max(0, 1 − y_i f(x_i))².    (14)

Note that for the second order thin-plate case we have X = R^d and P = π₁(R^d) (the space of constant and first order polynomials). Hence dim(P) = d + 1 and we can take the basis to be p_j(x) = [x]_j for j = 1 ... d along with p_{d+1} = 1.
It follows immediately from Theorem 4.6 that, letting p₁, p₂, ..., p_{dim(P)} span P, the solution to (14) is given by f_svm(x) = Σ_{i=1}^n α_i φ(x_i, x) + Σ_{j=1}^{dim(P)} β_j p_j(x). Now, if we consider only the margin violators, those vectors which (at a given step of the optimisation process) satisfy y_i f(x_i) < 1, we can replace the max(0, ·) in (14) with (·). This is equivalent to making a local second order approximation. Hence by repeatedly solving in this way while updating the set of margin violators, we will have implemented a so-called Newton optimisation. Now, since

‖P_φ(P) f_svm‖²_{F_φ(X)} = Σ_{i,j=1}^n α_i α_j φ(x_i, x_j),    (15)

the local approximation of the problem is, in α and β,

minimise λ α^⊤ Φ α + ‖Φα + Pβ − y‖²,  subject to P^⊤ α = 0,    (16)

where [Φ]_{i,j} = φ(x_i, x_j), [P]_{j,k} = p_k(x_j), and we assumed for simplicity that all vectors violate the margin. 
The solution in this case is given by (Wahba, 1990)

(α; β) = [λI + Φ, P; P^⊤, 0]^{−1} (y; 0).    (17)

In practice it is essential that one makes a change of variable for β in order to avoid the numerical problems which arise when P is rank deficient or numerically close to it. In particular we make the QR factorisation (Golub & Van Loan, 1996) P = QR, where Q^⊤Q = I and R is square. We then solve for α and β̄ = Rβ. As a final step at the end of the optimisation process, we take the minimum norm solution of the system β̄ = Rβ, namely β = R^# β̄, where R^# is the pseudo-inverse of R. Note that although (17) is standard for squared loss regression models with c.p.d. kernels, our use of it in optimising the s.v.m. is new. The precise algorithm is given in (Walder & Chapelle, 2007), where we also detail two efficient factorisation techniques, specific to the new s.v.m. setting. Moreover, the method we present in Section 6.2 deviates considerably further from the existing literature.

6.2 Constraining β = 0

With the optimisation described previously, if the data can be separated with only the P part of the function space, i.e. with α = 0, then the algorithm will always do so regardless of λ. This is correct in that, since P lies in the null space of the regulariser ‖P_φ(P) ·‖²_{F_φ(X)}, such solutions minimise (14), but it may be undesirable for various reasons. Firstly, the regularisation cannot be controlled via λ. Secondly, for the thin-plate, P = π₁(R^d) and the solutions are simple linear separating hyperplanes. Finally, there may exist infinitely many solutions to (14). It is unclear how to deal with this problem; after all, it implies that the regulariser is simply inappropriate for the problem at hand. 
Nonetheless we still wish to apply a (non-linear) algorithm with the previously discussed invariances of the thin-plate.
To achieve this, we minimise (14) as before, but over the space F_φ(X) rather than F_φ(X) ⊕ P. It is important to note that by doing so we can no longer invoke Theorem 4.6, the representer theorem for the c.p.d. case. This is because the solvability argument of Lemma 4.3 no longer holds. Hence we do not know the optimal basis for the function, which may involve infinitely many φ(·, x) terms. The way we deal with this is simple: instead of minimising over F_φ(X) we consider only the finite dimensional subspace given by

{ Σ_{j=1}^n α_j φ(·, x_j) : α ∈ P⊥(x₁, ..., x_n) },

where x₁, ..., x_n are those of the original problem (14). The required update equation can be acquired in a similar manner as before. The closed form solution to the constrained quadratic programme is in this case given by (see (Walder & Chapelle, 2007))

α = −P⊥ ( P⊥^⊤ (λΦ + Φ_sx^⊤ Φ_sx) P⊥ )^{−1} P⊥^⊤ Φ_sx^⊤ y_s,    (18)

where Φ_sx = [Φ]_{s,:}, s is the current set of margin violators, and P⊥ spans the null space of P^⊤, satisfying P^⊤ P⊥ = 0. The precise algorithm we use to optimise in this manner is given in the accompanying technical report (Walder & Chapelle, 2007), where we also detail efficient factorisation techniques.

7 Experiments and Discussion

We now investigate the behaviour of the algorithms which we have just discussed, namely the thin-plate based s.v.m. with 1) the optimisation over F_φ(X) ⊕ P as per Section 6.1, and 2) the optimisation over a subspace of F_φ(X) as per Section 6.2. In particular, we use the second method if the data is linearly separable, and otherwise we use the first. 
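To make the Section 6.1 procedure concrete, here is a self-contained sketch of the Newton optimisation for the quadratic-hinge s.v.m. with the second-order thin-plate kernel. It solves the KKT system of the local problem (16) directly with a least-squares solve at each step, rather than via (17) and the factorisation techniques of the technical report; the data, λ, and iteration limit are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def tp2(X, Z):
    # Second-order thin-plate kernel on R^2, c.p.d. w.r.t. pi_1(R^2).
    r2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.nan_to_num(0.5 * r2 * np.log(r2))   # r^2 log r, 0 on diagonal

def cpd_svm(X, y, lam, iters=50):
    n = len(X)
    Phi, P = tp2(X, X), np.hstack([np.ones((n, 1)), X])
    r = P.shape[1]
    alpha, beta = np.zeros(n), np.zeros(r)
    for _ in range(iters):
        viol = y * (Phi @ alpha + P @ beta) < 1       # current margin violators
        if not viol.any():
            break
        Ps, Phis, ys = P[viol], Phi[viol], y[viol]
        # KKT system of the local quadratic problem, with multipliers
        # enforcing the constraint P^T alpha = 0.
        Z = np.zeros((r, r))
        A = np.block([[lam * Phi + Phis.T @ Phis, Phis.T @ Ps, P],
                      [Ps.T @ Phis,               Ps.T @ Ps,   Z],
                      [P.T,                       Z,           Z]])
        rhs = np.concatenate([Phis.T @ ys, Ps.T @ ys, np.zeros(r)])
        sol = np.linalg.lstsq(A, rhs, rcond=None)[0]  # lstsq for robustness
        if np.allclose(sol[:n], alpha) and np.allclose(sol[n:n + r], beta):
            break
        alpha, beta = sol[:n], sol[n:n + r]
    return alpha, beta

# XOR-style toy problem: not linearly separable, so the P part alone cannot fit.
X = rng.uniform(-1.0, 1.0, (80, 2))
y = np.sign(X[:, 0] * X[:, 1])
alpha, beta = cpd_svm(X, y, lam=0.01)
f = tp2(X, X) @ alpha + np.hstack([np.ones((80, 1)), X]) @ beta
assert (np.sign(f) == y).mean() > 0.9
```

Because the labels here are not linearly separable, the degenerate α = 0 case discussed above does not arise and the β = 0 variant is unnecessary.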
For a baseline we take the Gaussian kernel k(x, y) = exp(−‖x − y‖² / (2σ²)), and compare on real world classification problems.

Binary classification (UCI data sets). Table 1 provides numerical evidence supporting our claim that the thin-plate method is competitive with the Gaussian, in spite of its having one less hyperparameter. The data sets are standard ones from the UCI machine learning repository. The experiments are extensive; the experiments on binary problems alone include all of the data sets used in (Mika et al., 2003) plus two additional ones (twonorm and splice). To compute each error measure, we used five splits of the data and tested on each split after training on the remainder. For parameter selection, we performed five fold cross validation on the four-fifths of the data available for training each split, over an exhaustive search of the algorithm parameter(s) (σ and λ for the Gaussian and happily just λ for the thin-plate). We then take the parameter(s) with lowest mean error and retrain on the entire four fifths. We ensured that the chosen parameters were well within the searched range by visually inspecting the cross validation error as a function of the parameters. Happily, for the thin-plate we needed to cross validate to choose only the regularisation parameter λ, whereas for the Gaussian we had to choose both λ and the scale parameter σ. The discovery of an equally effective algorithm which has only one parameter is important, since the Gaussian is probably the most popular and effective kernel used with the s.v.m. (Hsu et al., 2003).
Multi class classification (USPS data set). We also experimented with the 256 dimensional, ten class USPS digit recognition problem. For each of the ten one vs. 
the rest models we used five fold cross validation on the 7291 training examples to find the parameters, retrained on the full training set, and labeled the 2007 test examples according to the binary classifier with maximum output. The Gaussian misclassified 88 digits (4.38%), and the thin-plate 85 (4.25%). Hence the Gaussian did not perform significantly better, in spite of the extra parameter.

Computational complexity. The normal computational complexity of the c.p.d. s.v.m. algorithm is the usual O(n_sv³), cubic in the number of margin violators. For the β = 0 variant (necessary only on linearly separable problems, presently only the USPS set) however, the cost is O(n_b² n_sv + n_b³), where n_b is the number of basis functions in the expansion. For our USPS experiments we expanded on all m training points, but if n_sv ≪ m this is inefficient and probably unnecessary. For example the final ten models (those with optimal parameters) of the USPS problem had around 5% margin violators, and so training each Gaussian s.v.m. took only ~40s in comparison to ~17 minutes (with the use of various efficient factorisation techniques as detailed in the accompanying (Walder & Chapelle, 2007)) for the thin-plate. By expanding on only 1500 randomly chosen points however, the training time was reduced to ~4 minutes while incurring only 88 errors, the same as the Gaussian. Given that for the thin-plate cross validation needs to be performed over one less parameter, even in this most unfavourable scenario of n_sv ≪ m, the overall times of the algorithms are comparable. Moreover, during cross validation one typically encounters larger numbers of violators for some suboptimal parameter configurations, in which cases the Gaussian and thin-plate training times are comparable.

8 Conclusion

We have proven that there exist no non-trivial radial p.d. 
kernels which are dilation invariant (or more accurately, dilation scaled), but rather only c.p.d. ones. Such kernels have the advantage that, to take the s.v.m. as an example, varying the absolute multiplicative scale (or length scale) of the data has the same effect as changing the regularisation parameter; hence one needs model selection to choose only one of these, in contrast to the widely used Gaussian kernel for example.
Motivated by this advantage we provide a new, efficient and stable algorithm for the s.v.m. with arbitrary c.p.d. kernels. Importantly, our experiments show that the performance of the algorithm nonetheless matches that of the Gaussian on real world problems.
The c.p.d. case has received relatively little attention in machine learning. Our results indicate that it is time to redress the balance. Accordingly we provided a compact introduction to the topic, including some novel analysis with a new, elementary and self-contained derivation of one particularly important result for the machine learning community, the representer theorem.

References

Boughorbel, S., Tarel, J.-P., & Boujemaa, N. (2005). Conditionally positive definite kernels for svm based image recognition. Proc. of IEEE ICME'05. Amsterdam.
Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19, 1155-1178.
Chapelle, O., & Schölkopf, B. (2001). Incorporating invariances in nonlinear support vector machines. In T. Dietterich, S. Becker and Z. Ghahramani (Eds.), Advances in neural information processing systems 14, 609-616. Cambridge, MA: MIT Press.
Fleuret, F., & Sahbi, H. (2003). Scale-invariance of support vector machines based on the triangular kernel. Proc. of ICCV SCTV Workshop.
Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore MD: The Johns Hopkins University Press. 2nd edition.
Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2003). 
A practical guide to support vector classification (Technical Report). National Taiwan University.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K.-R. (2003). Constructing descriptive and discriminative non-linear features: Rayleigh coefficients in feature spaces. IEEE PAMI, 25, 623-628.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637-649.
Wahba, G. (1990). Spline models for observational data. Philadelphia: Series in Applied Math., Vol. 59, SIAM.
Walder, C., & Chapelle, O. (2007). Learning with transformation invariant kernels (Technical Report 165). Max Planck Institute for Biological Cybernetics, Department of Empirical Inference, Tübingen, Germany.
Wendland, H. (2004). Scattered data approximation. Monographs on Applied and Computational Mathematics. Cambridge University Press.", "award": [], "sourceid": 219, "authors": [{"given_name": "Christian", "family_name": "Walder", "institution": null}, {"given_name": "Olivier", "family_name": "Chapelle", "institution": null}]}