{"title": "Hyperkernels", "book": "Advances in Neural Information Processing Systems", "page_first": 495, "page_last": 502, "abstract": null, "full_text": "Hyperkernels\n\nCheng Soon Ong, Alexander J. Smola, Robert C. Williamson\n\nResearch School of Information Sciences and Engineering\nThe Australian National University\nCanberra, 0200 ACT, Australia\n{Cheng.Ong, Alex.Smola, Bob.Williamson}@anu.edu.au\n\nAbstract\n\nWe consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.\n\n1 Introduction\n\nChoosing suitable kernel functions for estimation using Gaussian Processes and Support Vector Machines is an important step in the inference process. To date, there are few if any systematic techniques to assist in this choice. Even the restricted problem of choosing the \u201cwidth\u201d of a parameterized family of kernels (e.g. Gaussian) has not had a simple and elegant solution.\n\nA recent development [1] which solves the above problem in a restricted sense involves the use of semidefinite programming to learn an arbitrary positive semidefinite matrix $K$, subject to minimization of criteria such as the kernel target alignment [1], the maximum of the posterior probability [2], the minimization of a learning-theoretical bound [3], or subject to cross-validation settings [4]. The restriction mentioned is that the methods work with the kernel matrix, rather than the kernel itself. 
Furthermore, whilst demonstrably improving the performance of estimators to some degree, they require clever parameterization and design to make the method work in particular situations. There are still no general principles to guide the choice of a) which family of kernels to choose, b) efficient parameterizations over this space, and c) suitable penalty terms to combat overfitting. (The last point is particularly an issue when we have a very large set of semidefinite matrices at our disposal.)\n\nWhilst not yet providing a complete solution to these problems, this paper presents a framework that allows optimization within a parameterized family relatively simply and, crucially, intrinsically captures the tradeoff between the size of the family of kernels and the sample size available. Furthermore, the solution presented is for optimizing kernels themselves, rather than the kernel matrix as in [1]. Other approaches to learning the kernel include boosting [5] and bounding the Rademacher complexity [6].\n\nOutline of the Paper We show (Section 2) that for most kernel-based learning methods there exists a functional, the quality functional$^1$, which plays a similar role to the empirical risk functional, and that subsequently (Section 3) the introduction of a kernel on kernels, a so-called hyperkernel, in conjunction with regularization on the Reproducing Kernel Hilbert Space formed on kernels leads to a systematic way of parameterizing function classes whilst managing overfitting. We give several examples of hyperkernels (Section 4) and show (Section 5) how they can be used practically. 
Due to space constraints we only consider Support Vector classification.\n\n$^1$ We actually mean badness, since we are minimizing this functional.\n\n2 Quality Functionals\n\nLet $X_{\\mathrm{train}} := \\{x_1, \\ldots, x_m\\}$ denote the set of training data and $Y_{\\mathrm{train}} := \\{y_1, \\ldots, y_m\\}$ the set of corresponding labels, jointly drawn iid from some probability distribution $\\Pr(x, y)$ on $\\mathcal{X} \\times \\mathcal{Y}$. Furthermore, let $X_{\\mathrm{test}}$ and $Y_{\\mathrm{test}}$ denote the corresponding test sets (drawn from the same $\\Pr(x, y)$). Let $X := X_{\\mathrm{train}} \\cup X_{\\mathrm{test}}$ and $Y := Y_{\\mathrm{train}} \\cup Y_{\\mathrm{test}}$.\n\nWe introduce a new class of functionals $Q$ on data which we call quality functionals. Their purpose is to indicate, given a kernel $k$ and the training data $(X_{\\mathrm{train}}, Y_{\\mathrm{train}})$, how suitable the kernel is for explaining the training data.\n\nDefinition 1 (Empirical Quality Functional) Given a kernel $k$ and data $X, Y$, define $Q_{\\mathrm{emp}}[k, X, Y]$ to be an empirical quality functional if it depends on $k$ only via $k(x_i, x_j)$ where $x_i, x_j \\in X$; i.e. if there exists a function $f$ such that $Q_{\\mathrm{emp}}[k, X, Y] = f(K, X, Y)$, where $K = [k(x_i, x_j)]_{i,j}$ is the kernel matrix.\n\nThe basic idea is that $Q_{\\mathrm{emp}}$ could be used to adapt $k$ in a manner such that $Q_{\\mathrm{emp}}$ is minimized, based on this single dataset $X, Y$. Given a sufficiently rich class $\\mathcal{K}$ of kernels $k$ it is in general possible to find a kernel $k^* \\in \\mathcal{K}$ that attains arbitrarily small values of $Q_{\\mathrm{emp}}[k^*, X_{\\mathrm{train}}, Y_{\\mathrm{train}}]$ for any training set. However, it is very unlikely that $Q_{\\mathrm{emp}}[k^*, X_{\\mathrm{test}}, Y_{\\mathrm{test}}]$ would be similarly small in general. Analogously to the standard methods of statistical learning theory, we aim to minimize the expected quality functional:\n\nDefinition 2 (Expected Quality Functional) Suppose $Q_{\\mathrm{emp}}$ is an empirical quality functional. Then\n\n$$Q[k] := \\mathbf{E}_{X,Y}\\big[Q_{\\mathrm{emp}}[k, X, Y]\\big] \\qquad (1)$$\n\nis the expected quality functional, where the expectation is taken with respect to $\\Pr$.\n\nNote the similarity between $Q_{\\mathrm{emp}}[k, X, Y]$ and the empirical risk of an estimator $R_{\\mathrm{emp}}[f, X, Y] = \\frac{1}{m} \\sum_{i=1}^m l(x_i, y_i, f(x_i))$ (where $l$ is a suitable loss function): in both cases we compute the value of a functional which depends on some sample $X, Y$ drawn from $\\Pr(x, y)$ and a function, and in both cases we have\n\n$$Q[k] = \\mathbf{E}_{X,Y}\\big[Q_{\\mathrm{emp}}[k, X, Y]\\big] \\quad \\text{and} \\quad R[f] = \\mathbf{E}_{X,Y}\\big[R_{\\mathrm{emp}}[f, X, Y]\\big]. \\qquad (2)$$\n\nHere $R[f]$ is known as the expected risk. We now present some examples of quality functionals, and derive their exact minimizers whenever possible.\n\nExample 1 (Kernel Target Alignment) This quality functional was introduced in [7] to assess the \u201calignment\u201d of a kernel with training labels. It is defined by\n\n$$Q^{\\mathrm{alignment}}_{\\mathrm{emp}}[k, X_{\\mathrm{train}}, Y_{\\mathrm{train}}] := 1 - \\frac{y^\\top K y}{\\|y\\|_2^2 \\, \\|K\\|_2} \\qquad (3)$$\n\nwhere $y$ denotes the vector of elements of $Y_{\\mathrm{train}}$, $\\|y\\|_2$ denotes the $\\ell_2$ norm of $y$, and $\\|K\\|_2$ is the Frobenius norm: $\\|K\\|_2^2 := \\mathrm{tr}\\, K K^\\top = \\sum_{i,j} K_{ij}^2$. Note that the definition in [7] looks somewhat different, yet it is algebraically identical to (3).\n\nBy decomposing $K$ into its eigensystem, one can see that (3) is minimized if $K = y y^\\top$, in which case\n\n$$Q^{\\mathrm{alignment}}_{\\mathrm{emp}}[k^*, X_{\\mathrm{train}}, Y_{\\mathrm{train}}] = 1 - \\frac{y^\\top y \\, y^\\top y}{\\|y\\|_2^2 \\, \\|y y^\\top\\|_2} = 1 - \\frac{\\|y\\|_2^4}{\\|y\\|_2^2 \\, \\|y\\|_2^2} = 0. \\qquad (4)$$\n\nIt is clear that one cannot expect that $Q^{\\mathrm{alignment}}_{\\mathrm{emp}}[k^*, X, Y] = 0$ for data other than the set chosen to determine $k^*$.\n\nExample 2 (Regularized Risk Functional) If $\\mathcal{H}$ is the Reproducing Kernel Hilbert Space (RKHS) associated with the kernel $k$, the regularized risk functionals have the form\n\n$$R_{\\mathrm{reg}}[f, X_{\\mathrm{train}}, Y_{\\mathrm{train}}] := \\frac{1}{m} \\sum_{i=1}^m l(x_i, y_i, f(x_i)) + \\frac{\\lambda}{2} \\|f\\|_{\\mathcal{H}}^2 \\qquad (5)$$\n\nwhere $\\|f\\|_{\\mathcal{H}}^2$ is the RKHS norm of $f$. By virtue of the representer theorem (see e.g., [4, 8]) we know that the minimizer over $f \\in \\mathcal{H}$ of (5) can be written as a kernel expansion. For a given loss $l$ this leads to the quality functional\n\n$$Q^{\\mathrm{regrisk}}_{\\mathrm{emp}}[k, X_{\\mathrm{train}}, Y_{\\mathrm{train}}] := \\min_{\\alpha \\in \\mathbb{R}^m} \\left[ \\frac{1}{m} \\sum_{i=1}^m l(x_i, y_i, [K\\alpha]_i) + \\frac{\\lambda}{2} \\alpha^\\top K \\alpha \\right]. \\qquad (6)$$\n\nThe minimizer of (6) is more difficult to find, since we have to carry out a double minimization over $K$ and $\\alpha$. First, note that for $K = \\beta y y^\\top$ and $\\alpha = \\frac{1}{\\beta \\|y\\|^2} y$, we have $K\\alpha = y$ and $\\alpha^\\top K \\alpha = \\beta^{-1}$. Thus, for a loss with $l(x_i, y_i, y_i) = 0$, the objective in (6) equals $\\frac{\\lambda}{2\\beta}$, so for sufficiently large $\\beta$ we can make $Q^{\\mathrm{regrisk}}_{\\mathrm{emp}}[k, X_{\\mathrm{train}}, Y_{\\mathrm{train}}]$ arbitrarily close to 0.\n\nEven if we disallow setting $K$ to zero, by setting $\\mathrm{tr}\\, K = 1$, we can determine the minimum of (6) as follows. Set $K = \\frac{1}{\\|z\\|^2} z z^\\top$, where $z \\in \\mathbb{R}^m$, and $\\alpha = z$. Then $K\\alpha = z$ and so\n\n$$\\frac{1}{m} \\sum_{i=1}^m l(x_i, y_i, [K\\alpha]_i) + \\frac{\\lambda}{2} \\alpha^\\top K \\alpha = \\frac{1}{m} \\sum_{i=1}^m l(x_i, y_i, z_i) + \\frac{\\lambda}{2} \\|z\\|_2^2.$$\n\nChoosing each $z_i = \\arg\\min_{\\zeta} \\frac{1}{m} l(x_i, y_i, \\zeta) + \\frac{\\lambda}{2} \\zeta^2$ yields the minimum with respect to $z$. The proof that $K$ is the global minimizer of this quality functional is omitted for brevity.\n\nExample 3 (Negative Log-Posterior) In Gaussian processes, this functional is similar to $R_{\\mathrm{reg}}[f, X_{\\mathrm{train}}, Y_{\\mathrm{train}}]$ since it includes a regularization term (the negative log prior) and a loss term (the negative log-likelihood). In addition, it also includes the log-determinant of $K$ which measures the size of the space spanned by $K$. The quality functional is\n\n$$Q^{\\mathrm{logpost}}_{\\mathrm{emp}}[k, X_{\\mathrm{train}}, Y_{\\mathrm{train}}] := \\min_{f \\in \\mathbb{R}^m} \\left[ -\\sum_{i=1}^m \\log p(y_i | x_i, f_i) + \\frac{1}{2} f^\\top K^{-1} f + \\frac{1}{2} \\log \\det K \\right]. \\qquad (7)$$\n\nNote that any $K$ which does not have full rank will send (7) to $-\\infty$, and thus such cases need to be excluded. When we fix $\\det K = 1$, to exclude the above case, we can set\n\n$$K = \\beta \\|y\\|^{-2} y y^\\top + \\beta^{-\\frac{1}{m-1}} \\left( \\mathbf{1} - \\|y\\|^{-2} y y^\\top \\right) \\qquad (8)$$\n\nwhich leads to $\\det K = 1$. Under the assumption that the minimum of $-\\log p(y_i | x_i, f_i)$ with respect to $f_i$ is attained at $f_i = y_i$, we can see that $\\beta \\to \\infty$ still leads to the overall minimum of $Q^{\\mathrm{logpost}}_{\\mathrm{emp}}[k, X_{\\mathrm{train}}, Y_{\\mathrm{train}}]$.\n\nOther examples, such as cross-validation, leave-one-out estimators, the Luckiness framework, the Radius-Margin bound also have empirical quality functionals which can be arbitrarily minimized.\n\nThe above examples illustrate how many existing methods for assessing the quality of a kernel fit within the quality functional framework. We also saw that given a rich enough class of kernels $\\mathcal{K}$, optimization of $Q_{\\mathrm{emp}}$ over $\\mathcal{K}$ would result in a kernel that would be useless for prediction purposes. This is yet another example of the danger of optimizing too much: there is (still) no free lunch.\n\n3 A Hyper Reproducing Kernel Hilbert Space\n\nWe now introduce a method for optimizing quality functionals in an effective way. The method we propose involves the introduction of a Reproducing Kernel Hilbert Space on the kernel $k$ itself, a \u201cHyper\u201d-RKHS. We begin with the basic properties of an RKHS (see Def 2.9 and Thm 4.2 in [8] and citations for more details).\n\nDefinition 3 (Reproducing Kernel Hilbert Space) Let $\\mathcal{X}$ be a nonempty set (often called the index set) and denote by $\\mathcal{H}$ a Hilbert space of functions $f: \\mathcal{X} \\to \\mathbb{R}$. Then $\\mathcal{H}$ is called a reproducing kernel Hilbert space endowed with the dot product $\\langle \\cdot, \\cdot \\rangle$ (and the norm $\\|f\\| := \\sqrt{\\langle f, f \\rangle}$) if there exists a function $k: \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ satisfying, for all $x, x' \\in \\mathcal{X}$:\n\n1. $k$ has the reproducing property $\\langle f, k(x, \\cdot) \\rangle = f(x)$ for all $f \\in \\mathcal{H}$; in particular, $\\langle k(x, \\cdot), k(x', \\cdot) \\rangle = k(x, x')$.\n2. $k$ spans $\\mathcal{H}$, i.e. $\\mathcal{H} = \\overline{\\mathrm{span}\\{k(x, \\cdot) \\mid x \\in \\mathcal{X}\\}}$, where $\\overline{X}$ is the completion of the set $X$.\n\nThe advantage of optimization in an RKHS is that under certain conditions the optimal solutions can be found as the linear combination of a finite number of basis functions, regardless of the dimensionality of the space $\\mathcal{H}$, as can be seen in the theorem below.\n\nTheorem 4 (Representer Theorem) Denote by $\\Omega: [0, \\infty) \\to \\mathbb{R}$ a strictly monotonic increasing function, by $\\mathcal{X}$ a set, and by $l: (\\mathcal{X} \\times \\mathbb{R}^2)^m \\to \\mathbb{R} \\cup \\{\\infty\\}$ an arbitrary loss function. Then each minimizer $f \\in \\mathcal{H}$ of the regularized risk\n\n$$l\\big((x_1, y_1, f(x_1)), \\ldots, (x_m, y_m, f(x_m))\\big) + \\Omega(\\|f\\|_{\\mathcal{H}}) \\qquad (9)$$\n\nadmits a representation of the form $f(x) = \\sum_{i=1}^m \\alpha_i k(x_i, x)$.\n\nThe above definition allows us to define an RKHS on kernels $\\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$, simply by introducing $\\underline{\\mathcal{X}} := \\mathcal{X} \\times \\mathcal{X}$ and by treating $k$ as functions $k: \\underline{\\mathcal{X}} \\to \\mathbb{R}$:\n\nDefinition 5 (Hyper Reproducing Kernel Hilbert Space) Let $\\mathcal{X}$ be a nonempty set and let $\\underline{\\mathcal{X}} := \\mathcal{X} \\times \\mathcal{X}$ (the compounded index set). Then the Hilbert space $\\underline{\\mathcal{H}}$ of functions $k: \\underline{\\mathcal{X}} \\to \\mathbb{R}$, endowed with a dot product $\\langle \\cdot, \\cdot \\rangle$ (and the norm $\\|k\\| = \\sqrt{\\langle k, k \\rangle}$) is called a Hyper Reproducing Kernel Hilbert Space if there exists a hyperkernel $\\underline{k}: \\underline{\\mathcal{X}} \\times \\underline{\\mathcal{X}} \\to \\mathbb{R}$ with the following properties:\n\n1. $\\underline{k}$ has the reproducing property $\\langle k, \\underline{k}(\\underline{x}, \\cdot) \\rangle = k(\\underline{x})$ for all $k \\in \\underline{\\mathcal{H}}$; in particular, $\\langle \\underline{k}(\\underline{x}, \\cdot), \\underline{k}(\\underline{x}', \\cdot) \\rangle = \\underline{k}(\\underline{x}, \\underline{x}')$.\n2. $\\underline{k}$ spans $\\underline{\\mathcal{H}}$, i.e. $\\underline{\\mathcal{H}} = \\overline{\\mathrm{span}\\{\\underline{k}(\\underline{x}, \\cdot) \\mid \\underline{x} \\in \\underline{\\mathcal{X}}\\}}$.\n3. For any fixed $\\underline{x} \\in \\underline{\\mathcal{X}}$ the hyperkernel $\\underline{k}$ is a kernel in its second argument, i.e. for any fixed $\\underline{x} \\in \\underline{\\mathcal{X}}$, the function $k(x, x') := \\underline{k}(\\underline{x}, (x, x'))$ with $x, x' \\in \\mathcal{X}$ is a kernel.\n\nWhat distinguishes $\\underline{\\mathcal{H}}$ from a normal RKHS is the particular form of its index set ($\\underline{\\mathcal{X}} = \\mathcal{X}^2$) and the additional condition on $\\underline{k}$ to be a kernel in its second argument for any fixed first argument. This condition somewhat limits the choice of possible kernels. On the other hand, it allows for simple optimization algorithms which consider kernels $k \\in \\underline{\\mathcal{H}}$, which are in the convex cone of $\\underline{k}$. Analogously to the definition of the regularized risk functional (5), we define the regularized quality functional:\n\n$$Q_{\\mathrm{reg}}[k, X, Y] := Q_{\\mathrm{emp}}[k, X, Y] + \\frac{\\lambda_Q}{2} \\|k\\|_{\\underline{\\mathcal{H}}}^2 \\qquad (10)$$\n\nwhere $\\lambda_Q > 0$ is a regularization constant and $\\|k\\|_{\\underline{\\mathcal{H}}}^2$ denotes the RKHS norm in $\\underline{\\mathcal{H}}$. Minimization of $Q_{\\mathrm{reg}}$ is less prone to overfitting than minimizing $Q_{\\mathrm{emp}}$, since the regularization term $\\frac{\\lambda_Q}{2} \\|k\\|_{\\underline{\\mathcal{H}}}^2$ effectively controls the complexity of the class of kernels under consideration. Regularizers other than $\\frac{\\lambda_Q}{2} \\|k\\|_{\\underline{\\mathcal{H}}}^2$ are also possible. The question arising immediately from (10) is how to minimize the regularized quality functional efficiently. In the following we show that the minimum can be found as a linear combination of hyperkernels.\n\nCorollary 6 (Representer Theorem for Hyper-RKHS) Let $\\underline{\\mathcal{H}}$ be a hyper-RKHS and denote by $\\Omega: [0, \\infty) \\to \\mathbb{R}$ a strictly monotonic increasing function, by $\\mathcal{X}$ a set, and by $Q$ an arbitrary quality functional. Then each minimizer $k \\in \\underline{\\mathcal{H}}$ of the regularized quality functional\n\n$$Q[k, X, Y] + \\frac{\\lambda_Q}{2} \\|k\\|_{\\underline{\\mathcal{H}}}^2 \\qquad (11)$$\n\nadmits a representation of the form $k(x, x') = \\sum_{i,j=1}^m \\beta_{ij} \\underline{k}((x_i, x_j), (x, x'))$.\n\nProof All we need to do is rewrite (11) so that it satisfies the conditions of Theorem 4. Let $\\underline{x}_{ij} := (x_i, x_j)$. Then $Q[k, X, Y]$ has the properties of a loss function, as it only depends on $k$ via its values at $\\underline{x}_{ij}$. Furthermore, $\\frac{\\lambda_Q}{2} \\|k\\|_{\\underline{\\mathcal{H}}}^2$ is an RKHS regularizer, so the representer theorem applies and the expansion of $k$ follows.\n\nThis result shows that even though we are optimizing over an entire (potentially infinite dimensional) Hilbert space of kernels, we are able to find the optimal solution by choosing among a finite dimensional subspace. The dimension required ($m^2$) is, not surprisingly, significantly larger than the number of kernels required in a kernel function expansion, which makes a direct approach possible only for small problems. However, sparse expansion techniques, such as [9, 8], can be used to make the problem tractable in practice.\n\n4 Examples of Hyperkernels\n\nHaving introduced the theoretical basis of the Hyper-RKHS, we need to answer the question whether practically useful $\\underline{k}$ exist which satisfy the conditions of Definition 5. We address this question by giving a set of general recipes for building such kernels.\n\nExample 4 (Power Series Construction) Denote by $k$ a positive semidefinite kernel, and by $g: \\mathbb{R} \\to \\mathbb{R}$ a function with positive Taylor expansion coefficients $g(\\xi) = \\sum_{i=0}^\\infty c_i \\xi^i$ and convergence radius $R$. Then for $|k(x, x')| < \\sqrt{R}$ we have that\n\n$$\\underline{k}(\\underline{x}, \\underline{x}') := g\\big(k(\\underline{x}) k(\\underline{x}')\\big) = \\sum_{i=0}^\\infty c_i \\big(k(\\underline{x}) k(\\underline{x}')\\big)^i \\qquad (12)$$\n\nis a hyperkernel: for any fixed $\\underline{x}$, $\\underline{k}(\\underline{x}, (x, x'))$ is a sum of kernel functions, hence it is a kernel itself (since $k^p(x, x')$ is a kernel if $k$ is). To show that $\\underline{k}$ is a kernel, note that $\\underline{k}(\\underline{x}, \\underline{x}') = \\langle \\Phi(\\underline{x}), \\Phi(\\underline{x}') \\rangle$, where $\\Phi(\\underline{x}) := (\\sqrt{c_0}, \\sqrt{c_1} k^1(\\underline{x}), \\sqrt{c_2} k^2(\\underline{x}), \\ldots)$.\n\nExample 5 (Harmonic Hyperkernel) A special case of (12) is the harmonic hyperkernel: Denote by $k$ a kernel with $k: \\mathcal{X} \\times \\mathcal{X} \\to [0, 1]$ (e.g., RBF kernels satisfy this property), and set $c_i := (1 - \\lambda_h) \\lambda_h^i$ for some $0 < \\lambda_h < 1$. Then we have\n\n$$\\underline{k}(\\underline{x}, \\underline{x}') = (1 - \\lambda_h) \\sum_{i=0}^\\infty \\big(\\lambda_h k(\\underline{x}) k(\\underline{x}')\\big)^i = \\frac{1 - \\lambda_h}{1 - \\lambda_h k(\\underline{x}) k(\\underline{x}')}. \\qquad (13)$$\n\nExample 6 (Gaussian Harmonic Hyperkernel) For $k(x, x') = \\exp(-\\sigma^2 \\|x - x'\\|^2)$,\n\n$$\\underline{k}((x, x'), (x'', x''')) = \\frac{1 - \\lambda_h}{1 - \\lambda_h \\exp\\big(-\\sigma^2 (\\|x - x'\\|^2 + \\|x'' - x'''\\|^2)\\big)}. \\qquad (14)$$\n\nFor $\\lambda_h \\to 1$, $\\underline{k}$ converges to $\\delta_{\\underline{x}, \\underline{x}'}$; that is, the expression $\\|k\\|_{\\underline{\\mathcal{H}}}^2$ converges to the Frobenius norm of $k$ on $X \\times X$.\n\nWe can find further hyperkernels simply by consulting tables on power series of functions. Table 1 contains a list of suitable expansions. Recall that expansions such as (12) were mainly chosen for computational convenience, in particular whenever it is not clear which particular class of kernels would be useful for the expansion.\n\n[Table 1: Examples of Hyperkernels. The table lists power series expansions; its entries were lost in extraction.]\n\nExample 7 (Explicit Construction) If we know or have a reasonable guess as to which kernels could be potentially relevant (e.g., a range of scales of kernel width, polynomial degrees, etc.), we may begin with a set of candidate kernels, say $k_1, \\ldots, k_n$, and define\n\n$$\\underline{k}(\\underline{x}, \\underline{x}') := \\sum_{i=1}^n c_i k_i(\\underline{x}) k_i(\\underline{x}'), \\quad c_i \\ge 0. \\qquad (15)$$\n\nClearly $\\underline{k}$ is a hyperkernel, since $\\underline{k}(\\underline{x}, \\underline{x}') = \\langle \\Phi(\\underline{x}), \\Phi(\\underline{x}') \\rangle$, where $\\Phi(\\underline{x}) := (\\sqrt{c_1} k_1(\\underline{x}), \\sqrt{c_2} k_2(\\underline{x}), \\ldots, \\sqrt{c_n} k_n(\\underline{x}))$.\n\n5 An Application: Minimization of the Regularized Risk\n\nRecall that in the case of the Regularized Risk functional, the regularized quality optimization problem takes on the form\n\n$$\\min_{k \\in \\underline{\\mathcal{H}}} \\min_{f \\in \\mathcal{H}_k} \\left[ \\frac{1}{m} \\sum_{i=1}^m l(x_i, y_i, f(x_i)) + \\frac{\\lambda}{2} \\|f\\|_{\\mathcal{H}_k}^2 + \\frac{\\lambda_Q}{2} \\|k\\|_{\\underline{\\mathcal{H}}}^2 \\right]. \\qquad (16)$$\n\nFor $f = \\sum_i \\alpha_i k(x_i, x)$, the second term $\\|f\\|_{\\mathcal{H}_k}^2 = \\alpha^\\top K \\alpha$ is a linear function of $k$. Given a convex loss function $l$, the regularized quality functional (16) is convex in $k$. The corresponding regularized quality functional is:\n\n$$Q^{\\mathrm{regrisk}}_{\\mathrm{reg}}[k, X, Y] = Q^{\\mathrm{regrisk}}_{\\mathrm{emp}}[k, X, Y] + \\frac{\\lambda_Q}{2} \\|k\\|_{\\underline{\\mathcal{H}}}^2. \\qquad (17)$$\n\nFor fixed $k$, the problem can be formulated as a constrained minimization problem in $f$, and subsequently expressed in terms of the Lagrange multipliers $\\alpha$. However, this minimum depends on $k$, and for efficient minimization we would like to compute the derivatives with respect to $k$. The following lemma tells us how (it is an extension of a result in [3] and we omit the proof for brevity):\n\nLemma 7 Let $x \\in \\mathbb{R}^m$ and denote by $f(x, \\theta), c_i: \\mathbb{R}^m \\to \\mathbb{R}$ convex functions, where $f$ is parameterized by $\\theta$. Let $R(\\theta)$ be the minimum of the following optimization problem (and denote by $x(\\theta)$ its minimizer):\n\n$$\\text{minimize}_{x \\in \\mathbb{R}^m} \\; f(x, \\theta) \\quad \\text{subject to } c_i(x) \\le 0 \\text{ for all } 1 \\le i \\le n. \\qquad (18)$$\n\nThen $\\partial_\\theta^j R(\\theta) = D_2^j f(x(\\theta), \\theta)$, where $j \\in \\mathbb{N}$ and $D_2$ denotes the derivative with respect to the second argument of $f$.\n\nSince the minimizer of (17) can be written as a kernel expansion (by the representer theorem for Hyper-RKHS), the optimal regularized quality functional can be written as (using the soft margin loss and $\\underline{K}_{ijpq} := \\underline{k}((x_i, x_j), (x_p, x_q))$):\n\n$$Q^{\\mathrm{regrisk}}_{\\mathrm{reg}}[\\alpha, \\beta, X, Y] = \\frac{1}{m} \\sum_{i=1}^m \\max\\Big(0, 1 - y_i \\sum_{j,p,q=1}^m \\alpha_j \\beta_{pq} \\underline{K}_{ijpq}\\Big) + \\frac{\\lambda}{2} \\sum_{i,j,p,q=1}^m \\alpha_i \\alpha_j \\beta_{pq} \\underline{K}_{ijpq} + \\frac{\\lambda_Q}{2} \\sum_{i,j,p,q=1}^m \\beta_{ij} \\beta_{pq} \\underline{K}_{ijpq}. \\qquad (19)$$\n\nMinimization of (19) is achieved by alternating between minimization over $\\alpha$ for fixed $\\beta$ (this is a quadratic optimization problem), and subsequently minimization over $\\beta$ (with $\\beta_{ij} \\ge 0$ to ensure positivity of the kernel matrix) for fixed $\\alpha$.\n\nLow Rank Approximation While being finite in the number of parameters (despite the optimization over two possibly infinite dimensional Hilbert spaces $\\mathcal{H}$ and $\\underline{\\mathcal{H}}$), (19) still presents a formidable optimization problem in practice (we have $m^2$ coefficients for $\\beta$). For an explicit expansion of type (15) we can optimize in the expansion coefficients of $k_i(\\underline{x}) k_i(\\underline{x}')$ directly, which means that we simply have a quality functional with an $\\ell_2$ penalty on the expansion coefficients. Such an approach is recommended if there are few terms in (15). In the general case (or if the number of terms $n$ is large), we resort to a low-rank approximation, as described in [9, 8]. This means that we pick from $\\underline{k}((x_i, x_j), \\cdot)$ with $1 \\le i, j \\le m$ a small fraction of terms which approximate $\\underline{k}$ on $X \\times X$ sufficiently well.\n\n6 Experimental Results and Summary\n\nExperimental Setup To test our claims of kernel adaptation via regularized quality functionals we performed preliminary tests on datasets from the UCI repository (Pima, Ionosphere, Wisconsin diagnostic breast cancer) and the USPS database of handwritten digits (\u20196\u2019 vs. \u20199\u2019). The datasets were split into 60% training data and 40% test data, except for the USPS data, where the provided split was used. The experiments were repeated over 200 random 60/40 splits. We deliberately did not attempt to tune parameters and instead made the following choices uniformly for all four sets:\n\n- The kernel width $\\sigma$ was set to $\\sigma^{-1} = 100\\,d$, where $d$ is the dimensionality of the data. We deliberately chose a too large value in comparison with the usual rules of thumb [8] to avoid good default kernels.\n- $\\lambda$ was adjusted so that $\\frac{1}{\\lambda m} = 100$ (that is, $C = 100$ in the Vapnik-style parameterization of SVMs). This has commonly been reported to yield good results.\n- $\\lambda_h$ for the Gaussian Harmonic Hyperkernel was chosen to be 0.6 throughout, giving adequate coverage over various kernel widths in (13) (small $\\lambda_h$ focus almost exclusively on wide kernels, $\\lambda_h$ close to 1 will treat all widths equally).\n- The hyperkernel regularization was set to $\\lambda_Q = 1$.\n\nWe compared the results with the performance of a generic Support Vector Machine with the same values chosen for $\\sigma$ and $\\lambda$ and one for which $\\lambda$ and $\\sigma$ had been hand-tuned using cross-validation.\n\nResults Despite the fact that we did not try to tune the parameters we were able to achieve highly competitive results as shown in Table 2. It is also worth noticing that the low-rank decomposition of the hyperkernel matrix typically contained fewer than 10 hyperkernels, thus rendering the optimization problem not much more costly than a standard Support Vector Machine (even with a very high quality [...] approximation of $\\underline{K}$), and that after the optimization of (19), typically less than [...] were being used. This dramatically reduced the computational burden.\n\nUsing the same non-optimized parameters for different data sets we achieved results comparable to other recent work on classification such as boosting, optimized SVMs, and kernel target alignment [10, 11, 7] (note that we use a much smaller part of the data for training: only 60% rather than 90%). Results based on $Q_{\\mathrm{reg}}$ are comparable to hand tuned SVMs (right most column), except for the ionosphere data. We suspect that this is due to the small training sample.\n\nTable 2: Training and test error in percent\n\nData(size) | $R_{\\mathrm{reg}}$ Train | $R_{\\mathrm{reg}}$ Test | $Q_{\\mathrm{reg}}$ Train | $Q_{\\mathrm{reg}}$ Test | Best in [10, 11] | Tuned SVM\npima(768) | 25.2 \u00b1 2.0 | 26.2 \u00b1 3.3 | 22.2 \u00b1 1.4 | 23.2 \u00b1 2.0 | 23.5 | 22.9 \u00b1 2.0\nionosph(351) | 13.4 \u00b1 2.0 | 16.5 \u00b1 3.4 | 10.9 \u00b1 1.5 | 13.4 \u00b1 2.4 | 6.2 | 6.1 \u00b1 1.9\nwdbc(569) | 5.7 \u00b1 0.8 | 5.7 \u00b1 1.3 | 2.1 \u00b1 0.6 | 2.7 \u00b1 1.0 | 3.2 | 2.5 \u00b1 0.9\nusps(1424) | 2.1 | 3.4 | 1.5 | 2.8 | NA | 2.5\n\nSummary and Outlook The regularized quality functional allows the systematic solution of problems associated with the choice of a kernel. Quality criteria that can be used include target alignment, regularized risk and the log posterior. 
The regularization implicit in our approach allows the control of overfitting that occurs if one optimizes over too large a choice of kernels.\nA very promising aspect of the current work is that it opens the way to theoretical analyses of the price one pays by optimizing over a larger set of kernels. Current and future research is devoted to working through this analysis and subsequently developing methods for the design of good hyperkernels.\n\nAcknowledgements This work was supported by a grant from the Australian Research Council. The authors thank Grace Wahba for helpful comments and suggestions.\n\nReferences\n[1] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. In ICML. Morgan Kaufmann, 2002.\n[2] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer Academic, 1998.\n[3] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing kernel parameters for support vector machines. Machine Learning, 2002. Forthcoming.\n[4] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.\n[5] K. Crammer, J. Keshet, and Y. Singer. Kernel design using boosting. In Advances in Neural Information Processing Systems 15, 2002. In press.\n[6] O. Bousquet and D. Herrmann. On the complexity of learning the kernel matrix. In Advances in Neural Information Processing Systems 15, 2002. In press.\n[7] N. Cristianini, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel alignment. Technical Report NC2-TR-2001-087, NeuroCOLT, http://www.neurocolt.com, 2001.\n[8] B. Sch\u00f6lkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.\n[9] S. Fine and K. Scheinberg. 
Efficient SVM training using low-rank kernel representation. Technical report, IBM Watson Research Center, New York, 2000.\n[10] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148\u2013156. Morgan Kaufmann Publishers, 1996.\n[11] G. R\u00e4tsch, T. Onoda, and K. R. M\u00fcller. Soft margins for AdaBoost. Machine Learning, 42(3):287\u2013320, 2001.\n", "award": [], "sourceid": 2193, "authors": [{"given_name": "Cheng", "family_name": "Ong", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}