{"title": "Probabilistic Methods for Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 349, "page_last": 355, "abstract": null, "full_text": "Probabilistic methods for Support Vector \n\nMachines \n\nDepartment of Mathematics, King's College London \n\nStrand, London WC2R 2LS, U.K. Email: peter.sollich@kcl.ac.uk \n\nPeter Sollich \n\nAbstract \n\nI describe a framework for interpreting Support Vector Machines \n(SVMs) as maximum a posteriori (MAP) solutions to inference \nproblems with Gaussian Process priors. This can provide intuitive \nguidelines for choosing a 'good' SVM kernel. It can also assign \n(by evidence maximization) optimal values to parameters such as \nthe noise level C which cannot be determined unambiguously from \nproperties of the MAP solution alone (such as cross-validation er(cid:173)\nror) . I illustrate this using a simple approximate expression for the \nSVM evidence. Once C has been determined, error bars on SVM \npredictions can also be obtained. \n\n1 Support Vector Machines: A probabilistic framework \n\nSupport Vector Machines (SVMs) have recently been the subject of intense re(cid:173)\nsearch activity within the neural networks community; for tutorial introductions \nand overviews of recent developments see [1, 2, 3]. One of the open questions that \nremains is how to set the 'tunable' parameters of an SVM algorithm: While meth(cid:173)\nods for choosing the width of the kernel function and the noise parameter C (which \ncontrols how closely the training data are fitted) have been proposed [4, 5] (see \nalso, very recently, [6]), the effect of the overall shape of the kernel function remains \nimperfectly understood [1]. Error bars (class probabilities) for SVM predictions -\nimportant for safety-critical applications, for example -\nare also difficult to obtain. \nIn this paper I suggest that a probabilistic interpretation of SVMs could be used to \ntackle these problems. 
It shows that the SVM kernel defines a prior over functions on the input space, avoiding the need to think in terms of high-dimensional feature spaces. It also allows one to define quantities such as the evidence (likelihood) for a set of hyperparameters (C, kernel amplitude K_0, etc.). I give a simple approximation to the evidence which can then be maximized to set such hyperparameters. The evidence is sensitive to the values of C and K_0 individually, in contrast to properties (such as cross-validation error) of the deterministic solution, which depend only on the product CK_0. It can therefore be used to assign an unambiguous value to C, from which error bars can be derived.

I focus on two-class classification problems. Suppose we are given a set D of n training examples (x_i, y_i) with binary outputs y_i = ±1 corresponding to the two classes. The basic SVM idea is to map the inputs x onto vectors φ(x) in some high-dimensional feature space; ideally, in this feature space, the problem should be linearly separable. Suppose first that this is true. Among all decision hyperplanes w·φ(x) + b = 0 which separate the training examples (i.e. which obey y_i(w·φ(x_i) + b) > 0 for all x_i ∈ D_x, D_x being the set of training inputs), the SVM solution is chosen as the one with the largest margin, i.e. the largest minimal distance from any of the training examples. Equivalently, one specifies the margin to be one and minimizes the squared length of the weight vector ||w||² [1], subject to the constraint that y_i(w·φ(x_i) + b) ≥ 1 for all i. If the problem is not linearly separable, 'slack variables' ξ_i ≥ 0 are introduced which measure how much the margin constraints are violated; one writes y_i(w·φ(x_i) + b) ≥ 1 - ξ_i. To control the amount of slack allowed, a penalty term C Σ_i ξ_i is then added to the objective function ½||w||², with a penalty coefficient C.
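As a concrete check of this soft-margin objective, the sketch below (plain Python; the toy data, weight vector, and function name are mine, not the paper's) evaluates ½||w||² + C Σ_i ξ_i, computing each slack variable as ξ_i = max(0, 1 - y_i(w·x_i + b)) and taking the feature map to be the identity, φ(x) = x:

```python
def primal_objective(w, b, C, data):
    """Soft-margin SVM objective 0.5*||w||^2 + C * sum_i xi_i,
    with slack xi_i = max(0, 1 - y_i * (w . x_i + b)).
    The feature map is taken to be the identity, phi(x) = x."""
    norm_sq = sum(wj * wj for wj in w)
    slack = sum(max(0.0, 1.0 - y * (sum(wj * xj for wj, xj in zip(w, x)) + b))
                for x, y in data)
    return 0.5 * norm_sq + C * slack

# three toy points in the unit square with labels +/-1
data = [([0.0, 1.0], +1), ([1.0, 0.0], -1), ([0.2, 0.6], +1)]
# only the third point violates the margin and needs slack
print(primal_objective([-2.0, 2.0], 0.0, 1.0, data))
```

Points with y_i(w·x_i + b) ≥ 1 contribute zero slack, so only margin violations are penalized.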
Training examples with y_i(w·φ(x_i) + b) ≥ 1 (and hence ξ_i = 0) incur no penalty; all others contribute C[1 - y_i(w·φ(x_i) + b)] each. This gives the SVM optimization problem: find w and b to minimize

    ½||w||² + C Σ_i l(y_i[w·φ(x_i) + b])    (1)

where l(z) is the (shifted) 'hinge loss', l(z) = (1 - z)Θ(1 - z), with Θ the step function.

To interpret SVMs probabilistically, one can regard (1) as defining a (negative) log-posterior probability for the parameters w and b of the SVM, given a training set D. The first term gives the prior Q(w, b) ∝ exp(-½||w||² - ½b²B⁻²). This is a Gaussian prior on w; the components of w are uncorrelated with each other and have unit variance. I have chosen a Gaussian prior on b with variance B²; the flat prior implied by (1) can be recovered¹ by letting B → ∞. Because only the 'latent variable' values θ(x) = w·φ(x) + b, rather than w and b individually, appear in the second, data-dependent term of (1), it makes sense to express the prior directly as a distribution over these. The θ(x) have a joint Gaussian distribution because the components of w do, with covariances given by ⟨θ(x)θ(x')⟩ = ⟨(φ(x)·w)(w·φ(x'))⟩ + B² = φ(x)·φ(x') + B². The SVM prior is therefore simply a Gaussian process (GP) over the functions θ, with covariance function K(x, x') = φ(x)·φ(x') + B² (and zero mean). This correspondence between SVMs and GPs has been noted by a number of authors, e.g. [6, 7, 8, 9, 10].

The second term in (1) becomes a (negative) log-likelihood if we define the probability of obtaining output y for a given x (and θ) as

    Q(y = ±1 | x, θ) = κ(C) exp[-C l(yθ(x))]    (2)

We set κ(C) = 1/[1 + exp(-2C)] to ensure that the probabilities for y = ±1 never add up to a value larger than one.
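The role of κ(C) can be verified directly: with this normalization the two class probabilities of (2) sum to one exactly when |θ(x)| = 1, and to less than one otherwise. A minimal numerical check (plain Python; the function names are mine):

```python
import math

def hinge(z):
    # shifted hinge loss l(z) = (1 - z) * Theta(1 - z)
    return max(0.0, 1.0 - z)

def kappa(C):
    # normalization constant kappa(C) = 1 / (1 + exp(-2C))
    return 1.0 / (1.0 + math.exp(-2.0 * C))

def q_output(y, theta_x, C):
    # likelihood (2): Q(y | x, theta) = kappa(C) exp[-C l(y theta(x))]
    return kappa(C) * math.exp(-C * hinge(y * theta_x))

C = 2.0
for theta_x in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    total = q_output(+1, theta_x, C) + q_output(-1, theta_x, C)
    assert total <= 1.0 + 1e-12   # the two probabilities never exceed one
    # total equals 1 exactly only at |theta(x)| = 1
print("bound holds on all test points")
```

At θ(x) = 1, for instance, the sum is κ(C)(1 + e^{-2C}) = 1 by construction, while inside the gap |θ(x)| < 1 both outputs are penalized and the sum falls below one.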
The likelihood for the complete data set is then Q(D|θ) = ∏_i Q(y_i|x_i, θ) Q(x_i), with some input distribution Q(x) which remains essentially arbitrary at this point. However, this likelihood function is not normalized, because

    ν(θ(x)) = Q(1|x, θ) + Q(-1|x, θ) = κ(C){exp[-C l(θ(x))] + exp[-C l(-θ(x))]} < 1

except when |θ(x)| = 1. To remedy this, I write the actual probability model as

    P(D, θ) = Q(D|θ) Q(θ) / N(D).    (3)

Its posterior probability P(θ|D) ∝ Q(D|θ) Q(θ) is independent of the normalization factor N(D); by construction, the MAP value of θ is therefore the SVM solution. The simplest choice of N(D) which normalizes P(D, θ) is D-independent:

    N = N_n = ∫dθ Q(θ) N^n(θ),    N(θ) = ∫dx Q(x) ν(θ(x)).    (4)

Conceptually, this corresponds to the following procedure of sampling from P(D, θ): first, sample θ from the GP prior Q(θ). Then, for each data point, sample x from Q(x). Assign outputs y = ±1 with probability Q(y|x, θ), respectively; with the remaining probability 1 - ν(θ(x)) (the 'don't know' class probability in [11]), restart the whole process by sampling a new θ. Because ν(θ(x)) is smallest² inside the 'gap' |θ(x)| < 1, functions θ with many values in this gap are less likely to 'survive' until a data set of the required size n is built up. This is reflected in an n-dependent factor in the (effective) prior, which follows from (3,4) as P(θ) ∝ Q(θ)N^n(θ). Correspondingly, in the likelihood

    P(y|x, θ) = Q(y|x, θ)/ν(θ(x)),    P(x|θ) ∝ Q(x) ν(θ(x))    (5)

(which is now normalized over y = ±1), the input density is influenced by the function θ itself; it is reduced in the 'uncertainty gaps' |θ(x)| < 1.

¹In the probabilistic setting, it actually makes more sense to keep B finite (and small); for B → ∞, only training sets with all y_i equal have nonzero probability.

To summarize, eqs.
(2-5) define a probabilistic data generation model whose MAP solution θ* = argmax P(θ|D) for a given data set D is identical to a standard SVM. The effective prior P(θ) is a GP prior modified by a data set size-dependent factor; the likelihood (5) defines not just a conditional output distribution, but also an input distribution (relative to some arbitrary Q(x)). All relevant properties of the feature space are encoded in the underlying GP prior Q(θ), with covariance matrix equal to the kernel K(x, x'). The log-posterior of the model

    ln P(θ|D) = -½ ∫dx dx' θ(x) K⁻¹(x, x') θ(x') - C Σ_i l(y_i θ(x_i)) + const    (6)

is just a transformation of (1) from w and b to θ. By differentiating w.r.t. the θ(x) for non-training inputs, one sees that its maximum is of the standard form θ*(x) = Σ_i α_i y_i K(x, x_i); for y_i θ*(x_i) > 1, < 1, and = 1 one has α_i = 0, α_i = C and α_i ∈ [0, C] respectively. I will call the training inputs x_i in the last group marginal; they form a subset of all support vectors (the x_i with α_i > 0). The sparseness of the SVM solution (often the number of support vectors is ≪ n) comes from the fact that the hinge loss l(z) is constant for z > 1. This contrasts with other uses of GP models for classification (see e.g. [12]), where instead of the likelihood (2) a sigmoidal (often logistic) 'transfer function' with nonzero gradient everywhere is used. Moreover, in the noise free limit, the sigmoidal transfer function becomes a step function, and the MAP values θ* will tend to the trivial solution θ*(x) = 0. This illuminates from an alternative point of view why the margin (the 'shift' in the hinge loss) is important for SVMs.

Within the probabilistic framework, the main effect of the kernel in SVM classification is to change the properties of the underlying GP prior Q(θ) in P(θ) ∝ Q(θ)N^n(θ). Fig. 1 illustrates this with samples from Q(θ) for three different types of kernels. The effect of the kernel on the smoothness of decision boundaries, and the typical sizes of decision regions and 'uncertainty gaps' between them, can clearly be seen. When prior knowledge about these properties of the target is available, the probabilistic framework can therefore provide intuition for a suitable choice of kernel. Note that the samples in Fig. 1 are from Q(θ), rather than from the effective prior P(θ). One finds, however, that the n-dependent factor N^n(θ) does not change the properties of the prior qualitatively³.

²This is true for C > ln 2. For smaller C, ν(θ(x)) is actually higher in the gap, and the model makes less intuitive sense.

Figure 1: Samples from SVM priors; the input space is the unit square [0, 1]². 3d plots are samples θ(x) from the underlying Gaussian process prior Q(θ). 2d greyscale plots represent the output distributions obtained when θ(x) is used in the likelihood model (5) with C = 2; the greyscale indicates the probability of y = 1 (black: 0, white: 1). (a,b) Exponential (Ornstein-Uhlenbeck) kernel/covariance function K_0 exp(-|x - x'|/l), giving rough θ(x) and decision boundaries. Length scale l = 0.1, K_0 = 10. (c) Same with K_0 = 1, i.e. with a reduced amplitude of θ(x); note how, in a sample from the prior corresponding to this new kernel, the grey 'uncertainty gaps' (given roughly by |θ(x)| < 1) between regions of definite outputs (black/white) have widened. (d,e) As first row, but with squared exponential (RBF) kernel K_0 exp[-(x - x')²/(2l²)], yielding smooth θ(x) and decision boundaries. (f) Changing l to 0.05 (while holding K_0 fixed at 10) and taking a new sample shows how this parameter sets the typical length scale for decision regions. (g,h) Polynomial kernel (1 + x·x')^p, with p = 5; (i) p = 10. The absence of a clear length scale and the widely differing magnitudes of θ(x) in the bottom left (x = [0, 0]) and top right (x = [1, 1]) corners of the square make this kernel less plausible from a probabilistic point of view.

2 Evidence and error bars

Beyond providing intuition about SVM kernels, the probabilistic framework discussed above also makes it possible to apply Bayesian methods to SVMs. For example, one can define the evidence, i.e. the likelihood of the data D, given the model as specified by the hyperparameters C and (some parameters defining) K(x, x'). It follows from (3) as

    P(D) = Q(D)/N^n,    Q(D) = ∫dθ Q(D|θ) Q(θ).    (7)

The factor Q(D) is the 'naive' evidence derived from the unnormalized likelihood model; the correction factor N^n ensures that P(D) is normalized over all data sets. This is crucial in order to guarantee that optimization of the (log) evidence gives optimal hyperparameter values at least on average (M. Opper, private communication). Clearly, P(D) will in general depend on C and K(x, x') separately. The actual SVM solution, on the other hand, i.e. the MAP values θ*, can be seen from (6) to depend on the product C K(x, x') only. Properties of the deterministically trained SVM alone (such as test or cross-validation error) cannot therefore be used to determine C and the resulting class probabilities (5) unambiguously.

I now outline how a simple approximation to the naive evidence can be derived. Q(D) is given by an integral over all θ(x), with the log integrand being (6) up to an additive constant. After integrating out the Gaussian distributed θ(x) with x ∉ D_x, an intractable integral over the θ(x_i) remains.
However, progress can be made by expanding the log integrand around its maximum θ*(x_i). For all non-marginal training inputs this is equivalent to Laplace's approximation: the first terms in the expansion are quadratic in the deviations from the maximum and give simple Gaussian integrals. For the remaining θ(x_i), the leading terms in the log integrand vary linearly near the maximum. Couplings between these θ(x_i) only appear at the next (quadratic) order; discarding these terms as subleading, the integral factorizes over the θ(x_i) and can be evaluated. The end result of this calculation is:

    ln Q(D) ≈ -½ Σ_i y_i
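The Gaussian-integral step used above for the non-marginal inputs is just Laplace's method. A one-dimensional toy check (plain Python; the integrand g below is an illustrative stand-in of mine, not the paper's log integrand):

```python
import math

def laplace_log_integral(g, d2g, theta_star):
    # Laplace approximation: log of integral exp(-g(theta)) d theta
    #   ~ -g(theta*) + 0.5 * log(2*pi / g''(theta*)),  theta* = argmin g
    return -g(theta_star) + 0.5 * math.log(2.0 * math.pi / d2g(theta_star))

def log_integral_numeric(g, lo, hi, steps=200000):
    # brute-force quadrature over a wide interval, for comparison
    h = (hi - lo) / steps
    total = sum(math.exp(-g(lo + k * h)) for k in range(steps + 1))
    return math.log(total * h)

# toy integrand: g(theta) = 4*(theta - 0.3)^2, sharply peaked like the
# quadratic expansion around theta*(x_i) for a non-marginal training input
g = lambda th: 4.0 * (th - 0.3) ** 2
approx = laplace_log_integral(g, lambda th: 8.0, 0.3)
exact = log_integral_numeric(g, -10.0, 10.0)
assert abs(approx - exact) < 1e-3   # agreement is exact here, g being quadratic
```

For the marginal inputs, where the log integrand varies linearly near its maximum, this quadratic step does not apply, which is why they are treated separately in the text.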