{"title": "A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 467, "page_last": 474, "abstract": "", "full_text": "A Probabilistic Interpretation of SVMs with an\n\nApplication to Unbalanced Classi\ufb01cation\n\nYves Grandvalet \u2217\n\nHeudiasyc, CNRS/UTC\n\n60205 Compi`egne cedex, France\n\nJohnny Mari\u00b4ethoz\n\nSamy Bengio\n\nIDIAP Research Institute\n1920 Martigny, Switzerland\n\ngrandval@utc.fr\n\n{marietho,bengio}@idiap.ch\n\nAbstract\n\nIn this paper, we show that the hinge loss can be interpreted as the\nneg-log-likelihood of a semi-parametric model of posterior probabilities.\nFrom this point of view, SVMs represent the parametric component of a\nsemi-parametric model \ufb01tted by a maximum a posteriori estimation pro-\ncedure. This connection enables to derive a mapping from SVM scores\nto estimated posterior probabilities. Unlike previous proposals, the sug-\ngested mapping is interval-valued, providing a set of posterior probabil-\nities compatible with each SVM score. This framework offers a new\nway to adapt the SVM optimization problem to unbalanced classi\ufb01ca-\ntion, when decisions result in unequal (asymmetric) losses. Experiments\nshow improvements over state-of-the-art procedures.\n\n1\n\nIntroduction\n\nIn this paper, we show that support vector machines (SVMs) are the solution of a relaxed\nmaximum a posteriori (MAP) estimation problem. This relaxed problem results from \ufb01tting\na semi-parametric model of posterior probabilities. This model is decomposed into two\ncomponents: the parametric component, which is a function of the SVM score, and the\nnon-parametric component which we call a nuisance function. Given a proper binding of\nthe nuisance function adapted to the considered problem, this decomposition enables to\nconcentrate on selected ranges of the probability spectrum. 
The estimation process can thus allocate model capacity to the neighborhoods of decision boundaries.\n\nThe connection to semi-parametric models provides a probabilistic interpretation of SVM scores, which may have several applications, such as estimating confidences over the predictions, or dealing with unbalanced losses (which occur in domains such as diagnosis, intruder detection, etc.). Several mappings relating SVM scores to probabilities have already been proposed (Sollich 2000, Platt 2000), but they are subject to arbitrary choices, which are avoided here by integrating them into the nuisance function.\n\nThe paper is organized as follows. Section 2 presents the semi-parametric modeling approach; Section 3 shows how we reformulate SVMs in this framework; Section 4 proposes several outcomes of this formulation, including a new method to handle unbalanced losses, which is tested empirically in Section 5. Finally, Section 6 briefly concludes the paper.\n\n∗This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence IST-2002-506778. This publication only reflects the authors’ views.\n\n2 Semi-Parametric Classification\n\nWe address the binary classification problem of estimating a decision rule from a learning set Ln = {(xi, yi)}_{i=1}^{n}, where the ith example is described by the pattern xi ∈ X and the associated response yi ∈ {−1, 1}. In the framework of maximum likelihood estimation, classification can be addressed either via generative models, i.e. 
models of the joint distribution P(X, Y), or via discriminative methods modeling the conditional P(Y|X).\n\n2.1 Complete and Marginal Likelihood, Nuisance Functions\n\nLet p(1|x; θ) denote the model of P(Y = 1|X = x), p(x; ψ) the model of P(X), and ti the binary response variable such that ti = 1 when yi = 1 and ti = 0 when yi = −1. Assuming independent examples, the complete log-likelihood can be decomposed as\n\nL(θ, ψ; Ln) = Σ_i [ ti log(p(1|xi; θ)) + (1 − ti) log(1 − p(1|xi; θ)) + log(p(xi; ψ)) ] ,   (1)\n\nwhere the first two terms of the right-hand side represent the marginal or conditional likelihood, that is, the likelihood of p(1|x; θ).\n\nFor classification purposes, the parameter ψ is not relevant, and may thus be qualified as a nuisance parameter (Lindsay 1985). When θ can be estimated independently of ψ, maximizing the marginal likelihood provides the estimate returned by maximizing the complete likelihood with respect to θ and ψ. In particular, when no assumption whatsoever is made on P(X), maximizing the conditional likelihood amounts to maximizing the joint likelihood (McLachlan 1992). The density of inputs is then considered as a nuisance function.\n\n2.2 Semi-Parametric Models\n\nAgain, for classification purposes, estimating P(Y|X) may be considered as too demanding. Indeed, taking a decision only requires the knowledge of sign(2P(Y = 1|X = x) − 1). We may thus consider looking for the decision rule minimizing the empirical classification error, but this problem is intractable for non-trivial models of discriminant functions.\n\nHere, we briefly explore how semi-parametric models (Oakes 1988) may be used to reduce the modeling effort as compared to the standard likelihood approach. 
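As a concrete reading of the first two terms of (1), the conditional neg-log-likelihood is the usual cross-entropy. A minimal sketch (the probabilities below are arbitrary illustrative values, not a fitted model):

```python
import math

def conditional_nll(probs, targets):
    """Negative of the first two terms of (1):
    -sum_i [t_i log p(1|x_i) + (1 - t_i) log(1 - p(1|x_i))]."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(probs, targets))

# t_i = 1 when y_i = +1 and t_i = 0 when y_i = -1
targets = [1, 0, 1]
probs = [0.9, 0.2, 0.6]  # hypothetical model outputs p(1|x_i; theta)
loss = conditional_nll(probs, targets)
```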
For this, we consider a two-component semi-parametric model of P(Y = 1|X = x), defined as p(1|x; θ) = g(x; θ) + ε(x), where the parametric component g(x; θ) is the function of interest, and where the non-parametric component ε is a constrained nuisance function. Then, we address the maximum likelihood estimation of the semi-parametric model p(1|x; θ):\n\nmin_{θ,ε} −Σ_i [ ti log(p(1|xi; θ)) + (1 − ti) log(1 − p(1|xi; θ)) ]\ns.t. p(1|x; θ) = g(x; θ) + ε(x)\n     0 ≤ p(1|x; θ) ≤ 1\n     ε−(x) ≤ ε(x) ≤ ε+(x)   (2)\n\nwhere ε− and ε+ are user-defined functions, which place constraints on the non-parametric component ε. According to these constraints, one pursues different objectives, which can be interpreted as either weakened or focused versions of the original problem of estimating precisely P(Y|X) on the whole range [0, 1].\n\nAt one extreme, when ε− = ε+, one recovers a parametric maximum likelihood problem, where the estimate of posterior probabilities p(1|x; θ) is simply g(x; θ) shifted by the baseline function ε. At the other extreme, when ε−(x) ≤ −g(x) and ε+(x) ≥ 1 − g(x), p(1|·; θ) perfectly explains (interpolates) any training sample for any θ, and the optimization problem in θ is ill-posed. 
Note that the optimization problem in ε is always ill-posed, but this is not of concern as we do not wish to estimate the nuisance function.\n\n[Figure 1: Two examples of ε−(x) (dashed) and ε+(x) (plain) vs. g(x) and resulting ϵ-tube of possible values for the estimate of P(Y = 1|X = x) (gray zone) vs. g(x).]\n\nGenerally, as ε is not estimated, the estimate of posterior probabilities p(1|x; θ) is only known to lie within the interval [g(x; θ) + ε−(x), g(x; θ) + ε+(x)]. In what follows, we only consider functions ε− and ε+ expressed as functions of the argument g(x), for which the interval can be recovered from g(x) alone. We also require ε−(x) ≤ 0 ≤ ε+(x), in order to ensure that g(x; θ) is an admissible value of p(1|x; θ).\n\nTwo simple examples are displayed in Figure 1. The first two graphs represent ε− and ε+ designed to estimate posterior probabilities up to precision ϵ, and the corresponding ϵ-tube of admissible estimates knowing g(x). The last two graphs represent the same functions for ε− and ε+ defined to focus on the only relevant piece of information regarding decision: estimating where P(Y|X) is above 1/2.¹\n\n2.3 Estimation of the Parametric Component\n\nThe definitions of ε− and ε+ affect the estimation of the parametric component. 
Regarding θ, when the values of g(x; θ) + ε−(x) and g(x; θ) + ε+(x) lie within [0, 1], problem (2) is equivalent to the following relaxed maximum likelihood problem\n\nmin_{θ,ε} −Σ_i [ ti log(g(xi; θ) + εi) + (1 − ti) log(1 − g(xi; θ) − εi) ]\ns.t. ε−(xi) ≤ εi ≤ ε+(xi) ,   i = 1, . . . , n   (3)\n\nwhere ε is an n-dimensional vector of slack variables. The problem is qualified as relaxed compared to the maximum likelihood estimation of posterior probabilities by g(xi; θ), because modeling posterior probabilities by g(xi; θ) + εi is a looser objective.\n\nThe monotonicity of the objective function with respect to εi implies that the constraints ε−(xi) ≤ εi and εi ≤ ε+(xi) are saturated at the solution of (3) for ti = 0 or ti = 1, respectively. Thus, the loss in (3) is the neg-log-likelihood of the lower or the upper bound on p(1|xi; θ), respectively. Provided that g, ε− and ε+ are defined such that ε−(x) ≤ ε+(x), 0 ≤ g(x) + ε−(x) ≤ 1 and 0 ≤ g(x) + ε+(x) ≤ 1, the optimization problem with respect to θ reduces to\n\nmin_θ −Σ_i [ ti log(g(xi; θ) + ε+(xi)) + (1 − ti) log(1 − g(xi; θ) − ε−(xi)) ] .   (4)\n\nFigure 2 displays the losses for positive examples corresponding to the choices of ε− and ε+ depicted in Figure 1 (the losses are symmetrical around 0.5 for negative examples). Note that the convexity of the objective function with respect to g depends on the choices of ε− and ε+. 
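For the precision-ϵ tube of Figure 1 (left), loss (4) has a simple closed form. A minimal sketch, assuming the clipped choice ε+(x) = min(ϵ, 1 − g) and ε−(x) = max(−ϵ, −g), which is one admissible way to keep g + ε in [0, 1]:

```python
import math

def relaxed_loss(g, t, eps=0.1):
    """Loss (4) for the precision-eps tube: the slack saturates at the
    bound favourable to the example's label, as argued for problem (3)."""
    if t == 1:  # positive example: upper bound g + eps_plus
        return -math.log(min(g + eps, 1.0))
    # negative example: lower bound g + eps_minus
    return -math.log(1.0 - max(g - eps, 0.0))
```

With eps = 0.1, a positive example with g = 0.95 incurs zero loss (the tube already contains probability one), while g = 0.5 pays -log(0.6): the relaxed loss is optimistic, sitting below the neg-log-likelihood of g itself.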
One can show that, provided ε+ and ε− are respectively concave and convex functions of g, the loss (4) is convex in g.\n\nWhen ε−(x) ≤ 0 ≤ ε+(x), g(x) is an admissible estimate of P(Y = 1|x). However, the relaxed loss (4) is optimistic, below the neg-log-likelihood of g. This optimism usually results in a non-consistent estimation of posterior probabilities (i.e. g(x) does not converge towards P(Y = 1|X = x) as the sample size goes to infinity), a common situation in semi-parametric modeling (Lindsay 1985). This lack of consistency should not be a concern here, since the non-parametric component is purposely introduced to address a looser estimation problem. We should therefore restrict consistency requirements to the primary goal of having posterior probabilities in the ϵ-tube [g(x) + ε−(x), g(x) + ε+(x)].\n\n¹Of course, this naive attempt to minimize the training classification error is doomed to failure. Reformulating the problem does not affect its complexity: it remains NP-hard.\n\n[Figure 2: Losses for positive examples (plain) and neg-log-likelihood of g(x) (dotted) vs. g(x). Left: for the function ε+ displayed on the left-hand side of Figure 1; right: for the function ε+ displayed on the right-hand side of Figure 1.]\n\n3 Semi-Parametric Formulation of SVMs\n\nSeveral authors have pointed out the closeness of SVMs and the MAP approach to Gaussian processes (Sollich (2000) and references therein). However, this similarity does not provide a proper mapping from SVM scores to posterior probabilities. 
Here, we resolve this difficulty thanks to the additional degrees of freedom provided by semi-parametric modelling.\n\n3.1 SVMs and Gaussian Processes\n\nIn its primal Lagrangian formulation, the SVM optimization problem reads\n\nmin_{f,b} (1/2) ‖f‖²_H + C Σ_i [1 − yi(f(xi) + b)]+ ,   (5)\n\nwhere H is a reproducing kernel Hilbert space with norm ‖·‖_H, C is a regularization parameter and [f]+ = max(f, 0).\n\nThe penalization term in (5) can be interpreted as a Gaussian prior on f, with a covariance function proportional to the reproducing kernel of H (Sollich 2000). Then, the interpretation of the hinge loss as a marginal log-likelihood requires identifying an affine function of the last term of (5) with the first two terms of (1). We thus look for two constants c0 and c1 ≠ 0, such that, for all values of f(x) + b, there exists a value 0 ≤ p(1|x) ≤ 1 such that\n\np(1|x) = exp(−(c0 + c1 [1 − (f(x) + b)]+))\n1 − p(1|x) = exp(−(c0 + c1 [1 + (f(x) + b)]+)) .   (6)\n\nThe system (6) has a solution over the whole range of possible values of f(x) + b if and only if c0 = log(2) and c1 = 0. Thus, the SVM optimization problem does not implement the MAP approach to Gaussian processes.\n\nTo proceed with a probabilistic interpretation of SVMs, Sollich (2000) proposed a normalized probability model. The normalization functional was chosen arbitrarily, and the consequences of this choice on the probabilistic interpretation were not evaluated. In what follows, we derive an imprecise mapping, with interval-valued estimates of probabilities, representing the set of all admissible semi-parametric formulations of SVM scores.\n\n3.2 SVMs and Semi-Parametric Models\n\nWith the semi-parametric models of Section 2.2, one has to identify an affine function of the hinge loss with the two terms of (4). 
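The claim about (6) can be checked numerically: a solution requires the two right-hand sides of (6) to sum to one for every score, which fails as soon as c1 ≠ 0. A small verification sketch (not from the paper, purely illustrative):

```python
import math

def posterior_sum(s, c0, c1):
    """p(1|x) + (1 - p(1|x)) as imposed by system (6), for score
    s = f(x) + b. A consistent probability model requires this sum
    to equal 1 for every s."""
    hinge = lambda z: max(z, 0.0)
    return (math.exp(-(c0 + c1 * hinge(1 - s)))
            + math.exp(-(c0 + c1 * hinge(1 + s))))

# c0 = log 2, c1 = 0: the sum is exactly 1 for every score.
# c1 > 0 (e.g. c0 = 0, c1 = log 2): the sum deviates from 1 for |s| > 0.
```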
Compared to the previous situation, one has the freedom to define the slack functions ε− and ε+. The identification problem is now\n\ng(x) + ε+(x) = exp(−(c0 + c1 [1 − (f(x) + b)]+))\n1 − g(x) − ε−(x) = exp(−(c0 + c1 [1 + (f(x) + b)]+))\ns.t. 0 ≤ g(x) + ε−(x) ≤ 1\n     0 ≤ g(x) + ε+(x) ≤ 1\n     ε−(x) ≤ ε+(x) .   (7)\n\nProvided c0 = 0 and 0 < c1 ≤ log(2), there are functions g, ε− and ε+ such that the above problem has a solution. Hence, we obtain a set of probabilistic interpretations fully compatible with SVM scores. The solutions indexed by c1 are nested, in the sense that, for any x, the length of the uncertainty interval, ε+(x) − ε−(x), is monotonically decreasing in c1: the interpretation of SVM scores as posterior probabilities gets tighter as c1 increases. The most restricted subset of admissible interpretations, with the shortest uncertainty intervals, obtained for c1 = log(2), is represented in the left-hand side of Figure 3.\n\n[Figure 3: Left: lower (dashed) and upper (plain) posterior probabilities [g(x) + ε−(x), g(x) + ε+(x)] vs. SVM scores f(x) + b; center: corresponding neg-log-likelihood of g(x) for positive examples vs. f(x) + b; right: lower (dashed) and upper (plain) posterior probabilities vs. g(x), for g defined in (8).]\n\n
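With c0 = 0 and c1 = log(2), the bounds in (7) take closed forms, 2^{−[1−s]+} and 1 − 2^{−[1+s]+} for a score s = f(x) + b. A sketch of this interval-valued mapping, together with one admissible parametric component, the model of Sollich (2000) appearing as (8) in the text:

```python
def hinge(z):
    return max(z, 0.0)

def posterior_interval(s):
    """Tightest admissible interval [g + eps_minus, g + eps_plus] of
    posteriors for SVM score s, from (7) with c0 = 0, c1 = log 2
    (so that exp(-c1 * u) = 2 ** -u)."""
    upper = 2.0 ** -hinge(1.0 - s)
    lower = 1.0 - 2.0 ** -hinge(1.0 + s)
    return lower, upper

def g_sollich(s):
    """One admissible parametric component: the model (8) of Sollich (2000)."""
    a = 2.0 ** -hinge(1.0 - s)
    b = 2.0 ** -hinge(1.0 + s)
    return a / (a + b)
```

At s = 0 the interval collapses to {0.5}, matching the observation that estimation is accurate at the decision threshold; for |s| ≥ 1 one of the bounds sticks to 0 or 1.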
The loss incurred by a positive example is represented on the central graph, where the gray zone represents the neg-log-likelihood of all admissible solutions of g(x). Note that the hinge loss is proportional to the neg-log-likelihood of the upper posterior probability g(x) + ε+(x), which is the loss for positive examples in the semi-parametric model in (4). Conversely, the hinge loss for negative examples is reached for g(x) + ε−(x). An important observation, which will be useful in Section 4.2, is that the neg-log-likelihood of any admissible function g(x) is tangent to the hinge loss at f(x) + b = 0.\n\nThe solution is unique in terms of the admissible interval [g + ε−, g + ε+], but many definitions of (ε−, ε+, g) solve (7). For example, g may be defined as\n\ng(x; θ) = 2^{−[1−(f(x)+b)]+} / ( 2^{−[1+(f(x)+b)]+} + 2^{−[1−(f(x)+b)]+} ) ,   (8)\n\nwhich is essentially the posterior probability model proposed by Sollich (2000), represented dotted in the first two graphs of Figure 3.\n\nThe last graph of Figure 3 displays the mapping from g(x) to admissible values of p(1|x) which results from the choice described in (8). Although the interpretation of SVM scores does not require specifying g, it is worth listing some features common to all options. First, g(x) + ε−(x) = 0 for all g(x) below some threshold g0 > 0, and conversely, g(x) + ε+(x) = 1 for all g(x) above some threshold g1 < 1. These two features are responsible for the sparsity of the SVM solution. Second, the estimation of posterior probabilities is accurate at 0.5, and the length of the uncertainty interval on p(1|x) monotonically increases in [g0, 0.5] and then monotonically decreases in [0.5, g1]. 
Hence, the training objective of SVMs is intermediate between the accurate estimation of posterior probabilities on the whole range [0, 1] and the minimization of the classification risk.\n\n4 Outcomes of the Probabilistic Interpretation\n\nThis section gives two consequences of our probabilistic interpretation of SVMs. Further outcomes, still reserved for future research, are listed in Section 6.\n\n4.1 Pointwise Posterior Probabilities from SVM Scores\n\nPlatt (2000) proposed to estimate posterior probabilities from SVM scores by fitting a logistic function over the SVM scores. The only logistic function compatible with the most stringent interpretation of SVMs in the semi-parametric framework,\n\ng(x; θ) = 1 / (1 + 4^{−(f(x)+b)}) ,   (9)\n\nis identical to the model of Sollich (2000) (8) when f(x) + b lies in the interval [−1, 1]. Other logistic functions are compatible with the looser interpretations obtained by letting c1 < log(2), but their use as pointwise estimates is questionable, since the associated confidence interval is wider. In particular, the looser interpretations do not ensure that f(x) + b = 0 corresponds to g(x) = 0.5. Then, the decision function based on the posterior probabilities estimated by g(x) may differ from the SVM decision function. Being based on an arbitrary choice of g(x), pointwise estimates of posterior probabilities derived from SVM scores should be handled with caution. As discussed by Zhang (2004), they may only be consistent at f(x) + b = 0, where they may converge towards 0.5.\n\n4.2 Unbalanced Classification Losses\n\nSVMs are known to perform well regarding misclassification error, but they provide skewed decision boundaries for unbalanced classification losses, where the losses associated with incorrect decisions differ according to the true label. 
The mainstream approach used to address this problem consists in using different losses for positive and negative examples (Morik et al. 1999, Veropoulos et al. 1999), i.e.\n\nmin_{f,b} (1/2) ‖f‖²_H + C+ Σ_{i|yi=1} [1 − (f(xi) + b)]+ + C− Σ_{i|yi=−1} [1 + (f(xi) + b)]+ ,   (10)\n\nwhere the coefficients C+ and C− are constants, whose ratio is equal to the ratio of the losses ℓFN and ℓFP pertaining to false negatives and false positives, respectively (Lin et al. 2002).² Bayes’ decision theory defines the optimal decision rule by positive classification when P(y = 1|x) > P0, where P0 = ℓFP / (ℓFP + ℓFN). We may thus rewrite C+ = C · (1 − P0) and C− = C · P0. With such definitions, the optimization problem may be interpreted as an upper bound on the classification risk defined from ℓFN and ℓFP. However, the machinery of Section 3.2 unveils a major problem: the SVM decision function provided by sign(f(xi) + b) is not consistent with the probabilistic interpretation of SVM scores.\n\nWe address this problem by deriving another criterion, by requiring that the neg-log-likelihood of any admissible function g(x) be tangent to the hinge loss at f(x) + b = 0. This leads to the following problem:\n\nmin_{f,b} (1/2) ‖f‖²_H + C ( Σ_{i|yi=1} [−log(P0) − (1 − P0)(f(xi) + b)]+ + Σ_{i|yi=−1} [−log(1 − P0) + P0 (f(xi) + b)]+ ) .   (11)\n\n²False negatives/positives respectively designate positive/negative examples incorrectly classified.\n\n[Figure 4: Left: lower (dashed) and upper (plain) posterior probabilities [g(x) + ε−(x), g(x) + ε+(x)] vs. SVM scores f(x) + b obtained from (11) with P0 = 0.25; center: corresponding neg-log-likelihood of g(x) for positive examples vs. f(x) + b; right: lower (dashed) and upper (plain) posterior probabilities vs. g(x), for g defined by ε+(x) = 0 for f(x) + b ≤ 0 and ε−(x) = 0 for f(x) + b ≥ 0.]\n\nThis loss differs from (10) in that the margin for positive examples is smaller than the one for negative examples when P0 < 0.5. In particular, (10) does not affect the SVM solution for separable problems, while in (11), the decision boundary moves towards positive support vectors when P0 decreases. The analogue of Figure 3, displayed in Figure 4, shows that one recovers the characteristics of the standard SVM loss, except that the focus is now on the posterior probability P0 defined by Bayes’ decision rule.\n\n5 Experiments with Unbalanced Classification Losses\n\nIt is straightforward to implement (11) in standard SVM packages. For experimenting with difficult unbalanced two-class problems, we used the Forest database, the largest available UCI dataset (http://kdd.ics.uci.edu/databases/covertype/). We consider the subproblem of discriminating the positive class Krummholz (20510 examples) against the negative class Spruce/Fir (211840 examples). 
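The per-example loss of (11) is indeed a one-line change to the hinge, which is why it plugs into standard SVM packages. A minimal sketch (the scores and P0 values used below are illustrative, not taken from the experiments):

```python
import math

def unbalanced_hinge(s, y, p0):
    """Per-example loss of problem (11) for score s = f(x) + b, label
    y in {-1, +1}, and Bayes threshold p0 = l_FP / (l_FP + l_FN)."""
    if y == 1:
        return max(-math.log(p0) - (1.0 - p0) * s, 0.0)
    return max(-math.log(1.0 - p0) + p0 * s, 0.0)
```

At s = 0 both hinges are active, paying -log(p0) and -log(1 - p0) respectively, the neg-log-likelihoods of the two classes at the threshold P0; with p0 = 0.5 the two branches become symmetric, as in the standard balanced case.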
The ratio of negative to positive examples is high, a feature commonly encountered with unbalanced classification losses.\n\nThe training set was built by random selection of size 11 000 (1000 and 10 000 examples from the positive and negative class, respectively); a validation set, of size 11 000, was drawn identically among the other examples; finally, the test set, of size 99 000, was drawn among the remaining examples.\n\nThe performance was measured by the weighted risk function R = (1/n)(NFN ℓFN + NFP ℓFP), where NFN and NFP are the numbers of false negatives and false positives, respectively. The loss ℓFP was set to one, and ℓFN was successively set to 1, 10 and 100, in order to penalize more and more heavily errors from the under-represented class.\n\nAll approaches were tested using SVMs with a Gaussian kernel on normalized data. The hyper-parameters were tuned on the validation set for each of the ℓFN values. We additionally considered three tunings of the bias b: b̂ is the bias returned by the algorithm; b̂v is the bias obtained by minimizing R on the validation set, which is an optimistic estimate of the bias that could be computed by cross-validation. We also provide results for b∗, the optimal bias computed on the test set. This “crystal ball” tuning may not represent an achievable goal, but it shows how far we are from the optimum. Table 1 compares the risk R obtained with the three approaches for the different values of ℓFN.\n\nThe first line, with ℓFN = 1, corresponds to the standard classification error, where all training criteria are equivalent in theory and in practice. The bias returned by the algorithm is very close to the optimal one. For ℓFN = 10 and ℓFN = 100, the models obtained by optimizing C+/C− (10) and P0 (11) achieve better results than the baseline with the crystal ball bias. 
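The weighted risk used for evaluation can be sketched as follows (labels and predictions in {-1, +1}; the example values are arbitrary, not the paper's results):

```python
def weighted_risk(y_true, y_pred, l_fn, l_fp=1.0):
    """R = (N_FN * l_FN + N_FP * l_FP) / n, as defined in Section 5."""
    n_fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    n_fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    return (n_fn * l_fn + n_fp * l_fp) / len(y_true)
```

With ℓFN = 100 a single missed positive costs as much as one hundred false alarms, which is what makes the bias tuning so critical in the last line of Table 1.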
While the solutions returned by C+/C− can be significantly improved by tuning the bias, our criterion provides results that are very close to the optimum, in the range of the performances obtained with the bias optimized on an independent validation set. The new optimization criterion can thus outperform standard approaches for highly unbalanced problems.\n\nTable 1: Errors for three different training criteria and three different bias tunings over the Forest database\n\nℓFN | Baseline (5): b̂ / b∗ | C+/C− (10): b̂ / b̂v / b∗ | P0 (11): b̂ / b̂v / b∗\n1 | 0.027 / 0.026 | 0.027 / 0.027 / 0.026 | 0.027 / 0.027 / 0.026\n10 | 0.167 / 0.108 | 0.105 / 0.104 / 0.094 | 0.095 / 0.104 / 0.094\n100 | 1.664 / 0.406 | 0.403 / 0.291 / 0.289 | 0.295 / 0.291 / 0.289\n\n6 Conclusion\n\nThis paper introduced a semi-parametric model for classification which provides an interesting viewpoint on SVMs. The non-parametric component provides an intuitive means of transforming the likelihood into a decision-oriented criterion. This framework was used here to propose a new parameterization of the hinge loss, dedicated to unbalanced classification problems, yielding significant improvements over the classical procedure.\n\nAmong other prospects, we plan to apply the same framework to investigate hinge-like criteria for decision rules including a reject option, where the classifier abstains when a pattern is ambiguous. We also aim at defining losses encouraging sparsity in probabilistic models, such as kernelized logistic regression. We could thus build sparse probabilistic classifiers, providing an accurate estimation of posterior probabilities on a (limited) predefined range. In particular, we could derive decision-oriented criteria for multi-class probabilistic classifiers. 
For example, minimizing classification error only requires finding the class with the highest posterior probability, and this search does not require precise estimates of probabilities outside the interval [1/K, 1/2], where K is the number of classes.\n\nReferences\n\nY. Lin, Y. Lee, and G. Wahba. Support vector machines for classification in non-standard situations. Machine Learning, 46:191–202, 2002.\n\nB. G. Lindsay. Nuisance parameters. In S. Kotz, C. B. Read, and D. L. Banks, editors, Encyclopedia of Statistical Sciences, volume 6. Wiley, 1985.\n\nG. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley, 1992.\n\nK. Morik, P. Brockhausen, and T. Joachims. Combining statistical learning with a knowledge-based approach - a case study in intensive care monitoring. In Proceedings of ICML, 1999.\n\nD. Oakes. Semi-parametric models. In S. Kotz, C. B. Read, and D. L. Banks, editors, Encyclopedia of Statistical Sciences, volume 8. Wiley, 1988.\n\nJ. C. Platt. Probabilities for SV machines. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. MIT Press, 2000.\n\nP. Sollich. Probabilistic methods for support vector machines. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 349–355, 2000.\n\nK. Veropoulos, C. Campbell, and N. Cristianini. Controlling the sensitivity of support vector machines. In T. Dean, editor, Proc. of the IJCAI, pages 55–60, 1999.\n\nT. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. 
Annals of Statistics, 32(1):56–85, 2004.", "award": [], "sourceid": 2763, "authors": [{"given_name": "Yves", "family_name": "Grandvalet", "institution": null}, {"given_name": "Johnny", "family_name": "Mariethoz", "institution": null}, {"given_name": "Samy", "family_name": "Bengio", "institution": null}]}