{"title": "An Infinity-sample Theory for Multi-category Large Margin Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1077, "page_last": 1084, "abstract": "", "full_text": "An Infinity-sample Theory for Multi-category Large Margin Classification\n\nTong Zhang\n\nIBM T.J. Watson Research Center\n\nYorktown Heights, NY 10598\n\ntzhang@watson.ibm.com\n\nAbstract\n\nThe purpose of this paper is to investigate infinity-sample properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions to binary large margin classification. We establish conditions that guarantee the infinity-sample consistency of classifiers obtained in the risk minimization framework. Examples are provided for two specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information will be useful for statistical inferencing tasks beyond classification.\n\n1 Motivation\n\nConsider a binary classification problem where we want to predict label $y \\in \\{\\pm 1\\}$ based on observation $x$. One of the most significant achievements for binary classification in machine learning is the invention of large margin methods, which include support vector machines and boosting algorithms. Based on a set of observations $(X_1, Y_1), \\ldots, (X_n, Y_n)$, a large margin classification algorithm produces a decision function $\\hat{f}_n$ by empirically minimizing a loss function that is often a convex upper bound of the binary classification error function. Given $\\hat{f}_n$, the binary decision rule is to predict $y = 1$ if $\\hat{f}_n(x) \\geq 0$, and to predict $y = -1$ otherwise (the decision rule at $\\hat{f}_n(x) = 0$ is not important). 
In the literature, the following form of large margin binary classification is often encountered: we minimize the empirical risk associated with a convex function $\\phi$ in a pre-chosen function class $C_n$:\n\n$$\\hat{f}_n = \\arg\\min_{f \\in C_n} \\frac{1}{n} \\sum_{i=1}^{n} \\phi(f(X_i) Y_i). \\tag{1}$$\n\nOriginally such a scheme was regarded as a compromise to avoid computational difficulties associated with direct classification error minimization, which often leads to an NP-hard problem. The current view in the statistical literature interprets such methods as algorithms to obtain conditional probability estimates. For example, see [3, 6, 9, 11] for some related studies. This point of view allows people to show the consistency of various large margin methods: that is, in the large sample limit, the obtained classifiers achieve the optimal Bayes error rate. For example, see [1, 4, 7, 8, 10, 11]. The consistency of a learning method is certainly a very desirable property, and one may argue that a good classification method should be consistent in the large sample limit.\n\nAlthough statistical properties of binary classification algorithms based on the risk minimization formulation (1) are quite well-understood due to many recent works such as those mentioned above, there are far fewer studies on risk minimization based multi-category problems that generalize the binary large margin method (1). The complexity of possible generalizations may be one reason. 
Another reason may be that one can always estimate the conditional probability for a multi-category problem using the binary classification formulation (1) for each category, and then pick the category with the highest estimated conditional probability (or score; see footnote 1). However, it is still useful to understand whether there are more natural alternatives, and what kind of risk minimization formulation that generalizes (1) can be used to yield consistent classifiers in the large sample limit. An important step in this direction has recently been taken in [5], where the authors proposed a multi-category extension of the support vector machine that is Bayes consistent (note that there were a number of earlier proposals that were not consistent). The purpose of this paper is to generalize their investigation so as to include a much wider class of risk minimization formulations that can lead to consistent classifiers in the infinity-sample limit. We shall see that there is a rich structure in risk minimization based multi-category classification formulations. Multi-category large margin methods have started to draw more attention recently. For example, in [2], learning bounds for some multi-category convex risk minimization methods were obtained, although the authors did not study possible choices of Bayes consistent formulations.\n\n2 Multi-category classification\n\nWe consider the following $K$-class classification problem: we would like to predict the label $y \\in \\{1, \\ldots, K\\}$ of an input vector $x$. In this paper, we only consider the simplest scenario with 0-1 classification loss: we have a loss of 0 for correct prediction, and a loss of 1 for incorrect prediction.\n\nIn binary classification, the class label can be determined using the sign of a decision function. This can be generalized to the $K$-class classification problem as follows: we consider $K$ decision functions $f_c(x)$ where c = 1, . . . 
, K and we predict the label $y$ of $x$ as:\n\n$$T(f(x)) = \\arg\\max_{c \\in \\{1, \\ldots, K\\}} f_c(x), \\tag{2}$$\n\nwhere we denote by $f(x)$ the vector function $f(x) = [f_1(x), \\ldots, f_K(x)]$. Note that if two or more components of $f$ achieve the same maximum value, then we may choose any of them as $T(f)$. In this framework, $f_c(x)$ is often regarded as a scoring function for category $c$ that is correlated with how likely $x$ belongs to category $c$ (compared with the remaining $K - 1$ categories). The classification error is given by:\n\n$$\\ell(f) = 1 - E_X P(Y = T(f(X)) | X).$$\n\nNote that only the relative strength of $f_c$ compared with the alternatives is important. In particular, the decision rule given in (2) does not change when we add the same numerical quantity to each component of $f(x)$. This allows us to impose one constraint on the vector $f(x)$, which decreases the degree of freedom of the $K$-component vector $f(x)$ from $K$ to $K - 1$. For example, in the binary classification case, we can enforce $f_1(x) + f_2(x) = 0$, and hence $f(x)$ can be represented as $[f_1(x), -f_1(x)]$. The decision rule in (2), which compares $f_1(x) \\geq f_2(x)$, is equivalent to $f_1(x) \\geq 0$. This leads to the binary classification rule mentioned in the introduction.\n\nFootnote 1: This approach is often called one-versus-all or ranking in machine learning. Another main approach is to encode a multi-category classification problem into binary classification sub-problems. The consistency of such encoding schemes can be difficult to analyze, and we shall not discuss them.\n\nIn the multi-category case, one may also interpret the possible constraint on the vector function $f$, which reduces its degree of freedom from $K$ to $K - 1$, based on the following reasoning. In many cases, we seek $f_c(x)$ as a function of $p(Y = c|x)$. Since we have a constraint $\\sum_{c=1}^{K} p(Y = c|x) = 1$ (implying that the degree of freedom for $p(Y = c|x)$ is $K - 1$), the degree of freedom for $f$ is also $K - 1$ (instead of $K$). However, we shall point out that in the algorithms we formulate below, we may either enforce such a constraint that reduces the degree of freedom of $f$, or we do not impose any constraint, which keeps the degree of freedom of $f$ at $K$. The advantage of the latter is that it allows the computation of each $f_c$ to be decoupled. It is thus much simpler both conceptually and numerically. Moreover, it directly handles multiple-label problems where we may assign each $x$ to multiple labels $y \\in \\{1, \\ldots, K\\}$. In this scenario, we do not have a constraint.\n\nIn this paper, we consider an empirical risk minimization method to solve a multi-category problem, which is of the following general form:\n\n$$\\hat{f}_n = \\arg\\min_{f \\in C_n} \\frac{1}{n} \\sum_{i=1}^{n} \\Psi_{Y_i}(f(X_i)). \\tag{3}$$\n\nAs we shall see later, this method is a natural generalization of the binary classification method (1). Note that one may consider an even more general form with $\\Psi_Y(f(X))$ replaced by $\\Psi_Y(f(X), X)$, which we do not study in this paper.\n\nFrom the standard learning theory, one can expect that with appropriately chosen $C_n$, the solution $\\hat{f}_n$ of (3) approximately minimizes the true risk $R(\\hat{f})$ with respect to the unknown underlying distribution within the function class $C_n$,\n\n$$R(f) = E_{X,Y} \\Psi_Y(f(X)) = E_X L(P(\\cdot|X), f(X)), \\tag{4}$$\n\nwhere $P(\\cdot|X) = [P(Y = 1|X), \\ldots, P(Y = K|X)]$ is the conditional probability, and\n\n$$L(q, f) = \\sum_{c=1}^{K} q_c \\Psi_c(f). \\tag{5}$$\n\nIn order to understand the large sample behavior of the algorithm based on solving (3), we first need to understand the behavior of a function $f$ that approximately minimizes $R(f)$. We introduce the following definition (also referred to as classification calibrated in [1]):\n\nDefinition 2.1 Consider $\\Psi_c(f)$ in (4). We say that the formulation is admissible (classification calibrated) on a closed set $\\Omega \\subseteq [-\\infty, \\infty]^K$ if the following conditions hold: $\\forall c$, $\\Psi_c(\\cdot) : \\Omega \\to (-\\infty, \\infty]$ is bounded below and continuous; $\\cap_c \\{f : \\Psi_c(f) < \\infty\\}$ is non-empty and dense in $\\Omega$; $\\forall q$, if $L(q, f^*) = \\inf_f L(q, f)$, then $f^*_c = \\sup_k f^*_k$ implies $q_c = \\sup_k q_k$.\n\nSince we allow $\\Psi_c(f) = \\infty$, we use the convention that $q_c \\Psi_c(f) = 0$ when $q_c = 0$ and $\\Psi_c(f) = \\infty$. The following result relates the approximate minimization of the $\\Psi$ risk to the approximate minimization of classification error:\n\nTheorem 2.1 Let $B$ be the set of all Borel measurable functions. For a closed set $\\Omega \\subset [-\\infty, \\infty]^K$, let $B_\\Omega = \\{f \\in B : \\forall x, f(x) \\in \\Omega\\}$. If $\\Psi_c(\\cdot)$ is admissible on $\\Omega$, then for a Borel measurable distribution, $R(f) \\to \\inf_{g \\in B_\\Omega} R(g)$ implies $\\ell(f) \\to \\inf_{g \\in B} \\ell(g)$.\n\nProof Sketch. First we show that the admissibility implies that $\\forall \\epsilon > 0$, $\\exists \\delta > 0$ such that $\\forall q$ and $x$:\n\n$$\\inf_{q_c \\leq \\sup_k q_k - \\epsilon} \\{L(q, f) : f_c = \\sup_k f_k\\} \\geq \\inf_{g \\in \\Omega} L(q, g) + \\delta. \\tag{6}$$\n\nIf (6) does not hold, then $\\exists \\epsilon > 0$, and a sequence of $(c_m, f^m, q^m)$ with $f^m \\in \\Omega$ such that $f^m_{c_m} = \\sup_k f^m_k$, $q^m_{c_m} \\leq \\sup_k q^m_k - \\epsilon$, and $L(q^m, f^m) - \\inf_{g \\in \\Omega} L(q^m, g) \\to 0$. Taking a limit point of $(c_m, f^m, q^m)$, and using the continuity of $\\Psi_c(\\cdot)$, we obtain a contradiction (technical details handling the infinity case are skipped). Therefore (6) must be valid.\n\nNow we consider a vector function $f(x) \\in B_\\Omega$. Let $q(x) = P(\\cdot|x)$. Given $X$, if $P(Y = T(f(X))|X) \\leq P(Y = T(q(X))|X) - \\epsilon$, then equation (6) implies that $L(q(X), f(X)) \\geq \\inf_{g \\in \\Omega} L(q(X), g) + \\delta$. Therefore\n\n$$\\begin{aligned} \\ell(f) - \\inf_{g \\in B} \\ell(g) &= E_X [P(Y = T(q(X))|X) - P(Y = T(f(X))|X)] \\\\ &\\leq \\epsilon + E_X I(P(Y = T(q(X))|X) - P(Y = T(f(X))|X) > \\epsilon) \\\\ &\\leq \\epsilon + E_X \\frac{L_X(q(X), f(X)) - \\inf_{g \\in B_\\Omega} L_X(q(X), g)}{\\delta} \\\\ &= \\epsilon + \\frac{R(f) - \\inf_{g \\in B_\\Omega} R(g)}{\\delta}. \\end{aligned}$$\n\nIn the above derivation we use $I$ to denote the indicator function. Since $\\epsilon$ and $\\delta$ are arbitrary, we obtain the theorem by letting $\\epsilon \\to 0$. $\\Box$\n\nClearly, based on the above theorem, an admissible risk minimization formulation is suitable for multi-category classification problems. The classifier obtained from minimizing (3) can approach the Bayes error rate if we can show that with an appropriately chosen function class $C_n$, approximate minimization of (3) implies approximate minimization of (4). Learning bounds of this form have been very well-studied in statistics and machine learning. For example, for large margin binary classification, such bounds can be found in [4, 7, 8, 10, 11, 1], where they were used to prove the consistency of various large margin methods. In order to achieve consistency, it is also necessary to take a sequence of function classes $C_n$ ($C_1 \\subset C_2 \\subset \\cdots$) such that $\\cup_n C_n$ is dense in the set of Borel measurable functions. The set $C_n$ has the effect of regularization, which ensures that $R(\\hat{f}_n) \\approx \\inf_{f \\in C_n} R(f)$. 
It follows that as $n \\to \\infty$, $R(\\hat{f}_n) \\xrightarrow{P} \\inf_{f \\in B} R(f)$. Theorem 2.1 then implies that $\\ell(\\hat{f}_n) \\xrightarrow{P} \\inf_{f \\in B} \\ell(f)$.\n\nThe purpose of this paper is not to study similar learning bounds that relate approximate minimization of (3) to the approximate minimization of (4). See [2] for a recent investigation. We shall focus on the choices of $\\Psi$ that lead to admissible formulations. We pay special attention to the case that each $\\Psi_c(f)$ is a convex function of $f$, so that the resulting formulation becomes computationally more tractable. Instead of working with the general form of $\\Psi_c$ in (4), we focus on two specific choices listed in the next two sections.\n\n3 Unconstrained formulations\n\nWe consider the unconstrained formulation with the following choice of $\\Psi$:\n\n$$\\Psi_c(f) = \\phi(f_c) + s\\left(\\sum_{k=1}^{K} t(f_k)\\right), \\tag{7}$$\n\nwhere $\\phi$, $s$ and $t$ are appropriately chosen functions that are continuously differentiable.\n\nThe first term, which has a relatively simple form, depends on the label $c$. The second term is independent of the label, and can be regarded as a normalization term. Note that this function is symmetric with respect to components of $f$. This choice treats all potential classes equally. It is also possible to treat different classes differently (e.g. replacing $\\phi(f_c)$ by $\\phi_c(f_c)$), which can be useful if we associate different classification losses with different kinds of errors.\n\n3.1 Optimality equation and probability model\n\nUsing (7), the conditional true risk (5) can be written as:\n\n$$L(q, f) = \\sum_{c=1}^{K} q_c \\phi(f_c) + s\\left(\\sum_{c=1}^{K} t(f_c)\\right).$$\n\nIn the following, we study the property of the optimal vector $f^*$ that minimizes $L(q, f)$ for a fixed $q$. Given $q$, the optimal solution $f^*$ of $L(q, f)$ satisfies the following first order condition:\n\n$$q_c \\phi'(f^*_c) + \\mu_{f^*} t'(f^*_c) = 0 \\quad (c = 1, \\ldots, K), \\tag{8}$$\n\nwhere the quantity $\\mu_{f^*} = s'(\\sum_{k=1}^{K} t(f^*_k))$ is independent of $c$. Clearly this equation relates $q_c$ to $f^*_c$ for each component $c$. The relationship of $q$ and $f^*$ defined by (8) can be regarded as the (infinite sample-size) probability model associated with the learning method (3) with $\\Psi$ given by (7).\n\nThe following result presents a simple criterion to check admissibility. We skip the proof for simplicity. Most of our examples satisfy the condition.\n\nProposition 3.1 Consider (7). Assume $\\Psi_c(f)$ is continuous on $[-\\infty, \\infty]^K$ and bounded below. If $s'(u) \\geq 0$ and $\\forall p > 0$, $p \\phi'(f) + t'(f) = 0$ has a unique solution $f_p$ that is an increasing function of $p$, then the formulation is admissible.\n\nIf $s(u) = u$, the condition $\\forall p > 0$ in Proposition 3.1 can be replaced by $\\forall p \\in (0, 1)$.\n\n3.2 Decoupled formulations\n\nWe let $s(u) = u$ in (7). The optimality condition (8) becomes\n\n$$q_c \\phi'(f^*_c) + t'(f^*_c) = 0 \\quad (c = 1, \\ldots, K). \\tag{9}$$\n\nThis means that we have $K$ decoupled equalities, one for each $f_c$. This is the simplest and, in the author\u2019s opinion, the most interesting formulation. Since the estimation problem in (3) is also decoupled into $K$ separate equations, one for each component of $\\hat{f}_n$, this class of methods is computationally relatively simple and easy to parallelize. Although this method seems to be preferable for multi-category problems, it is not the most efficient way for the two-class problem (if we want to treat the two classes in a symmetric manner) since we have to solve two separate equations. We only need to deal with one equation in (1) due to the fact that an effective constraint $f_1 + f_2 = 0$ can be used to reduce the number of equations. 
This variable elimination has little impact if there are many categories.\n\nIn the following, we list some examples of multi-category risk minimization formulations. They all satisfy the admissibility condition in Proposition 3.1. We focus on the relationship of the optimal function $f^*(q)$ and the conditional probability $q$. For simplicity, we focus on the choice $\\phi(u) = -u$.\n\n3.2.1 $\\phi(u) = -u$ and $t(u) = e^u$\n\nWe obtain the following probability model: $q_c = e^{f^*_c}$. This formulation is closely related to the maximum-likelihood estimate with conditional model $q_c = e^{f_c} / \\sum_{k=1}^{K} e^{f_k}$ (logistic regression). In particular, if we choose a function class such that the normalization condition $\\sum_{k=1}^{K} e^{f_k} = 1$ holds, then the two formulations are identical. However, they become different when we do not impose such a normalization condition.\n\nAnother very important and closely related formulation is the choice of $\\phi(u) = -\\ln u$ and $t(u) = u$. This is an extension of the maximum-likelihood estimate with probability model $q_c = f_c$. The resulting method is identical to maximum-likelihood if we choose our function class such that $\\sum_k f_k = 1$. However, the formulation also allows us to use function classes that do not satisfy the normalization constraint $\\sum_k f_k = 1$. Therefore this method is more flexible.\n\n3.2.2 $\\phi(u) = -u$ and $t(u) = \\ln(1 + e^u)$\n\nThis version uses binary logistic regression loss, and we have the following probability model: $q_c = (1 + e^{-f^*_c})^{-1}$. Again this is an unnormalized model.\n\n3.2.3 $\\phi(u) = -u$ and $t(u) = \\frac{1}{p} |u|^p$ $(p > 1)$\n\nWe obtain the following probability model: $q_c = \\mathrm{sign}(f^*_c) |f^*_c|^{p-1}$. This means that at the solution, $f^*_c \\geq 0$. One may modify it such that we allow $f^*_c \\leq 0$ to model the conditional probability $q_c = 0$.\n\n3.2.4 $\\phi(u) = -u$ and $t(u) = \\frac{1}{p} \\max(u, 0)^p$ $(p > 1)$\n\nIn this probability model, we have the following relationship: $q_c = \\max(f^*_c, 0)^{p-1}$. The equation implies that we allow $f^*_c \\leq 0$ to model the conditional probability $q_c = 0$. Therefore, with a fixed function class, this model is more powerful than the previous one. However, at the optimal solution, $f^*_c \\leq 1$. This requirement can be further alleviated with the following modification.\n\n3.2.5 $\\phi(u) = -u$ and $t(u) = \\frac{1}{p} \\min(\\max(u, 0)^p, p(u - 1) + 1)$ $(p > 1)$\n\nIn this probability model, we have the following relationship at the exact solution: $q_c = \\min(\\max(f^*_c, 0), 1)^{p-1}$. Clearly this model is more powerful than the previous model since the function value $f^*_c \\geq 1$ can be used to model $q_c = 1$.\n\n3.3 Coupled formulations\n\nIn the coupled formulation with $s(u) \\neq u$, the probability model can be normalized in a certain way. We list a few examples.\n\n3.3.1 $\\phi(u) = -u$, and $t(u) = e^u$, and $s(u) = \\ln(u)$\n\nThis is the standard logistic regression model. The probability model is:\n\n$$q_c(x) = \\exp(f^*_c(x)) \\left(\\sum_{k=1}^{K} \\exp(f^*_k(x))\\right)^{-1}.$$\n\nThe right hand side is always normalized (sums up to 1). Note that the model is not continuous at infinities, and thus not admissible in our definition. 
However, we may consider the region $\\Omega = \\{f : \\sup_k f_k = 0\\}$, and it is easy to check that this model is admissible in $\\Omega$. Let $f^\\Omega_c = f_c - \\sup_k f_k$, so that $f^\\Omega \\in \\Omega$; then $f^\\Omega$ has the same decision rule as $f$ and $R(f) = R(f^\\Omega)$. Therefore Theorem 2.1 implies that $R(f) \\to \\inf_{g \\in B} R(g)$ implies $\\ell(f) \\to \\inf_{g \\in B} \\ell(g)$.\n\n3.3.2 $\\phi(u) = -u$, and $t(u) = |u|^{p'}$, and $s(u) = \\frac{1}{p} |u|^{p/p'}$ $(p, p' > 1)$\n\nThe probability model is:\n\n$$q_c(x) = \\left(\\sum_{k=1}^{K} |f^*_k(x)|^{p'}\\right)^{(p-p')/p'} \\mathrm{sign}(f^*_c(x)) |f^*_c(x)|^{p'-1}.$$\n\nWe may replace $t(u)$ by $t(u) = \\max(0, u)^{p'}$, and the probability model becomes:\n\n$$q_c(x) = \\left(\\sum_{k=1}^{K} \\max(f^*_k(x), 0)^{p'}\\right)^{(p-p')/p'} \\max(f^*_c(x), 0)^{p'-1}.$$\n\nThese formulations do not seem to have advantages over the decoupled counterparts. Note that if we let $p \\to 1$, then the sum of the $\\frac{p'}{p'-1}$-th power of the right hand side $\\to 1$. In a way, this means that the model is normalized in the limit of $p \\to 1$.\n\n4 Constrained formulations\n\nAs pointed out, one may impose constraints on possible choices of $f$. We may impose such a condition when we specify the function class $C_n$. However, for clarity, we shall directly impose a condition in our formulation. If we impose a constraint in (7), then its effect is rather similar to that of the second term in (7). In this section, we consider a direct extension of the binary large-margin method (1) to the multi-category case. The choice given below is motivated by [5], where an extension of SVM was proposed. We use a risk formulation that is different from (7), and for simplicity, we will consider a linear equality constraint only:\n\n$$\\Psi_c(f) = \\sum_{k=1, k \\neq c}^{K} \\phi(-f_k), \\quad \\text{s.t. } f \\in \\Omega, \\tag{10}$$\n\nwhere we define $\\Omega$ as:\n\n$$\\Omega = \\left\\{f : \\sum_{k=1}^{K} f_k = 0\\right\\} \\cup \\left\\{f : \\sup_k f_k = \\infty\\right\\}.$$\n\nWe may interpret the added constraint as a restriction on the function class $C_n$ in (3) such that every $f \\in C_n$ satisfies the constraint. Note that with $K = 2$, this leads to the usual binary large margin method. Using (10), the conditional true risk (5) can be written as:\n\n$$L(q, f) = \\sum_{c=1}^{K} (1 - q_c) \\phi(-f_c), \\quad \\text{s.t. } f \\in \\Omega. \\tag{11}$$\n\nThe following result provides a simple way to check the admissibility of (10).\n\nProposition 4.1 If $\\phi$ is a convex function which is bounded below and $\\phi'(0) < 0$, then (10) is admissible on $\\Omega$.\n\nProof Sketch. The continuity condition is straightforward to verify. We may also assume that $\\phi(\\cdot) \\geq 0$ without loss of generality. Now let $f$ achieve the minimum of $L(q, \\cdot)$. If $f_c = \\infty$, then it is clear that $q_c = 1$ and thus $q_k = 0$ for $k \\neq c$. This implies that for $k \\neq c$, $\\phi(-f_k) = \\inf_f \\phi(-f)$, and thus $f_k < 0$. If $f_c = \\sup_k f_k < \\infty$, then the constraint implies $f_c \\geq 0$. It is easy to see that $\\forall k$, $q_c \\geq q_k$ since otherwise, we must have $\\phi(-f_k) > \\phi(-f_c)$, and thus $\\phi'(-f_k) > 0$ and $\\phi'(-f_c) < 0$, implying that with sufficiently small $\\delta > 0$, $\\phi(-(f_k + \\delta)) < \\phi(-f_k)$ and $\\phi(-(f_c - \\delta)) < \\phi(-f_c)$. A contradiction. $\\Box$\n\nUsing the above criterion, we can convert any admissible convex $\\phi$ for the binary formulation (1) into an admissible multi-category classification formulation (10).\n\nIn [5] the special case of SVM (with loss function $\\phi(u) = \\max(0, 1 - u)$) was studied. 
The\nauthors demonstrated the admissibility by direct calculation, although no results similar to\nTheorem 2.1 were established. Such a result is needed to prove consistency. The treatment\npresented here generalizes their study. Note that for the constrained formulation, it is more\ndif\ufb01cult to relate fc at the optimal solution to a probability model, since such a model will\nhave a much more complicated form compared with the unconstrained counterpart.\n\n5 Conclusion\n\nIn this paper we proposed a family of risk minimization methods for multi-category classi-\n\ufb01cation problems, which are natural extensions of binary large margin classi\ufb01cation meth-\nods. We established admissibility conditions that ensure the consistency of the obtained\nclassi\ufb01ers in the large sample limit. Two speci\ufb01c forms of risk minimization were pro-\nposed and examples were given to study the induced probability models. As an implication\nof this work, we see that it is possible to obtain consistent (conditional) density estimation\nusing various non-maximum likelihood estimation methods. One advantage of some of the\nnewly proposed methods is that they allow us to model zero density directly. Note that for\nthe maximum-likelihood method, near zero density may cause serious robustness problems\nat least in theory.\n\nReferences\n\n[1] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classi\ufb01cation, and risk\nbounds. Technical Report 638, Statistics Department, University of California, Berke-\nley, 2003.\n\n[2] Ilya Desyatnikov and Ron Meir. Data-dependent bounds for multi-category classi\ufb01-\n\ncation based on convex losses. In COLT, 2003.\n\n[3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical\n\nview of boosting. The Annals of Statistics, 28(2):337\u2013407, 2000. With discussion.\n\n[4] W. Jiang. Process consistency for adaboost. The Annals of Statistics, 32, 2004. with\n\ndiscussion.\n\n[5] Y. Lee, Y. 
Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of American Statistical Association, 2002. Accepted.\n\n[6] Yi Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, pages 259\u2013275, 2002.\n\n[7] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32, 2004. With discussion.\n\n[8] Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification - consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713\u2013741, 2003.\n\n[9] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297\u2013336, 1999.\n\n[10] Ingo Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768\u2013791, 2002.\n\n[11] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32, 2004. With discussion.\n", "award": [], "sourceid": 2451, "authors": [{"given_name": "Tong", "family_name": "Zhang", "institution": null}]}