{"title": "Multiclass Learning Approaches: A Theoretical Comparison with Implications", "book": "Advances in Neural Information Processing Systems", "page_first": 485, "page_last": 493, "abstract": "We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension $d$, and in particular from the class of halfspaces over $\\reals^d$. We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the \\emph{approximation error} of hypothesis classes. This is in sharp contrast to most, if not all, previous uses of VC theory, which only deal with estimation error.", "full_text": "Multiclass Learning Approaches:\n\nA Theoretical Comparison with Implications\n\nAmit Daniely\n\nDepartment of Mathematics\n\nThe Hebrew University\n\nJerusalem, Israel\n\nSivan Sabato\n\nMicrosoft Research\n1 Memorial Drive\n\nCambridge, MA 02142, USA\n\nShai Shalev-Shwartz\nSchool of CS and Eng.\nThe Hebrew University\n\nJerusalem, Israel\n\nAbstract\n\nWe theoretically analyze and compare the following \ufb01ve popular multiclass clas-\nsi\ufb01cation methods: One vs. All, All Pairs, Tree-based classi\ufb01ers, Error Correcting\nOutput Codes (ECOC) with randomly generated code matrices, and Multiclass\nSVM. In the \ufb01rst four methods, the classi\ufb01cation is based on a reduction to binary\nclassi\ufb01cation. 
We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd. We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which deal only with estimation error.

1 Introduction

In this work we consider multiclass prediction: the problem of classifying objects into one of several possible target classes. Applications include, for example, categorizing documents according to topic, and determining which object appears in a given image. We assume that objects (a.k.a. instances) are vectors in X = Rd and the class labels come from the set Y = [k] = {1, . . . , k}. Following the standard PAC model, the learner receives a training set of m examples, drawn i.i.d. from some unknown distribution, and should output a classifier which maps X to Y.

The centrality of the multiclass learning problem has spurred the development of various approaches for tackling the task. Perhaps the most straightforward approach is a reduction from multiclass classification to binary classification. For example, the One-vs-All (OvA) method is based on a reduction of the multiclass problem to k binary problems, each of which discriminates between one class and the rest of the classes (e.g. Rumelhart et al. [1986]). A different reduction is the All-Pairs (AP) approach, in which all pairs of classes are compared to each other [Hastie and Tibshirani, 1998]. These two approaches have been unified under the framework of Error Correcting Output Codes (ECOC) [Dietterich and Bakiri, 1995, Allwein et al., 2000].
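To make the unifying ECOC view concrete, here is a minimal decoding sketch in numpy: given a k × l code matrix M and a vector u of l binary predictions, pick the row of M most correlated with u. The one-vs-all code matrix below is the standard encoding from Allwein et al. [2000], shown purely as an illustration (the function name and 0-based labels are our own conventions, not the paper's).

```python
import numpy as np

def ecoc_decode(M, u):
    """Decode a vector u of l binary predictions by picking the row of the
    k x l code matrix M most correlated with u: argmax_i sum_j M[i, j] * u[j]."""
    return int(np.argmax(M @ u))

# One-vs-All expressed as an ECOC code: +1 on the diagonal, -1 elsewhere
# (the standard encoding; labels here are 0-based for convenience).
k = 4
M_ova = 2 * np.eye(k) - np.ones((k, k))
u = np.array([-1.0, 1.0, -1.0, -1.0])  # the binary classifiers voted for class 1
print(ecoc_decode(M_ova, u))  # -> 1
```

Under this encoding, the decoder simply returns the class whose one-vs-rest classifier fired, which is how OvA arises as a special case of ECOC.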
A tree-based classifier (TC) is another reduction in which the prediction is obtained by traversing a binary tree, where at each node of the tree a binary classifier is used to decide on the rest of the path (see for example Beygelzimer et al. [2007]).

All of the above methods are based on reductions to binary classification. We pay special attention to the case where the underlying binary classifiers are linear separators (halfspaces). Formally, each w ∈ Rd+1 defines the linear separator hw(x) = sign(⟨w, x̄⟩), where x̄ = (x, 1) ∈ Rd+1 is the concatenation of the vector x and the scalar 1. While halfspaces are our primary focus, many of our results hold for any underlying binary hypothesis class of VC dimension d + 1.

Other, more direct approaches to multiclass classification over Rd have also been proposed (e.g. Vapnik [1998], Weston and Watkins [1999], Crammer and Singer [2001]). In this paper we analyze the Multiclass SVM (MSVM) formulation of Crammer and Singer [2001], in which each hypothesis is of the form hW(x) = argmax_{i∈[k]} (W x̄)_i, where W is a k × (d + 1) matrix and (W x̄)_i is the i'th element of the vector W x̄ ∈ Rk.

We theoretically analyze the prediction performance of the aforementioned methods, namely, OvA, AP, ECOC, TC, and MSVM. The error of a multiclass predictor h : Rd → [k] is defined to be the probability that h(x) ≠ y, where (x, y) is sampled from the underlying distribution D over Rd × [k], namely, Err(h) = P_{(x,y)∼D}[h(x) ≠ y]. Our main goal is to understand which method is preferable in terms of the error it will achieve, based on easy-to-verify properties of the problem at hand.

Our analysis pertains to the type of classifiers each method can potentially find, and does not depend on the specific training algorithm.
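Both hypothesis forms just described are easy to state in code. A minimal numpy sketch follows; the weight matrix and the example tree are arbitrary placeholders (not learned classifiers), and the nested-tuple tree representation is our own convention.

```python
import numpy as np

def msvm_predict(W, x):
    """MSVM hypothesis h_W(x) = argmax_i (W xbar)_i, with xbar = (x, 1)."""
    xbar = np.append(x, 1.0)
    return int(np.argmax(W @ xbar))  # 0-based label in {0, ..., k-1}

def tree_predict(node, x):
    """Tree-based classifier: at each internal node apply the halfspace
    score <w, xbar>; go right on a positive sign, otherwise left,
    until reaching a leaf label."""
    xbar = np.append(x, 1.0)
    while not isinstance(node, int):      # internal nodes are (w, left, right)
        w, left, right = node
        node = right if w @ xbar > 0 else left
    return node

# k = 3 classes over R^2; W's rows score each class, the tree routes by halfspaces.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [-1.0, -1.0, 0.0]])
tree = (np.array([1.0, 0.0, 0.0]),            # root: is x_0 > 0 ?
        (np.array([0.0, 1.0, 0.0]), 0, 1),    # left subtree: is x_1 > 0 ?
        2)                                    # right child: leaf with label 2
print(msvm_predict(W, np.array([2.0, 0.0])))      # -> 0
print(tree_predict(tree, np.array([-1.0, 1.0])))  # -> 1
```

Note that both predictors are fully determined by their parameters; which parameters a training procedure would actually select is exactly what the analysis below abstracts away.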
More precisely, each method corresponds to a hypothesis class, H, which contains the multiclass predictors that may be returned by the method. For example, the hypothesis class of MSVM is H = {x ↦ argmax_{i∈[k]} (W x̄)_i : W ∈ Rk×(d+1)}.

A learning algorithm, A, receives a training set, S = {(xi, yi)}_{i=1}^m, sampled i.i.d. according to D, and returns a multiclass predictor which we denote by A(S) ∈ H. A learning algorithm is called an Empirical Risk Minimizer (ERM) if it returns a hypothesis in H that minimizes the empirical error on the sample. We denote by h⋆ a hypothesis in H with minimal error,1 that is, h⋆ ∈ argmin_{h∈H} Err(h).

When analyzing the error of A(S), it is convenient to decompose this error as a sum of approximation error and estimation error:

Err(A(S)) = Err(h⋆) + ( Err(A(S)) − Err(h⋆) ) ,    (1)

where the first term is the approximation error and the second is the estimation error.

• The approximation error is the minimum error achievable by a predictor in the hypothesis class, H. The approximation error does not depend on the sample size, and is determined solely by the allowed hypothesis class2.

• The estimation error of an algorithm is the difference between the approximation error and the error of the classifier the algorithm chose based on the sample. This error exists both for statistical reasons, since the sample may not be large enough to determine the best hypothesis, and for algorithmic reasons, since the learning algorithm may not output the best possible hypothesis given the sample. For the ERM algorithm, the estimation error can be bounded from above by order of √(C(H)/m), where C(H) is a complexity measure of H (analogous to the VC dimension) and m is the sample size. A similar term also bounds the estimation error from below for any algorithm. Thus C(H) is an estimate of the best achievable estimation error for the class.

When studying the estimation error of different methods, we follow the standard distribution-free analysis. Namely, we will compare the algorithms based on the worst-case estimation error, where worst-case is over all possible distributions D. Such an analysis can lead us to the following type of conclusion: If two hypothesis classes have roughly the same complexity, C(H1) ≈ C(H2), and the number of available training examples is significantly larger than this value of complexity, then for both hypothesis classes we are going to have a small estimation error. Hence, in this case the difference in prediction performance between the two methods will be dominated by the approximation error and by the success of the learning algorithm in approaching the best possible estimation error. In our discussion below we disregard possible differences in optimality which stem from algorithmic aspects and implementation details. A rigorous comparison of training heuristics would certainly be of interest and is left to future work.

For the approximation error we will provide even stronger results, by comparing the approximation error of classes for any distribution. We rely on the following definition.

1For simplicity, we assume that the minimum is attainable.
2Note that, when comparing different hypothesis classes over the same distribution, the Bayes error is constant. Thus, in the definition of approximation error, we do not subtract the Bayes error.

Definition 1.1. Given two hypothesis classes, H, H′, we say that H essentially contains H′ if for any distribution, the approximation error of H is at most the approximation error of H′.
H strictly contains H′ if, in addition, there is a distribution for which the approximation error of H is strictly smaller than that of H′.

Our main findings are as follows (see a full comparison in Table 1). The formal statements are given in Section 3.

• The estimation errors of OvA, MSVM, and TC are all roughly the same, in the sense that C(H) = Θ̃(dk) for all of the corresponding hypothesis classes. The complexity of AP is Θ̃(dk²). The complexity of ECOC with a code of length l and code-distance δ is at most Õ(dl) and at least dδ/2. It follows that for randomly generated codes, C(H) = Θ̃(dl). Note that this analysis shows that a larger code-distance yields a larger estimation error and might therefore hurt performance. This contrasts with previous "reduction-based" analyses of ECOC, which concluded that a larger code distance improves performance.

• We prove that the hypothesis class of MSVM essentially contains the hypothesis classes of both OvA and TC. Moreover, these inclusions are strict. Since the estimation errors of these three methods are roughly the same, it follows that the MSVM method dominates both OvA and TC in terms of achievable prediction performance.

• In the TC method, one needs to associate each leaf of the tree to a label. If no prior knowledge on how to break the symmetry is available, it is suggested in Beygelzimer et al. [2007] to break symmetry by choosing a random permutation of the labels. We show that whenever d ≪ k, for any distribution D, with high probability over the choice of a random permutation, the approximation error of the resulting tree would be close to 1/2. It follows that a random choice of a permutation is likely to yield a poor predictor.

• We show that if d ≪ k, for any distribution D, the approximation error of ECOC with a randomly generated code matrix is likely to be close to 1/2.

• We show that the hypothesis class of AP essentially contains the hypothesis class of MSVM (hence also that of OvA and TC), and that there can be a substantial gap in the containment. Therefore, as expected, the relative performance of AP and MSVM depends on the well-known trade-off between estimation error and approximation error.

                      TC                 OvA      MSVM     AP        random ECOC
Estimation error      dk                 dk       dk       dk²       dl
Approximation error   ≥ MSVM;            ≥ MSVM   ≥ AP     smallest  incomparable;
                      ≈ 1/2 when d ≪ k                               ≈ 1/2 when d ≪ k
Testing run-time      d log(k)           dk       dk       dk²       dl

Table 1: Summary of comparison

The above findings suggest that in terms of performance, it may be wiser to choose MSVM over OvA and TC, and especially so when d ≪ k. We note, however, that in some situations (e.g. d = k) the prediction success of these methods can be similar, while TC has the advantage of having a testing run-time of d log(k), compared to the testing run-time of dk for OvA and MSVM. In addition, TC and ECOC may be a good choice when there is additional prior knowledge on the distribution or on how to break symmetry between the different labels.

1.1 Related work

Allwein et al. [2000] analyzed the multiclass error of ECOC as a function of the binary error. The problem with such a "reduction-based" analysis is that it breaks down if the underlying binary problems are very hard. Indeed, our analysis reveals that the underlying binary problems would be too hard if d ≪ k and the code is randomly generated. The experiments in Allwein et al.
[2000] show that when using kernel-based SVM or AdaBoost as the underlying classifier, OvA is inferior to random ECOC. However, in their experiments, the number of classes is small relative to the dimension of the feature space, especially when working with kernels or with combinations of weak learners.

Crammer and Singer [2001] presented experiments demonstrating that MSVM outperforms OvA on several data sets. Rifkin and Klautau [2004] criticized the experiments of Crammer and Singer [2001] and Allwein et al. [2000], and presented another set of experiments demonstrating that all methods perform roughly the same when the underlying binary classifier is very strong (SVM with a Gaussian kernel). As our analysis shows, it is not surprising that with enough data and powerful binary classifiers, all methods should perform well. However, in many practical applications, we will prefer not to employ kernels (either because of a shortage of examples, which might lead to a large estimation error, or due to computational constraints), and in such cases we expect to see a large difference between the methods.

Beygelzimer et al. [2007] analyzed the regret of a specific training method for trees, called the Filter Tree, as a function of the regret of the binary classifier. The regret is defined to be the difference between the error of the learned classifier and that of the Bayes-optimal classifier for the problem. Here again we show that the regret values of the underlying binary classifiers are likely to be very large whenever d ≪ k and the leaves of the tree are associated to labels in a random way. Thus in this case the regret analysis is problematic. Several authors presented ways to learn better splits, which correspond to learning the association of leaves to labels (see for example Bengio et al. [2011] and the references therein).
Some of our negative results do not hold for such methods, as these do not randomly attach labels to tree leaves.

Daniely et al. [2011] analyzed the properties of multiclass learning with various ERM learners, and also provided some bounds on the estimation error of multiclass SVM and of trees. In this paper we improve these bounds, derive new bounds for other classes, and analyze the approximation error of the classes as well.

2 Definitions and Preliminaries

We first formally define the hypothesis classes that we analyze in this paper.

Multiclass SVM (MSVM): For W ∈ Rk×(d+1) define hW : Rd → [k] by hW(x) = argmax_{i∈[k]} (W x̄)_i and let L = {hW : W ∈ Rk×(d+1)}. Though NP-hard in general, solving the ERM problem with respect to L can be done efficiently in the realizable case (namely, whenever there exists a hypothesis with zero empirical error on the sample).

Tree-based classifiers (TC): A tree-based multiclass classifier is a full binary tree whose leaves are associated with class labels and whose internal nodes are associated with binary classifiers. To classify an instance, we start with the root node and apply the binary classifier associated with it. If the prediction is 1 we traverse to the right child. Otherwise, we traverse to the left child. This process continues until we reach a leaf, and then we output the label associated with the leaf. Formally, a tree for k classes is a full binary tree T together with a bijection λ : leaf(T) → [k], which associates a label to each of the leaves. We usually identify T with the pair (T, λ). The set of internal nodes of T is denoted by N(T). Let H ⊂ {±1}^X be a binary hypothesis class. Given a mapping C : N(T) → H, define a multiclass predictor, hC : X → [k], by setting hC(x) = λ(v) where v is the last node of the root-to-leaf path v1, . . . ,
vm = v such that vi+1 is the left (resp. right) child of vi if C(vi)(x) = −1 (resp. C(vi)(x) = 1). Let HT = {hC | C : N(T) → H}. Also, let Htrees = ∪_{T is a tree for k classes} HT. If H is the class of linear separators over Rd, then for any tree T the ERM problem with respect to HT can be solved efficiently in the realizable case. However, the ERM problem is NP-hard in the non-realizable case.

Error Correcting Output Codes (ECOC): An ECOC is a code M ∈ Rk×l along with a bijection λ : [k] → [k]. We sometimes identify λ with the identity function and M with (M, λ)3. Given a code M, and the result of l binary classifiers represented by a vector u ∈ {−1, 1}^l, the code selects a label via M̃ : {−1, 1}^l → [k], defined by M̃(u) = λ(argmax_{i∈[k]} Σ_{j=1}^l M_{ij} u_j). Given binary classifiers h1, . . . , hl for each column in the code matrix, the code assigns to the instance x ∈ X the label M̃(h1(x), . . . , hl(x)). Let H ⊂ {±1}^X be a binary hypothesis class. Denote by HM ⊆ [k]^X the hypothesis class HM = {h : X → [k] | ∃(h1, . . . , hl) ∈ H^l s.t. ∀x ∈ X, h(x) = M̃(h1(x), . . . , hl(x))}. The OvA and AP methods are special cases of ECOC: OvA corresponds to the k × k code with +1 on the diagonal and −1 elsewhere, and AP to the k × (k choose 2) code in which the column associated with a pair i < j has +1 in row i, −1 in row j, and 0 elsewhere. We denote the corresponding hypothesis classes by HOvA and HAP.

The distance of a binary code, denoted by δ(M) for M ∈ {±1}^{k×l}, is the minimal Hamming distance between any two rows in the code matrix. Formally, the Hamming distance between u, v ∈ {−1, +1}^l is Δh(u, v) = |{r : u[r] ≠ v[r]}|, and δ(M) = min_{1≤i<j≤k} Δh(Mi, Mj), where Mi denotes the i'th row of M.

3The use of λ here allows us to later consider codes with random association of rows to labels.

The sample complexity of a learning algorithm A is the function mA defined as follows: for ε, δ > 0, mA(ε, δ) is the smallest integer such that for every m ≥ mA(ε, δ) and every distribution D on X × Y, with probability of > 1 − δ over the choice of an i.i.d. sample S of size m,

Err(A(S)) ≤ min_{h∈H} Err(h) + ε .    (2)

The first term on the right-hand side is the approximation error of H. Therefore, the sample complexity is the number of examples required to ensure that the estimation error of A is at most ε (with high probability). We denote the sample complexity of a class H by mH(ε, δ) = inf_A mA(ε, δ), where the infimum is taken over all learning algorithms.

To bound the sample complexity of a hypothesis class we rely on upper and lower bounds on the sample complexity in terms of two generalizations of the VC dimension for multiclass problems, called the Graph dimension and the Natarajan dimension and denoted dG(H) and dN(H). For completeness, these dimensions are formally defined in the appendix.

Theorem 2.1. (Daniely et al. [2011]) For every hypothesis class H, and for every ERM rule,

Ω( (dN(H) + ln(1/δ)) / ε² ) ≤ mH(ε, δ) ≤ mERM(ε, δ) ≤ O( (min{dN(H) ln(|Y|), dG(H)} + ln(1/δ)) / ε² ).

We note that the constants in the O, Ω notations are universal.

3 Main Results

In Section 3.1 we analyze the sample complexity of the different hypothesis classes. We provide lower bounds on the Natarajan dimensions of the various hypothesis classes, thus concluding, in light of Theorem 2.1, a lower bound on the sample complexity of any algorithm. We also provide upper bounds on the graph dimensions of these hypothesis classes, yielding, by the same theorem, an upper bound on the estimation error of ERM. In Section 3.2 we analyze the approximation error of the different hypothesis classes.

3.1 Sample Complexity

Together with Theorem 2.1, the following theorems estimate, up to logarithmic factors, the sample complexity of the classes under consideration.
We note that these theorems support the rule of thumb that the Natarajan and Graph dimensions are of the same order as the number of parameters. The first theorem shows that the sample complexity of MSVM is Θ̃(dk).

Theorem 3.1. d(k − 1) ≤ dN(L) ≤ dG(L) ≤ O(dk log(dk)).

Next, we analyze the sample complexities of TC and ECOC. These methods rely on an underlying hypothesis class of binary classifiers. While our main focus is the case in which the binary hypothesis class is halfspaces over Rd, the upper bounds on the sample complexity we derive below hold for any binary hypothesis class of VC dimension d + 1.

Theorem 3.2. For every binary hypothesis class of VC dimension d + 1, and for any tree T, dG(HT) ≤ dG(Htrees) ≤ O(dk log(dk)). If the underlying hypothesis class is halfspaces over Rd, then also

d(k − 1) ≤ dN(HT) ≤ dG(HT) ≤ dG(Htrees) ≤ O(dk log(dk)).

Theorems 3.1 and 3.2 improve results from Daniely et al. [2011], where it was shown that ⌊d/2⌋⌊k/2⌋ ≤ dN(L) ≤ O(dk log(dk)), and that for every tree dG(HT) ≤ O(dk log(dk)). Further, it was shown that if H is the set of halfspaces over Rd, then Ω(dk / log(k)) ≤ dN(HT).

We next turn to results for ECOC, and its special cases OvA and AP.

Theorem 3.3. For every M ∈ Rk×l and every binary hypothesis class of VC dimension d, dG(HM) ≤ O(dl log(dl)). Moreover, if M ∈ {±1}^{k×l} and the underlying hypothesis class is halfspaces over Rd, then

d · δ(M)/2 ≤ dN(HM) ≤ dG(HM) ≤ O(dl log(dl)) .

We note that if the code has a large distance, which is the case, for instance, in random codes, then δ(M) = Ω(l). In this case, the bound is tight up to logarithmic factors.

Theorem 3.4.
For any binary hypothesis class of VC dimension d, dG(HOvA) ≤ O(dk log(dk)) and dG(HAP) ≤ O(dk² log(dk)). If the underlying hypothesis class is halfspaces over Rd, we also have:

d(k − 1) ≤ dN(HOvA) ≤ dG(HOvA) ≤ O(dk log(dk))

and

d · (k−1 choose 2) ≤ dN(HAP) ≤ dG(HAP) ≤ O(dk² log(dk)).

3.2 Approximation error

We first show that the class L essentially contains HOvA and HT for any tree T, assuming, of course, that H is the class of halfspaces in Rd. We find this result quite surprising, since the sample complexity of all of these classes is of the same order.

Theorem 3.5. L essentially contains Htrees and HOvA. These inclusions are strict for d ≥ 2 and k ≥ 3.

One might suggest that a small increase in the dimension would perhaps allow us to embed L in HT for some tree T, or in HOvA. The next result shows that this is not the case.

Theorem 3.6. Any embedding into a higher dimension that allows HOvA or HT (for some tree T for k classes) to essentially contain L necessarily embeds into a dimension of at least Ω̃(dk).

The next theorem shows that the approximation error of AP is better than that of MSVM (and hence also better than that of OvA and TC). This is expected, as the sample complexity of AP is considerably higher, and therefore we face the usual trade-off between approximation and estimation error.

Theorem 3.7. HAP essentially contains L. Moreover, there is a constant k* > 0, independent of d, such that the inclusion is strict for all k ≥ k*.

For a random ECOC of length o(k), it is easy to see that it does not contain MSVM, as MSVM has higher complexity.
It is also not contained in MSVM, as it generates non-convex regions of labels.

We next derive absolute lower bounds on the approximation errors of ECOC and TC when d ≪ k. Recall that both methods are built upon binary classifiers that should predict h(x) = 1 if the label of x is in L, for some L ⊂ [k], and should predict h(x) = −1 if the label of x is not in L. As the following lemma shows, when the partition of [k] into the two sets L and [k] \ L is arbitrary and balanced, and k ≫ d, such binary classifiers will almost always perform very poorly.

Lemma 3.8. There exists a constant C > 0 for which the following holds. Let H ⊆ {±1}^X be any hypothesis class of VC-dimension d, let µ ∈ (0, 1/2], and let D be any distribution over X × [k] such that ∀i, P_{(x,y)∼D}(y = i) ≤ 10/k. Let φ : [k] → {±1} be a randomly chosen function which is sampled according to one of the following rules: (1) For each i ∈ [k], each coordinate φ(i) is chosen independently from the other coordinates and P(φ(i) = −1) = µ; or (2) φ is chosen uniformly among all functions satisfying |{i ∈ [k] : φ(i) = −1}| = µk.

Let Dφ be the distribution over X × {±1} obtained by drawing (x, y) according to D and replacing it with (x, φ(y)). Then, for any ν > 0, if k ≥ C · (d + ln(1/δ)) / ν², then with probability of at least 1 − δ over the choice of φ, the approximation error of H with respect to Dφ will be at least µ − ν.

As the corollaries below show, Lemma 3.8 entails that when k ≫ d, both random ECOCs with a small code length, and balanced trees with a random labeling of the leaves, are expected to perform very poorly.

Corollary 3.9. There is a constant C > 0 for which the following holds. Let (T, λ) be a tree for k classes such that λ : leaf(T) → [k] is chosen uniformly at random. Denote by kL and kR the number of leaves of the left and right sub-trees (respectively) that descend from the root, and let µ = min{kL/k, kR/k}. Let H ⊆ {±1}^X be a hypothesis class of VC-dimension d, let ν > 0, and let D be any distribution over X × [k] such that ∀i, P_{(x,y)∼D}(y = i) ≤ 10/k. Then, for k ≥ C · (d + ln(1/δ)) / ν², with probability of at least 1 − δ over the choice of λ, the approximation error of HT with respect to D is at least µ − ν.

Corollary 3.10. There is a constant C > 0 for which the following holds. Let (M, λ) be an ECOC where M ∈ Rk×l, and assume that the bijection λ : [k] → [k] is chosen uniformly at random. Let H ⊆ {±1}^X be a hypothesis class of VC-dimension d, let ν > 0, and let D be any distribution over X × [k] such that ∀i, P_{(x,y)∼D}(y = i) ≤ 10/k. Then, for k ≥ C · (dl log(dl) + ln(1/δ)) / ν², with probability of at least 1 − δ over the choice of λ, the approximation error of HM with respect to D is at least 1/2 − ν.

Note that the first corollary holds even if only the top level of the binary tree is balanced and splits the labels randomly to the left and the right sub-trees. The second corollary holds even if the code itself is not random (nor does it have to be binary), and only the association of rows with labels is random. In particular, if the length of the code is O(log(k)), as suggested in Allwein et al. [2000], and the number of classes is Ω̃(d), then the code is expected to perform poorly.

For an ECOC with a matrix of length Ω(k) and d = o(k), we do not have such a negative result as stated in Corollary 3.10.
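The phenomenon behind Lemma 3.8 can be checked numerically in the simplest possible setting: threshold classifiers on the line (VC dimension 1), with each of k classes concentrated on its own point and a uniformly random ±1 label mapping (µ = 1/2). The snippet below is a hypothetical sanity check of ours, not an experiment from the paper:

```python
import numpy as np

def best_threshold_error(labels):
    """Exact minimal error of a 1-D threshold classifier over k equally
    weighted points with labels in {-1, +1}: minimize over every cut
    position and both polarities."""
    y = np.asarray(labels, dtype=float)
    k = len(y)
    prefix = np.concatenate([[0.0], np.cumsum(y)])  # prefix[t] = sum of first t labels
    total = prefix[-1]
    # |total - 2*prefix[t]| = (#correct - #wrong) for the better polarity at cut t
    advantage = np.abs(total - 2.0 * prefix).max()
    return 0.5 - advantage / (2.0 * k)

# A structured label mapping is trivially realizable by one threshold...
assert best_threshold_error([-1] * 5 + [1] * 5) == 0.0

# ...but under a random mapping with k large, even the best threshold
# barely beats random guessing, as Lemma 3.8 predicts for k >> d.
rng = np.random.default_rng(0)
k = 2000
phi = rng.choice([-1, 1], size=k)   # random label mapping, mu = 1/2
err = best_threshold_error(phi)
assert 0.3 < err < 0.5
```

The best threshold's advantage over guessing on a random ±1 coloring of k points scales like O(√k)/k, which is consistent with the µ − ν lower bound of the lemma for µ = 1/2.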
Nonetheless, Lemma 3.8 implies that the prediction of the binary classifiers when d = o(k) is just slightly better than a random guess, which seems to indicate that the ECOC method will still perform poorly. Moreover, most current theoretical analyses of ECOC estimate the error of the learned multiclass hypothesis in terms of the average error of the binary classifiers. Alas, when the number of classes is large, Lemma 3.8 shows that this average will be close to 1/2.

Finally, let us briefly discuss the tightness of Lemma 3.8. Let x1, . . . , xd+1 ∈ Rd be affinely independent and let D be the distribution over Rd × [d+1] defined by P_{(x,y)∼D}((x, y) = (xi, i)) = 1/(d+1). It is not hard to see that for every φ : [d + 1] → {±1}, the approximation error of the class of halfspaces with respect to Dφ is zero. Thus, in order to ensure a large approximation error for every distribution, the number of classes must be at least linear in the dimension, so in this sense the lemma is tight. Yet, this example is very simple, since each class is concentrated on a single point and the points are linearly independent. It is possible that in real-world distributions, a large approximation error will be exhibited even when k < d.

We note that the phenomenon of a large approximation error, described in Corollaries 3.9 and 3.10, does not reproduce in the classes L, HOvA and HAP, since these classes are symmetric.

4 Proof Techniques

Due to lack of space, the proofs for all the results stated above are provided in the appendix. In this section we give a brief description of our main proof techniques.

Most of our proofs for the estimation error results, stated in Section 3.1, are based on a similar method, which we now describe. Let L : {±1}^l → [k] be a multiclass-to-binary reduction (e.g., a tree), and for H ⊆ {±1}^X denote L(H) = {x ↦ L(h1(x), . . . , hl(x)) | h1, . . . , hl ∈ H}. Our upper bounds for dG(L(H)) are mostly based on the following simple lemma.

Lemma 4.1. If VC(H) = d then dG(L(H)) = O(ld ln(ld)).

The technique for the lower bound on dN(L(W)) when W is the class of halfspaces in Rd is more involved, and quite general. We consider a binary hypothesis class G ⊆ {±1}^{[d]×[l]} which consists of functions having an arbitrary behaviour over [d] × {i}, and a very uniform behaviour on other inputs (such as mapping all other inputs to a constant). We show that L(G) N-shatters the set [d] × [l]. Since G is quite simple, this is usually not very hard to show. Finally, we show that the class of halfspaces is richer than G, in the sense that the inputs to G can be mapped to points in Rd such that the functions of G can be mapped to halfspaces. We conclude that dN(L(W)) ≥ dN(L(G)).

To prove the approximation error lower bounds stated in Section 3.2, we use the techniques of VC theory in an unconventional way. The idea of this proof is as follows: Using a uniform convergence argument based on the VC dimension of the binary hypothesis class, we show that there exists a small labeled sample S whose approximation error for the hypothesis class is close to the approximation error for the distribution, for all possible label mappings. This allows us to restrict our attention to a finite set of hypotheses, via their restrictions to the sample. For these hypotheses, we show that with high probability over the choice of label mapping, the approximation error on the sample is high. A union bound over the finite set of possible hypotheses shows that the approximation error on the distribution will be high, with high probability over the choice of the label mapping.

5 Implications

The first immediate implication of our results is that whenever the number of examples in the training set is Ω̃(dk), MSVM should be preferred to OvA and TC.
This is certainly true if the hypothesis class of MSVM, L, has zero approximation error (the realizable case), since the ERM is then solvable with respect to L. Note that since the inclusions given in Theorem 3.5 are strict, there are cases where the data is realizable with MSVM but not with HOvA, nor with respect to any tree.
In the non-realizable case, implementing the ERM is intractable for all of these methods. Nonetheless, for each method there are reasonable heuristics for approximating the ERM, which should work well when the approximation error is small. Therefore, we believe that MSVM should be the method of choice in this case as well, due to its lower approximation error. However, differences in the quality of the optimization algorithms available for the different hypothesis classes should also be taken into account in this analysis; we leave a detailed study of specific training heuristics to future work. Our analysis also implies that using TC with a randomly selected λ, or ECOC with a random code, is strongly discouraged whenever k > d. Finally, when the number of examples is much larger than dk2, the analysis implies that the AP approach is the better choice.
To conclude this section, we illustrate the relative performance of MSVM, OvA, TC, and ECOC by considering the simplistic case where d = 2 and each class is concentrated on a single point in R2. In the leftmost graph below, there are two classes in R2, and the approximation error of all the methods is zero. In the middle graph, there are nine classes placed on the unit circle in R2; here, both MSVM and OvA have zero approximation error, but the error of TC and of ECOC with a random code will most likely be large. In the rightmost graph, we chose random points in R2. MSVM still has zero approximation error.
However, OvA cannot learn the binary problem of distinguishing between the middle point and the rest of the points, and hence has a larger approximation error.

[Figure: three class configurations in R2 (left: two classes; middle: nine classes on the unit circle; right: random points). A ✓ marks zero approximation error in that configuration and a ✗ marks a large one:]

MSVM     ✓ ✓ ✓
OvA      ✓ ✓ ✗
TC/ECOC  ✓ ✗ ✗

Acknowledgements: Shai Shalev-Shwartz was supported by the John S. Cohen Senior Lectureship in Computer Science. Amit Daniely is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship.

References

E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.
S. Ben-David, N. Cesa-Bianchi, D. Haussler, and P. Long. Characterizations of learnability for classes of {0, . . . , n}-valued functions. Journal of Computer and System Sciences, 50:74–86, 1995.
S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In NIPS, 2011.
A. Beygelzimer, J. Langford, and P. Ravikumar. Multiclass classification with filter trees. Preprint, June 2007.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz. Multiclass learnability and the ERM principle. In COLT, 2011.
T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, January 1995.
T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451–471, 1998.
R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing – Explorations in the Microstructure of Cognition, chapter 8, pages 318–362. MIT Press, 1986.
G. Takacs. Convex polyhedron learning and its applications. PhD thesis, Budapest University of Technology and Economics, 2009.
V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks, April 1999.