Multiclass Learning by Probabilistic Embeddings

Ofer Dekel and Yoram Singer
School of Computer Science & Engineering
The Hebrew University, Jerusalem 91904, Israel
{oferd,singer}@cs.huji.ac.il

Advances in Neural Information Processing Systems, pages 969-1000.

Abstract

We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.

1 Introduction

The focus of this paper is supervised learning from multiclass data. In multiclass problems the goal is to learn a classifier that accurately assigns labels to instances, where the set of labels is of finite cardinality and contains more than two elements. Many machine learning applications employ a multiclass categorization stage. Notable examples are document classification, spoken dialog categorization, optical character recognition (OCR), and part-of-speech tagging. 
Dietterich and Bakiri [6] proposed a technique based on error correcting output coding (ECOC) as a means of reducing a multiclass classification problem to several binary classification problems and then solving each binary problem individually to obtain a multiclass classifier. More recent work of Allwein et al. [1] provided analysis of the empirical and generalization errors of ECOC-based classifiers. In the above papers, as well as in most previous work on ECOC, learning the set of binary classifiers and selecting a particular error correcting code are done independently. An exception is a method based on continuous relaxation of the code [3], in which the code matrix is post-processed once based on the learned binary classifiers.

The inherent decoupling of the learning process from the class representation problem employed by ECOC is both a blessing and a curse. On one hand it offers great flexibility and modularity; on the other hand, the resulting binary learning problems might be unnatural and therefore potentially difficult. We instead describe and analyze an approach that ties the learning problem to the class representation problem. The approach we take views the set of binary classifiers as an embedding of the instance space and the code matrix as an embedding of the label set into a common space. In this common space each instance is assigned the label from which its divergence is smallest. To construct these embeddings, we introduce the notion of probabilistic output codes. We then describe an algorithm that constructs the label and instance embeddings such that the resulting classifier achieves a small empirical error. The result is a paradigm that includes ECOC as a special case.

The algorithm we describe, termed Bunching, alternates between two steps. 
One step improves the embedding of the instance space into the common space while keeping the embedding of the label set fixed. This step is analogous to the learning stage of the ECOC technique, where a set of binary classifiers is learned with respect to a predefined code. The second step complements the first by updating the label embedding while keeping the instance embedding fixed. The two alternating steps resemble the steps performed by the EM algorithm [5] and by Alternating Minimization [4]. The techniques we use in the design and analysis of the Bunching algorithm also build on recent results in classification learning using Bregman divergences [8, 2].

The paper is organized as follows. In the next section we give a formal description of the multiclass learning problem and of our classification setting. In Sec. 3 we give an alternative view of ECOC which naturally leads to the definition of probabilistic output codes, presented in Sec. 4. In Sec. 5 we cast our learning problem as the minimization of a continuous objective function, and in Sec. 6 we present the Bunching algorithm. We describe experimental results that demonstrate the merits of our approach in Sec. 7 and conclude in Sec. 8.

2 Problem Setting

Let $\mathcal{X}$ be a domain of instance encodings from $\mathbb{R}^m$ and let $\mathcal{Y}$ be a set of $r$ labels that can be assigned to each instance from $\mathcal{X}$. Given a training set of instance-label pairs $S = \{(x_j, y_j)\}_{j=1}^n$ such that each $x_j$ is in $\mathcal{X}$ and each $y_j$ is in $\mathcal{Y}$, we are faced with the problem of learning a classification function that predicts the labels of instances from $\mathcal{X}$. This problem is often referred to as multiclass learning. In other multiclass problem settings it is common to encode the set $\mathcal{Y}$ as the integers $\{1, \ldots, r\}$; however, in our setting it will prove useful to assume that the labels are encoded as the set of $r$ standard unit vectors in $\mathbb{R}^r$. 
That is, the $i$'th label in $\mathcal{Y}$ is encoded by the vector whose $i$'th component is set to 1 and all of whose other components are set to 0.

The classification functions we study in this paper are composed of a pair of embeddings from the spaces $\mathcal{X}$ and $\mathcal{Y}$ into a common space $\mathcal{Z}$, and a measure of divergence between vectors in $\mathcal{Z}$. That is, given an instance $x \in \mathcal{X}$, we embed it into $\mathcal{Z}$ along with all of the label vectors in $\mathcal{Y}$ and predict the label that $x$ is closest to in $\mathcal{Z}$. The measure of distance between vectors in $\mathcal{Z}$ builds upon the definitions given below.

[Figure 1: An illustration of the embedding model used.]

The logistic transformation $\sigma : \mathbb{R}^s \to (0,1)^s$ is defined as
$$\sigma_k(\omega) = (1 + e^{-\omega_k})^{-1}, \quad k = 1, \ldots, s .$$
The entropy of a multivariate Bernoulli random variable with parameter $p \in [0,1]^s$ is
$$H[p] = -\sum_{k=1}^{s} \left[ p_k \log(p_k) + (1 - p_k) \log(1 - p_k) \right] .$$
The Kullback-Leibler (KL) divergence between a pair of multivariate Bernoulli random variables with respective parameters $p, q \in [0,1]^s$ is
$$D[p \,\|\, q] = \sum_{k=1}^{s} \left[ p_k \log\!\left(\frac{p_k}{q_k}\right) + (1 - p_k) \log\!\left(\frac{1 - p_k}{1 - q_k}\right) \right] . \quad (1)$$

Returning to our method of classification, let $s$ be some positive integer and let $\mathcal{Z}$ denote the space $[0,1]^s$. Given any two linear mappings $T : \mathbb{R}^m \to \mathbb{R}^s$ and $C : \mathbb{R}^r \to \mathbb{R}^s$, where $T$ is given as a matrix in $\mathbb{R}^{s \times m}$ and $C$ as a matrix in $\mathbb{R}^{s \times r}$, instances from $\mathcal{X}$ are embedded into $\mathcal{Z}$ by $\sigma(Tx)$ and labels from $\mathcal{Y}$ are embedded into $\mathcal{Z}$ by $\sigma(Cy)$. An illustration of the two embeddings is given in Fig. 1.

We define the divergence between any two points $z_1, z_2 \in \mathcal{Z}$ as the sum of the KL-divergence between them and the entropy of $z_1$, namely $D[z_1 \,\|\, z_2] + H[z_1]$. 
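The three definitions above can be sketched directly in NumPy. This is a minimal illustration, not code from the paper; the function names are ours, and $0 \log 0$ is treated as 0 as the paper later stipulates.

```python
import numpy as np

def logistic(w):
    """Component-wise logistic transformation sigma: R^s -> (0,1)^s."""
    return 1.0 / (1.0 + np.exp(-np.asarray(w, dtype=float)))

def bernoulli_entropy(p):
    """Entropy H[p] of a multivariate Bernoulli with parameter p in [0,1]^s."""
    p = np.asarray(p, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = p * np.log(p)            # 0 * log(0) yields NaN ...
        t2 = (1 - p) * np.log(1 - p)
    return -(np.nansum(t1) + np.nansum(t2))  # ... which nansum treats as 0

def bernoulli_kl(p, q):
    """KL divergence D[p || q] of Eq. (1), between Bernoulli parameters."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = p * np.log(p / q)
        t2 = (1 - p) * np.log((1 - p) / (1 - q))
    return np.nansum(t1) + np.nansum(t2)
```

For instance, `bernoulli_kl(p, p)` is 0 for any `p`, and `bernoulli_entropy` of a deterministic vector (all components 0 or 1) is 0, which is exactly the case in which the divergence defined above reduces to the plain KL divergence.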
We now define the loss $\ell$ of each instance-label pair as the divergence of their respective images,
$$\ell(x, y \,|\, C, T) = D[\sigma(Cy) \,\|\, \sigma(Tx)] + H[\sigma(Cy)] . \quad (2)$$
This loss is clearly non-negative and is zero iff $x$ and $y$ are embedded to the same point in $\mathcal{Z}$ and the entropy of this point is zero. $\ell$ is our means of classifying new instances: given a new instance $x$ we predict its label to be $\hat{y}$ where
$$\hat{y} = \arg\min_{y \in \mathcal{Y}} \, \ell(x, y \,|\, C, T) . \quad (3)$$
For brevity, we restrict ourselves to the case where only a single label attains the minimum loss, and our classifier is thus always well defined. We point out that our analysis is still valid when this constraint is relaxed. We call the loss over the entire training set $S$ the empirical loss and use the notation
$$L(S \,|\, C, T) = \sum_{(x,y) \in S} \ell(x, y \,|\, C, T) . \quad (4)$$
Our goal is to learn a good multiclass prediction function by finding a pair $(C, T)$ that attains a small empirical loss. As we show in the sequel, the rationale behind this choice of empirical loss lies in the fact that it bounds the (discrete) empirical classification error attained by the classification function.

3 An Alternative View of Error Correcting Output Codes

The technique of ECOC uses error correcting codes to reduce an $r$-class classification problem to multiple binary problems. Each binary problem is then learned independently via an external binary learning algorithm, and the learned binary classifiers are combined into one $r$-class classifier. We begin by giving a brief overview of ECOC for the case where the binary learning algorithm used is a logistic regressor.

A binary output code $C$ is a matrix in $\{0,1\}^{s \times r}$ where each of $C$'s columns is an $s$-bit code word that corresponds to a label in $\mathcal{Y}$. Recall that the set of labels $\mathcal{Y}$ is assumed to be the standard unit vectors in $\mathbb{R}^r$. 
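The loss of Eq. (2) and the prediction rule of Eq. (3) admit a compact sketch, using the algebraic identity $D[p\|q] + H[p] = -\sum_k [\, p_k \log q_k + (1-p_k) \log(1-q_k)\,]$ (the entropy of $p$ cancels the $p \log p$ terms of the KL divergence). The code below is an illustrative NumPy rendering under that simplification; the function names are ours.

```python
import numpy as np

def loss(x, y, C, T):
    """ell(x, y | C, T) = D[sigma(Cy) || sigma(Tx)] + H[sigma(Cy)], Eq. (2),
    written as -sum_k [p_k log q_k + (1-p_k) log(1-q_k)]."""
    p = 1.0 / (1.0 + np.exp(-(C @ y)))   # embedded label, sigma(Cy)
    q = 1.0 / (1.0 + np.exp(-(T @ x)))   # embedded instance, sigma(Tx)
    return -np.sum(p * np.log(q) + (1 - p) * np.log(1 - q))

def predict(x, C, T, labels):
    """Eq. (3): predict the label whose embedding is nearest to that of x."""
    return min(labels, key=lambda y: loss(x, y, C, T))
```

With `labels` taken as the rows of `np.eye(r)` (the standard unit vectors), `C @ y` simply selects a column of the code matrix, matching the encoding convention of Sec. 2.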
Therefore, the code word corresponding to the label $y$ is simply the product of the matrix $C$ and the vector $y$, namely $Cy$. The distance $\rho$ of a code $C$ is defined as the minimal Hamming distance between any two code words; formally,
$$\rho(C) = \min_{i \neq j} \sum_{k=1}^{s} C_{k,i}(1 - C_{k,j}) + C_{k,j}(1 - C_{k,i}) .$$
For any $k \in \{1, \ldots, s\}$, the $k$'th row of $C$, denoted henceforth by $C_k$, defines a partition of the set of labels $\mathcal{Y}$ into two disjoint subsets: the first subset constitutes the labels for which $C_k \cdot y = 0$ (i.e., the set of labels in $\mathcal{Y}$ which are mapped according to $C_k$ to the binary label 0) and the second the labels for which $C_k \cdot y = 1$. Thus, each $C_k$ induces a binary classification problem from the original multiclass problem. Formally, we construct for each $k$ a binary-labeled sample $S_k = \{(x_j, C_k \cdot y_j)\}_{j=1}^n$, and for each $S_k$ we learn a binary classification function $T_k : \mathcal{X} \to \mathbb{R}$ using a logistic regression algorithm. That is, for each original instance $x_j$ and induced binary label $C_k \cdot y_j$ we posit a logistic model that estimates the conditional probability that $C_k \cdot y_j$ equals 1 given $x_j$,
$$\Pr[C_k \cdot y_j = 1 \,|\, x_j; T_k] = \sigma(T_k \cdot x_j) . \quad (5)$$
Given a predefined code matrix $C$, the learning task at hand is to find $T_k^\star$ that maximizes the log-likelihood of the labelling given in $S_k$,
$$T_k^\star = \arg\max_{T_k \in \mathbb{R}^m} \sum_{j=1}^{n} \log\left(\Pr[C_k \cdot y_j \,|\, x_j; T_k]\right) . \quad (6)$$
Defining $0 \log 0 = 0$, we can use the logistic estimate in Eq. (5) and the KL-divergence from Eq. (1) to rewrite Eq. (6) as follows:
$$T_k^\star = \arg\min_{T_k \in \mathbb{R}^m} \sum_{j=1}^{n} D[C_k \cdot y_j \,\|\, \sigma(T_k \cdot x_j)] .$$
In words, a good set of binary predictors is found by minimizing the sample-averaged KL-divergence between the binary labels induced by $C$ and the logistic estimates induced by $T_1, \ldots, T_s$. Let $T^\star$ 
be the matrix in $\mathbb{R}^{s \times m}$ constructed by the concatenation of the row vectors $\{T_k^\star\}_{k=1}^s$. For any instance $x \in \mathcal{X}$, $\sigma(T^\star x)$ is a vector of probability estimates that the label of $x$ is 1 for each of the $s$ induced binary problems. We can summarize the learning task defined by the code $C$ as the task of finding a matrix $T^\star$ such that
$$T^\star = \arg\min_{T \in \mathbb{R}^{s \times m}} \sum_{j=1}^{n} D[Cy_j \,\|\, \sigma(Tx_j)] .$$
Given a code matrix $C$ and a transformation $T$ we classify a new instance as follows,
$$\hat{y} = \arg\min_{y \in \mathcal{Y}} D[Cy \,\|\, \sigma(Tx)] . \quad (7)$$
A classification error occurs if the predicted label $\hat{y}$ is different from the correct label $y$. Building on Thm. 1 from Allwein et al. [1], it is straightforward to show that the number of empirical classification errors ($\hat{y} \neq y$) is bounded above by the empirical KL-divergence between the correct code words $Cy$ and the estimated probabilities $\sigma(Tx)$, divided by the code distance,
$$\left| \{ j : \hat{y}_j \neq y_j \}_{j=1}^n \right| \;\leq\; \frac{\sum_{j=1}^n D[Cy_j \,\|\, \sigma(Tx_j)]}{\rho(C)} . \quad (8)$$
This bound is a special case of the bound given below in Thm. 1 for general probabilistic output codes. We therefore defer the discussion of this bound to the following section.

4 Probabilistic Output Codes

We now describe a relaxation of binary output codes by defining the notion of probabilistic output codes. We give a bound on the empirical error attained by a classifier that uses probabilistic output codes, which generalizes the bound in Eq. (8). The rationale for our construction is that the discrete nature of ECOC can potentially induce difficult binary classification problems. In contrast, probabilistic codes induce real-valued problems that may be easier to learn.

Analogous to discrete codes, a probabilistic output code $C$ is a matrix in $\mathbb{R}^{s \times r}$ used in conjunction with the logistic transformation to produce a set of $r$ probability vectors that correspond to the $r$ labels in $\mathcal{Y}$. 
Namely, $C$ maps each label $y \in \mathcal{Y}$ to the probabilistic code word $\sigma(Cy) \in [0,1]^s$. As before, we assume that $\mathcal{Y}$ is the set of $r$ standard unit vectors in $\{0,1\}^r$, and therefore each probabilistic code word is the image of one of $C$'s columns under the logistic transformation. The natural extension of code distance to probabilistic codes is obtained by replacing the Hamming distance with the expected Hamming distance. If for each $y \in \mathcal{Y}$ and $k \in \{1, \ldots, s\}$ we view the $k$'th component of the code word that corresponds to $y$ as a Bernoulli random variable with parameter $p = \sigma_k(Cy)$, then the expected Hamming distance between the code words for classes $i$ and $j$ is
$$\sum_{k=1}^{s} \sigma_k(Cy_i)(1 - \sigma_k(Cy_j)) + \sigma_k(Cy_j)(1 - \sigma_k(Cy_i)) .$$
Analogous to discrete codes, we define the distance $\rho$ of a code $C$ as the minimum expected Hamming distance between all pairs of code words in $C$, that is,
$$\rho(C) = \min_{i \neq j} \sum_{k=1}^{s} \sigma_k(Cy_i)(1 - \sigma_k(Cy_j)) + \sigma_k(Cy_j)(1 - \sigma_k(Cy_i)) .$$
Put another way, we have relaxed the definition of code words from deterministic vectors to multivariate Bernoulli random variables; the matrix $C$ now defines the distributions of these random variables. When $C$'s entries are all $\pm\infty$, the logistic transformation of $C$'s entries defines a deterministic code and the two definitions of $\rho$ coincide.

Given a probabilistic code matrix $C \in \mathbb{R}^{s \times r}$ and a transformation $T \in \mathbb{R}^{s \times m}$, we associate a loss $\ell(x, y \,|\, C, T)$ with each instance-label pair $(x, y)$ using Eq. (2), and we measure the empirical loss over the entire training set $S$ as defined in Eq. (4). We classify new instances by finding the label $\hat{y}$ that attains the smallest loss as defined in Eq. (3). This construction is equivalent to the classification method discussed in Sec. 
2 that employs embeddings, except that instead of viewing $C$ and $T$ as abstract embeddings, $C$ is interpreted as a probabilistic output code and the rows of $T$ are viewed as binary classifiers.

Note that when all of the entries of $C$ are $\pm\infty$, the classification rule from Eq. (3) reduces to the classification rule for ECOC from Eq. (7), since the entropy of $\sigma(Cy)$ is zero for all $y$. We now give a theorem that builds on our construction of probabilistic output codes and relates the classification rule from Eq. (3) to the empirical loss defined by Eq. (4). As noted before, the theorem generalizes the bound given in Eq. (8).

Theorem 1  Let $\mathcal{Y}$ be a set of $r$ vectors in $\mathbb{R}^r$. Let $C \in \mathbb{R}^{s \times r}$ be a probabilistic output code with distance $\rho(C)$ and let $T \in \mathbb{R}^{s \times m}$ be a transformation matrix. Given a sample $S = \{(x_j, y_j)\}_{j=1}^n$ of instance-label pairs where $x_j \in \mathcal{X}$ and $y_j \in \mathcal{Y}$, denote by $L$ the loss on $S$ with respect to $C$ and $T$ as given by Eq. (4), and denote by $\hat{y}_j$ the predicted label of $x_j$ according to the classification rule given in Eq. (3). Then,
$$\left| \{ j : \hat{y}_j \neq y_j \}_{j=1}^n \right| \;\leq\; \frac{L(S \,|\, C, T)}{\rho(C)} .$$
The proof of the theorem is omitted due to lack of space.

5 The Learning Problem

We now discuss how our formalism of probabilistic output codes via embeddings and the accompanying Thm. 1 lead to a learning paradigm in which both $T$ and $C$ are found concurrently. Thm. 1 implies that the empirical error over $S$ can be reduced by minimizing the empirical loss over $S$ while maintaining a large distance $\rho(C)$. A naive modification of $C$ so as to minimize the loss may result in a probabilistic code whose distance is undesirably small. Therefore, we assume that we are initially provided with a fixed reference matrix $C_0 \in \mathbb{R}^{s \times r}$ that is known to have a large code distance. 
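The probabilistic code distance $\rho(C)$ defined above can be computed by enumerating all pairs of code words. The following is an illustrative sketch (NumPy; the function name is ours, not from the paper).

```python
import numpy as np
from itertools import combinations

def code_distance(C):
    """rho(C): minimum expected Hamming distance between any pair of
    probabilistic code words sigma(C y_i), sigma(C y_j)."""
    P = 1.0 / (1.0 + np.exp(-np.asarray(C, dtype=float)))  # column i = code word of label i
    s, r = P.shape
    rho = np.inf
    for i, j in combinations(range(r), 2):
        pi, pj = P[:, i], P[:, j]
        rho = min(rho, float(np.sum(pi * (1 - pj) + pj * (1 - pi))))
    return rho
```

As a sanity check, for a (numerically) deterministic one-vs-rest code, whose code words are the standard unit vectors, every pair of code words differs in exactly two bits, so the distance is 2, matching the discrete definition of $\rho$.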
We now require that the learned matrix $C$ remain relatively close to $C_0$ (in a sense defined shortly) throughout the learning procedure. Rather than requiring that $C$ stay within a fixed distance of $C_0$, we add a penalty proportional to the distance between $C$ and $C_0$ to the loss defined in Eq. (4). This penalty on $C$ can be viewed as a form of regularization (see for instance [10]). Similar paradigms have been used extensively in the pioneering work of Warmuth and his colleagues on online learning (see for instance [7] and the references therein) and more recently for incorporating prior knowledge into boosting [11]. The regularization factor we employ is the KL-divergence between the images of $C$ and $C_0$ under the logistic transformation,
$$R(S \,|\, C, C_0) = \sum_{j=1}^{n} D[\sigma(Cy_j) \,\|\, \sigma(C_0 y_j)] .$$
The influence of this penalty term is controlled by a parameter $\alpha > 0$. The resulting objective function that we attempt to minimize is
$$O(S \,|\, C, T) = L(S \,|\, C, T) + \alpha R(S \,|\, C, C_0) , \quad (9)$$
where $\alpha$ and $C_0$ are fixed parameters. The goal of learning boils down to finding a pair $(C^\star, T^\star)$ that minimizes the objective function defined in Eq. (9). We would like to note that this objective function is not convex due to the concave entropic term in the definition of $\ell$. Therefore, the learning procedure described in the sequel converges to a local minimum or a saddle point of $O$.

6 The Learning Algorithm

The goal of the learning algorithm is to find $C$ and $T$ that minimize the objective function defined above. The algorithm alternates between two complementing steps for decreasing the objective function. The first step, called IMPROVE-T, improves $T$ while leaving $C$ unchanged, and the second step, called IMPROVE-C, finds the optimal matrix $C$ for any given matrix $T$. 
The algorithm is provided with initial matrices $C_0$ and $T_0$, where $C_0$ is assumed to have a large code distance $\rho$. The IMPROVE-T step makes the assumption that all of the instances in $S$ satisfy the constraints $\sum_{i=1}^{m} x_i \leq 1$ and $0 \leq x_i$ for all $i \in \{1, 2, \ldots, m\}$. Any finite training set can easily be shifted and scaled to conform with these constraints, and therefore they do not impose any real limitation. In addition, the IMPROVE-C step is presented for the case where $\mathcal{Y}$ is the set of standard unit vectors in $\mathbb{R}^r$.

BUNCH($S$, $\alpha \in \mathbb{R}_+$, $C_0 \in \mathbb{R}^{s \times r}$, $T_0 \in \mathbb{R}^{s \times m}$)
  For $t = 1, 2, \ldots$
    $T_t$ = IMPROVE-T($S$, $C_{t-1}$, $T_{t-1}$)
    $C_t$ = IMPROVE-C($S$, $\alpha$, $T_t$, $C_0$)

IMPROVE-T($S$, $C$, $T$)
  For $k = 1, 2, \ldots, s$ and $i = 1, 2, \ldots, m$
    $W^+_{k,i} = \sum_{(x,y) \in S} \sigma(C_k y) \, \sigma(-T_k x) \, x_i$
    $W^-_{k,i} = \sum_{(x,y) \in S} \sigma(-C_k y) \, \sigma(T_k x) \, x_i$
    $\Theta_{k,i} = \frac{1}{2} \ln\!\left( W^+_{k,i} / W^-_{k,i} \right)$
  Return $T + \Theta$

IMPROVE-C($S$, $\alpha$, $T$, $C_0$)
  For each $y \in \mathcal{Y}$
    $S_y = \{ (x, \bar{y}) \in S : \bar{y} = y \}$
    $C^{(y)} = C_0^{(y)} + \frac{1}{\alpha |S_y|} \sum_{x \in S_y} T x$
  Return $C = [\, C^{(1)}, \ldots, C^{(r)} \,]$

Figure 2: The Bunching Algorithm.

Since the regularization factor $R$ is independent of $T$, we can restrict our description and analysis of the IMPROVE-T step to the loss term $L$ of the objective function $O$. The IMPROVE-T step receives the current matrices $C$ and $T$ as input and calculates a matrix $\Theta$ that is used to update the current $T$ additively. Denoting the iteration index by $t$, the update is of the form $T_{t+1} = T_t + \Theta$. The next theorem states that updating $T$ by the IMPROVE-T step decreases the loss; otherwise $T$ remains unchanged and is globally optimal with respect to $C$. 
Again, the proof is omitted due to space constraints.

Theorem 2  Given matrices $C \in \mathbb{R}^{s \times r}$ and $T \in \mathbb{R}^{s \times m}$, let $W^+_{k,i}$, $W^-_{k,i}$ and $\Theta$ be as defined in the IMPROVE-T step of Fig. 2. Then the decrease in the loss $L$ is bounded below by
$$\sum_{k=1}^{s} \sum_{i=1}^{m} \left( \sqrt{W^+_{k,i}} - \sqrt{W^-_{k,i}} \right)^2 \;\leq\; L(S \,|\, C, T) - L(S \,|\, C, T + \Theta) .$$

Based on the theorem above we can derive the following corollary.

Corollary 1  If $\Theta$ is generated by a call to IMPROVE-T and $L(S \,|\, C, T + \Theta) = L(S \,|\, C, T)$, then $\Theta$ is the zero matrix and $T$ is globally optimal with respect to $C$.

In the IMPROVE-C step we fix the current matrix $T$ and find a code matrix $C$ that globally minimizes the objective function. According to the discussion above, the matrix $C$ defines an embedding of the label vectors from $\mathcal{Y}$ into $\mathcal{Z}$, and the images of this embedding constitute the classification rule. For each $y \in \mathcal{Y}$, denote its image under $C$ and the logistic transformation by $p_y = \sigma(Cy)$, and let $S_y$ be the subset of $S$ that is labeled $y$. Note that the objective function can be decomposed into $r$ separate summands according to $y$,
$$O(S \,|\, C, T) = \sum_{y \in \mathcal{Y}} O(S_y \,|\, C, T) ,$$
where
$$O(S_y \,|\, C, T) = \sum_{(x,y) \in S_y} \left( D[p_y \,\|\, \sigma(Tx)] + H[p_y] + \alpha D[p_y \,\|\, \sigma(C_0 y)] \right) .$$
We can therefore find for each $y \in \mathcal{Y}$ the vector $p_y$ that minimizes $O(S_y)$ independently, and then reconstruct the code matrix $C$ that achieves these values. 
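The IMPROVE-T step of Fig. 2 vectorizes naturally, since $\sigma(-z) = 1 - \sigma(z)$. Below is an illustrative NumPy sketch (the function name is ours); it assumes instances are stored as rows of `X` and already satisfy the non-negativity and sum constraints stated above, which in particular keeps $W^+$ and $W^-$ strictly positive.

```python
import numpy as np

def improve_t(X, Y, C, T):
    """One IMPROVE-T step: return T + Theta with
    Theta_{k,i} = 0.5 * ln(W+_{k,i} / W-_{k,i}).
    X: n x m instances (rows nonnegative, each summing to at most 1),
    Y: n x r labels as unit vectors, C: s x r, T: s x m."""
    P = 1.0 / (1.0 + np.exp(-(Y @ C.T)))   # n x s, entries sigma(C_k . y_j)
    Q = 1.0 / (1.0 + np.exp(-(X @ T.T)))   # n x s, entries sigma(T_k . x_j)
    W_plus = (P * (1 - Q)).T @ X           # s x m: sum_j sigma(C_k y) sigma(-T_k x) x_i
    W_minus = ((1 - P) * Q).T @ X          # s x m: sum_j sigma(-C_k y) sigma(T_k x) x_i
    Theta = 0.5 * np.log(W_plus / W_minus)
    return T + Theta
```

By Thm. 2, applying this step never increases the empirical loss $L(S \,|\, C, T)$, which is easy to verify numerically on small examples.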
It is straightforward to show that $O(S_y)$ is convex in $p_y$, and our task is reduced to finding its stationary point. We examine the derivative of $O(S_y)$ with respect to $p_{y,k}$ and get
$$\frac{\partial O(S_y)}{\partial p_{y,k}} = -\sum_{(x,y) \in S_y} \log\!\left( \frac{\sigma(T_k \cdot x)}{1 - \sigma(T_k \cdot x)} \right) + \alpha |S_y| \left( \log\!\left( \frac{p_{y,k}}{1 - p_{y,k}} \right) - C_{0,k} \cdot y \right) .$$
We now plug $p_y = \sigma(Cy)$ into the equation above and set it to zero to get
$$Cy = C_0 y + \frac{1}{\alpha |S_y|} \sum_{(x,y) \in S_y} T x .$$
Since $\mathcal{Y}$ was assumed to be the set of standard unit vectors, $Cy$ is a column of $C$, and the above is simply a column-wise assignment of $C$.

We have shown that each call to IMPROVE-T followed by IMPROVE-C decreases the objective function until convergence to a pair $(C^\star, T^\star)$ such that $C^\star$ is optimal given $T^\star$ and $T^\star$ is optimal given $C^\star$. Therefore $O(S \,|\, C^\star, T^\star)$ is either a minimum or a saddle point.

7 Experiments

To assess the merits of Bunching we compared it to a standard ECOC-based algorithm on numerous multiclass problems. For the ECOC-based algorithm we used a logistic regressor as the binary learning algorithm, trained using the parallel update described in [2]. 
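The closed-form column-wise assignment above makes IMPROVE-C a single pass over the labels. A minimal NumPy sketch (function name ours; it assumes, as the derivation does, that every label occurs at least once in $S$):

```python
import numpy as np

def improve_c(X, Y, alpha, T, C0):
    """IMPROVE-C: for each label y, set the corresponding column of C to
    C0 y + (1 / (alpha * |S_y|)) * sum_{x in S_y} T x.
    X: n x m instances, Y: n x r labels as unit vectors, alpha > 0."""
    n, r = Y.shape
    C = C0.astype(float).copy()
    TX = X @ T.T                    # n x s: the vector T x_j for every instance
    for y in range(r):
        mask = Y[:, y] == 1         # S_y: the examples labeled y
        C[:, y] += TX[mask].sum(axis=0) / (alpha * mask.sum())
    return C
```

Note that with `C0 = 0` and `alpha = 1`, each column of the returned code is simply the average of $Tx$ over the examples of that class, which gives some intuition for how the code words drift toward the class-conditional mean of the instance embeddings.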
The two approaches share the same form of classifiers (logistic regressors) and differ solely in the coding matrix they employ: while ECOC uses a fixed code matrix, Bunching adapts its code matrix during the learning process.

[Figure 3: The relative performance of Bunching compared to ECOC, for random and one-vs-rest codes, on the glass, isolet, letter, mnist, satimage, soybean and vowel datasets.]

We selected the following multiclass datasets: glass, isolet, letter, satimage, soybean and vowel from the UCI repository (www.ics.uci.edu/~mlearn/MLRepository.html), and the mnist dataset available from LeCun's homepage (yann.lecun.com/exdb/mnist/index.html). The only dataset not supplied with a test set is glass, for which we use 5-fold cross validation. For each dataset, we compare the test error rates attained by the ECOC classifier and the Bunching classifier. We conducted the experiments with two families of code matrices. The first family corresponds to the one-vs-rest approach, in which each class is trained against the rest of the classes; the corresponding code is a matrix whose logistic transformation is simply the identity matrix. The second family is the set of random code matrices with $r \log_2 r$ rows, where $r$ is the number of different labels. These matrices are used as $C_0$ for Bunching and as the fixed code for ECOC. Throughout all of the experiments with Bunching, we set the regularization parameter $\alpha$ to 1.

A summary of the results is depicted in Fig. 3. The height of each bar is proportional to $(e_E - e_B)/e_E$, where $e_E$ is the test error attained by the ECOC classifier and $e_B$ is the test error attained by the Bunching classifier. 
As shown in the figure, for almost all of the experiments conducted, Bunching outperforms standard ECOC. The improvement is more significant when using random code matrices. This can be explained by the fact that random code matrices tend to induce unnatural and rather difficult binary partitions of the set of labels. Since Bunching modifies the code matrix $C$ along its run, it can relax difficult binary problems. This suggests that Bunching can improve the classification accuracy in problems where, for instance, the one-vs-rest approach fails to give good results, or where there is a need to add error correction properties to the code matrix.

8 A Brief Discussion

In this paper we described a framework for solving multiclass problems via pairs of embeddings. The proposed framework can be viewed as a generalization of ECOC with logistic regressors. It is possible to extend our framework in a few ways. First, the probabilistic embeddings can be replaced with non-negative embeddings by replacing the logistic transformation with the exponential function. In this case, the KL divergence is replaced with its unnormalized version [2, 9]. The resulting generalized Bunching algorithm is somewhat more involved and less intuitive to understand. Second, while our work focuses on linear embeddings, our algorithm and analysis can be adapted to more complex mappings by employing kernel operators. This can be achieved by replacing the $k$'th scalar product $T_k \cdot x$ with an abstract inner product $\kappa(T_k, x)$. Last, we would like to note that it is possible to devise an alternative objective function to the one given in Eq. (9) which is jointly convex in $(T, \sigma(C))$ and for which we can state a bound of a form similar to the bound in Thm. 1.

References

[1] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. 
Journal of Machine Learning Research, 1:113-141, 2000.

[2] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 47(2/3):253-285, 2002.

[3] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In Proc. of the Thirteenth Annual Conference on Computational Learning Theory, 2000.

[4] I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue, 1:205-237, 1984.

[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Ser. B, 39:1-38, 1977.

[6] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, January 1995.

[7] J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1-64, January 1997.

[8] J. D. Lafferty. Additive models, boosting and inference for generalized divergences. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999.

[9] S. Della Pietra, V. Della Pietra, and J. Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CS-01-10, CMU, 2002.

[10] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1990.

[11] R. E. Schapire, M. Rochery, M. Rahim, and N. Gupta. Incorporating prior knowledge into boosting. In Machine Learning: Proceedings of the Nineteenth International Conference, 2002.