{"title": "Efficient Loss-Based Decoding on Graphs for Extreme Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 7233, "page_last": 7244, "abstract": "In extreme classification problems, learning algorithms are required to map instances to labels from an extremely large label set.\n  We build on a recent extreme classification framework with logarithmic time and space (LTLS), and on a general approach for error correcting output coding (ECOC) with loss-based decoding, and introduce a flexible and efficient approach accompanied by theoretical bounds.\n  Our framework employs output codes induced by graphs, for which we show how to perform efficient loss-based decoding to potentially improve accuracy.\n  In addition, our framework offers a tradeoff between accuracy, model size and prediction time.\n  We show how to find the sweet spot of this tradeoff using only the training data.\nOur experimental study demonstrates the validity of our assumptions and claims,  and shows that our method is competitive with state-of-the-art algorithms.", "full_text": "Ef\ufb01cient Loss-Based Decoding on Graphs for\n\nExtreme Classi\ufb01cation\n\nItay Evron\n\nComputer Science Dept.\n\nThe Technion, Israel\n\nEdward Moroshko\n\nElectrical Engineering Dept.\n\nThe Technion, Israel\n\nKoby Crammer\n\nElectrical Engineering Dept.\n\nThe Technion, Israel\n\nevron.itay@gmail.com\n\nedward.moroshko@gmail.com\n\nkoby@ee.technion.ac.il\n\nAbstract\n\nIn extreme classi\ufb01cation problems, learning algorithms are required to map in-\nstances to labels from an extremely large label set. 
We build on a recent extreme\nclassi\ufb01cation framework with logarithmic time and space [19], and on a general\napproach for error correcting output coding (ECOC) with loss-based decoding [1],\nand introduce a \ufb02exible and ef\ufb01cient approach accompanied by theoretical bounds.\nOur framework employs output codes induced by graphs, for which we show how\nto perform ef\ufb01cient loss-based decoding to potentially improve accuracy. In addi-\ntion, our framework offers a tradeoff between accuracy, model size and prediction\ntime. We show how to \ufb01nd the sweet spot of this tradeoff using only the training\ndata. Our experimental study demonstrates the validity of our assumptions and\nclaims, and shows that our method is competitive with state-of-the-art algorithms.\n\n1\n\nIntroduction\n\nMulticlass classi\ufb01cation is the task of assigning instances with a category or class from a \ufb01nite set. Its\nnumerous applications range from \ufb01nding a topic of a news item, via classifying objects in images,\nvia spoken words detection, to predicting the next word in a sentence. Our ability to solve multiclass\nproblems with larger and larger sets improves with computation power. Recent research focuses on\nextreme classi\ufb01cation where the number of possible classes K is extremely large.\nIn such cases, previously developed methods, such as One-vs-One (OVO) [17], One-vs-Rest (OVR) [9]\nand multiclass SVMs [34, 6, 11, 25], that scale linearly in the number of classes K, are not feasible.\nThese methods maintain too large models, that cannot be stored easily. Moreover, their training and\ninference times are at least linear in K, and thus do not scale for extreme classi\ufb01cation problems.\nRecently, Jasinska and Karampatziakis [19] proposed a Log-Time Log-Space (LTLS) approach,\nrepresenting classes as paths on graphs. LTLS is very ef\ufb01cient, but has a limited representation,\nresulting in an inferior accuracy compared to other methods. 
More than a decade earlier, Allwein et\nal. [1] presented a uni\ufb01ed view of error correcting output coding (ECOC) for classi\ufb01cation, as well\nas the loss-based decoding framework. They showed its superiority over Hamming decoding, both\ntheoretically and empirically.\nIn this work we build on these two works and introduce an ef\ufb01cient (i.e. O(log K) time and space)\nloss-based learning and decoding algorithm for any loss function of the binary learners\u2019 margin. We\nshow that LTLS can be seen as a special case of ECOC. We also make a more general connection\nbetween loss-based decoding and graph-based representations and inference. Based on the theoretical\nframework and analysis derived by [1] for loss-based decoding, we gain insights on how to improve\non the speci\ufb01c graphs proposed in LTLS by using more general trellis graphs \u2013 which we name\nWide-LTLS (W-LTLS). Our method pro\ufb01ts from the best of both worlds: better accuracy as in loss-\nbased decoding, and the logarithmic time and space of LTLS. Our empirical study suggests that by\n\n32nd Conference on Neural Information Processing Systems (NIPS 2018), Montr\u00e9al, Canada.\n\n\femploying coding matrices induced by different trellis graphs, our method allows tradeoffs between\naccuracy, model size, and inference time, especially appealing for extreme classi\ufb01cation.\n\n2 Problem setting\n\nWe consider multiclass classi\ufb01cation with K classes, where K is very large. Given a training set of m\nexamples (xi, yi) for xi \u2208 X \u2286 Rd and yi \u2208 Y = {1, ..., K} our goal is to learn a mapping from X\nto Y. We focus on the 0/1 loss and evaluate the performance of the learned mapping by measuring its\naccuracy on a test set \u2013 i.e. the fraction of instances with a correct prediction. 
Formally, the accuracy\nof a mapping h : X \u2192 Y on a set of n pairs, $\\{(x_i, y_i)\\}_{i=1}^{n}$, is defined as $\\frac{1}{n} \\sum_{i=1}^{n} \\mathbf{1}_{h(x_i) = y_i}$, where\n$\\mathbf{1}_z$ equals 1 if the predicate z is true, and 0 otherwise.\n\n3 Error Correcting Output Coding (ECOC)\n\nDietterich and Bakiri [14] employed ideas from coding theory [23] to create Error Correcting Output\nCoding (ECOC) \u2013 a reduction from a multiclass classification problem to multiple binary classification\nsubproblems. In this scheme, each class is assigned a (distinct) binary codeword of \u2113 bits (with\nvalues in {\u22121, +1}). The K codewords create a matrix $M \\in \\{-1, +1\\}^{K \\times \\ell}$ whose rows are the\ncodewords and whose columns induce \u2113 partitions of the classes into two subsets. Each of these\npartitions induces a binary classification subproblem. We denote by $M_k$ the kth row of the matrix,\nand by $M_{k,j}$ its (k, j) entry. In the jth partition, class k is assigned the binary label $M_{k,j}$.\nECOC introduces redundancy in order to acquire error-correcting capabilities such as a minimum\nHamming distance between codewords. The Hamming distance between two codewords $M_a$, $M_b$\nis defined as $\\rho(a, b) \\triangleq \\sum_{j=1}^{\\ell} \\frac{1 - M_{a,j} M_{b,j}}{2}$, and the minimum Hamming distance of M is $\\rho = \\min_{a \\neq b} \\rho(a, b)$. A high minimum distance of the coding matrix potentially allows overcoming binary\nclassification errors during inference time.\nAt training time, this scheme generates \u2113 binary classification training sets of the form $\\{(x_i, M_{y_i,j})\\}_{i=1}^{m}$\nfor j = 1, . . . , \u2113, and executes some binary classification learning algorithm that returns \u2113 classifiers,\neach trained on one of these sets. We assume these classifiers are margin-based, that is, each classifier\nis a real-valued function, $f_j : X \\to \\mathbb{R}$, whose binary prediction for an input x is $\\mathrm{sign}(f_j(x))$. The
\nbinary classification learning algorithm defines a margin-based loss $L : \\mathbb{R} \\to \\mathbb{R}_{+}$, and minimizes the\naverage loss over the induced set. Formally, $f_j = \\arg\\min_{f \\in \\mathcal{F}} \\frac{1}{m} \\sum_{i=1}^{m} L(M_{y_i,j} f(x_i))$, where $\\mathcal{F}$ is\na class of functions, such as the class of bounded linear functions. A few well-known loss functions are\nthe hinge loss $L(z) \\triangleq \\max(0, 1 - z)$, used by SVM, its square, the log loss $L(z) \\triangleq \\log(1 + e^{-z})$\nused in logistic regression, and the exponential loss $L(z) \\triangleq e^{-z}$ used in AdaBoost [30].\nOnce these classifiers are trained, a straightforward inference is performed. Given an input x,\nthe algorithm first applies the \u2113 functions on x and computes a {\u00b11}-vector of size \u2113, that is\n$(\\mathrm{sign}(f_1(x)), \\ldots, \\mathrm{sign}(f_{\\ell}(x)))$. Then, the class k which is assigned to the codeword closest in\nHamming distance to this vector is returned. This inference scheme is often called Hamming\ndecoding.\nHamming decoding uses only the binary predictions of the binary learners, ignoring the confidence\neach learner has in its prediction per input. Allwein et al. [1] showed that this margin or confidence\nholds valuable information for predicting a class y \u2208 Y, and proposed the loss-based decoding\nframework for ECOC (see footnote 1). In loss-based decoding, the margin is incorporated via the loss function L(z).\nSpecifically, the class predicted is the one minimizing the total loss\n\n$$k^{*} = \\arg\\min_{k} \\sum_{j=1}^{\\ell} L(M_{k,j} f_j(x)). \\qquad (1)$$\n\nThey [1] also developed error bounds and showed theoretically and empirically that loss-based\ndecoding outperforms Hamming decoding.\n\n1 Another contribution of their work, less relevant to our work, is a unifying approach for multiclass\nclassification tasks. They showed that many popular approaches are unified into a framework of sparse (ternary)\ncoding schemes with a coding matrix $M \\in \\{-1, 0, 1\\}^{K \\times \\ell}$. 
For example, One-vs-Rest (OVR) could be thought\nof as K \u00d7 K matrix whose diagonal elements are 1, and the rest are -1.\n\n2\n\n\fFigure 1: Path codeword representation. An\nentry containing 1 means that the correspond-\ning edge is a part of the illustrated bold blue\npath. The green dashed rectangle shows a verti-\ncal slice.\n\nFigure 2: Two closest paths. Predicting Path II\n(red) instead of I (blue), will result in a predic-\ntion error. The Hamming distance between the\ncorresponding codewords is 4. The highlighted\nentries correspond to the 4 disagreement edges.\n\nOne drawback of their method is that given a loss function L, loss-based decoding requires an\nexhaustive evaluation of the total loss for each codeword Mk (each row of the coding matrix). This\nimplies a decoding time at least linear in K, making it intractable for extreme classi\ufb01cation. We\naddress this problem below.\n\n4 LTLS\n\nA recent extreme classi\ufb01cation approach, proposed by Jasinska and Karampatziakis [19], performs\ntraining and inference in time and space logarithmic in K, by embedding the K classes into K paths\nof a directed-acyclic trellis graph T , built compactly with (cid:96) = O (log K) edges. We denote the set\nof vertices V and set of edges E. A multiclass model is de\ufb01ned using (cid:96) functions from the feature\nspace to the reals, wj (x), one function per edge in E = {ej}(cid:96)\nj=1. Given an input, the algorithm\nassigns weights to the edges, and computes the heaviest path using the Viterbi [32] algorithm in\nO (|E|) = O (log K) time. It then outputs the class (from Y) assigned to the heaviest path.\nJasinska and Karampatziakis [19] proposed to train the model in an online manner. The algorithm\nmaintains (cid:96) functions fj(x) and works in rounds. 
In each training round a speci\ufb01c input-output pair\n(xi, yi) is considered, the algorithm performs inference using the (cid:96) functions to predict a class \u02c6yi,\nand the functions fj(x) are modi\ufb01ed to improve the overall prediction for xi according to yi, \u02c6yi.\nThe inference performed during train and test times, includes using the obtained functions fj(x) to\ncompute the weights wj(x) of each input, by simply setting wj(x) = fj(x). Speci\ufb01cally, they used\nmargin-based learning algorithms, where fj(x) is the margin of a binary prediction.\nOur \ufb01rst contribution is the observation that the LTLS approach can be thought of as an ECOC scheme,\nin which the codewords (rows) represent paths in the trellis graph, and the columns correspond to\nedges on the graph. Figure 1 illustrates how a codeword corresponds to a path on the graph.\nIt might seem like this approach can represent only numbers of classes K which are powers of 2.\nHowever, in Appendix C.1 we show how to create trellis graphs with exactly K paths, for any K \u2208 N.\n\n4.1 Path assignment\n\nLTLS requires a bijective mapping between paths to classes and vice versa. It was proposed in [19]\nto employ a greedy assignment policy suitable for online learning, where during training, a sample\nwhose class is yet unassigned with a path, is assigned with the heaviest unassigned path. One could\nalso consider a naive random assignment between paths and classes.\n\n4.2 Limitations\n\nThe elegant LTLS construction suffers from two limitations:\n\n1. Dif\ufb01cult induced binary subproblems: The induced binary subproblems are hard, especially\nwhen learned with linear classi\ufb01ers. Each path uses one of four edges between every two adjacent\nvertical slices. 
Therefore, each edge is used by $\\frac{1}{4}$ of the classes, inducing a $\\frac{1}{4}K$-vs-$\\frac{3}{4}K$ subproblem.\nSimilarly, the edges connected to the source or sink induce $\\frac{1}{2}K$-vs-$\\frac{1}{2}K$ subproblems. In both\ncases classes are split into two groups, almost arbitrarily, with no clear semantic interpretation\nfor that partition. For comparison, in 1-vs-Rest (OVR) the induced subproblems are considered\nmuch simpler as they require classifying only one class vs the rest (meaning they are much less\nbalanced); see footnote 2.\n\n2. Low minimum distance: In the LTLS trellis architecture, every path has another (closest) path\nwithin 2 edge deletions and 2 edge insertions (see Figure 2). Thus, the minimum Hamming\ndistance in the underlying coding matrix is restrictively small: \u03c1 = 4, which might imply a\npoor error correcting capability. The OVR coding matrix also suffers from a small minimum\ndistance (\u03c1 = 2), but as we explained, the induced subproblems are very simple, allowing a higher\nclassification accuracy in many cases.\n\n[Codeword tables of Figures 1 and 2, edge indices 0 to 11. Path (Figure 1) and Path I (Figure 2): 1 -1 1 -1 -1 -1 -1 1 -1 -1 -1 1. Path II (Figure 2): 1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 1.]\n\nWe focus on improving the multiclass accuracy by tackling the first limitation, namely making the\nunderlying binary subproblems easier. Addressing the second limitation is deferred to future work.\n\n5 Efficient loss-based decoding\n\nWe now introduce another contribution \u2013 a new algorithm performing efficient loss-based decoding\n(inference) for any loss function by exploiting the structure of trellis graphs. Similarly to [19], our\ndecoding algorithm performs inference in two steps. 
First, it assigns (per input x to be classified)\nweights $\\{w_j(x)\\}_{j=1}^{\\ell}$ to the edges $\\{e_j\\}_{j=1}^{\\ell}$ of the trellis graph. Second, it finds the shortest path\n(instead of the heaviest) $P_{k^*}$ by an efficient dynamic programming (Viterbi) algorithm and predicts\nthe class $k^*$. Unlike [19], our strategy for assigning edge weights ensures that for any class k,\nthe weight of the path assigned to this class, $w(P_k) \\triangleq \\sum_{j : e_j \\in P_k} w_j(x)$, equals the total loss\n$\\sum_{j=1}^{\\ell} L(M_{k,j} f_j(x))$ for the classified input x. Therefore, finding the shortest path on the graph is\nequivalent to minimizing the total loss, which is the aim in loss-based decoding. In other words, we\ndesign a new weighting scheme that links loss-based decoding to the shortest path in a graph.\nWe now describe our algorithm in more detail for the case when the number of classes K is a\npower of 2 (see Appendix C.2 for extension to arbitrary K). Consider a directed edge $e_j \\in E$\nand denote by $(u_j, v_j)$ the two vertices it connects. Denote by $S(e_j)$ the set of edges outgoing\nfrom the same vertical slice as $e_j$. Formally, $S(e_j) = \\{(u, u') : \\delta(u) = \\delta(u_j)\\}$, where $\\delta(v)$ is the\nshortest distance from the source vertex to v (in terms of number of edges). For example, in Figure 1,\n$S(e_0) = S(e_1) = \\{e_0, e_1\\}$ and $S(e_2) = S(e_3) = S(e_4) = S(e_5) = \\{e_2, e_3, e_4, e_5\\}$. Given a loss\nfunction L(z) and an input instance x, we set the weight $w_j$ for edge $e_j$ as follows,\n\n$$w_j(x) = L(1 \\times f_j(x)) + \\sum_{j' : e_{j'} \\in S(e_j) \\setminus \\{e_j\\}} L((-1) \\times f_{j'}(x)). \\qquad (2)$$\n\nFor example, in Figure 1 we have,\n\n$$w_0(x) = L(1 \\times f_0(x)) + L((-1) \\times f_1(x)),$$\n$$w_2(x) = L(1 \\times f_2(x)) + L((-1) \\times f_3(x)) + L((-1) \\times f_4(x)) + L((-1) \\times f_5(x)).$$\n\nThe next theorem states that for our choice of weights, finding the shortest path in the weighted\ngraph is equivalent to loss-based decoding. 
Thus, algorithmically we can enjoy fast decoding (i.e.\ninference), and statistically we can enjoy better performance by using loss-based decoding.\n\nTheorem 1 Let L(z) be any loss function of the margin. Let T be a trellis graph with an underlying\ncoding matrix M. Assume that for any x \u2208 X the edge weights are calculated as in Eq. (2). Then,\nthe weight of any path $P_k$ equals the loss suffered by predicting its corresponding class k, i.e.\n$w(P_k) = \\sum_{j=1}^{\\ell} L(M_{k,j} f_j(x))$.\n\nThe proof appears in Appendix A. In the next lemma we claim that LTLS decoding is a special case\nof loss-based decoding with the squared loss function. See Appendix B for proof.\n\nLemma 2 Denote the squared loss function by $L_{sq}(z) \\triangleq (1 - z)^2$. Given a trellis graph represented\nusing a coding matrix $M \\in \\{-1, +1\\}^{K \\times \\ell}$, and \u2113 functions $f_j(x)$, for j = 1 . . . \u2113, the decoding\nmethod of LTLS (mentioned in Section 4) is a special case of loss-based decoding with the squared\nloss, that is $\\arg\\max_k w(P_k) = \\arg\\min_k \\left\\{ \\sum_{j} L_{sq}(M_{k,j} f_j(x)) \\right\\}$.\n\n2 A similar observation is given in Section 6 of Allwein et al. [1] regarding OVR.\n\nFigure 3: Different graphs for K = 64 classes. From left to right: the LTLS graph with a slice width\nof b = 2, W-LTLS with b = 4, and the widest W-LTLS graph with b = 64, corresponding to OVR.\n\nWe next build on the framework of [1] to design graphs with a better multiclass accuracy.\n\n6 Wide-LTLS (W-LTLS)\n\nAllwein et al. 
[1] derived error bounds for loss-based decoding with any convex loss function L.\nThey showed that the training multiclass error with loss-based decoding is upper bounded by:\n\n$$\\frac{\\ell \\times \\varepsilon}{\\rho \\times L(0)} \\qquad (3)$$\n\nwhere \u03c1 is the minimum Hamming distance of the code and\n\n$$\\varepsilon = \\frac{1}{m \\ell} \\sum_{i=1}^{m} \\sum_{j=1}^{\\ell} L(M_{y_i,j} f_j(x_i)) \\qquad (4)$$\n\nis the average binary loss on the training set of the learned functions $\\{f_j\\}_{j=1}^{\\ell}$ with respect to a coding\nmatrix M and a loss L. One approach to reduce the bound, and thus hopefully also the multiclass\ntraining error (and under some conditions also the test error) is to reduce the total error of the binary\nproblems \u2113 \u00d7 \u03b5. We now show how to achieve this by generalizing the LTLS framework to a more\nflexible architecture which we call W-LTLS (see footnote 3).\nMotivated by the error bound of [1], we propose a generalization of the LTLS model. By increasing\nthe slice width of the trellis graph, and consequently increasing the number of edges between adjacent\nvertical slices, the induced subproblems become less balanced and potentially easier to learn (see\nRemark 2). For simplicity we choose a fixed slice width b \u2208 {2, . . . , K} for the entire graph (e.g.\nsee Figure 3). In such a graph, most of the induced subproblems are $\\frac{1}{b^2}K$-vs-rest (corresponding to\nedges between adjacent slices) and some are $\\frac{1}{b}K$-vs-rest (the ones connected to the source or to the\nsink). As b increases, the graph representation becomes less compact and requires more edges, i.e. \u2113\nincreases. However, the induced subproblems potentially become easier, improving the multiclass\naccuracy. This suggests that our model allows an accuracy vs model size tradeoff.\nIn the special case where b = K we get the widest graph containing 2K edges (see Figure 3). 
All the\nsubproblems are now 1-vs-rest: the kth path from the source to the sink contains two edges (one from\nthe source and one to the sink) which are not a part of any other path. Thus, the corresponding two\ncolumns in the underlying coding matrix are identical \u2013 having 1 at their kth entry and (\u22121) at the\nrest. This implies that the distinct columns of the matrix could be rearranged as the diagonal coding\nmatrix corresponding to OVR, making our model when b = K an implementation of OVR.\nIn Section 7 we show empirically that W-LTLS improves the multiclass accuracy of LTLS. In\nAppendix E.2 we show that the binary subproblems indeed become easier, i.e. we observe a decrease\nin the average binary loss \u03b5, lowering the bound in (3). Note that the denominator \u03c1 \u00d7 L(0) is left\nuntouched \u2013 the minimum distance of the coding matrices corresponding to different architectures of\nW-LTLS is still 4, like in the original LTLS model (see Section 4.2).\n\n3 Code is available online at https://github.com/ievron/wltls/\n\n6.1 Time and space complexity analysis\n\nW-LTLS requires training and storing a binary learner for every edge. For most linear classifiers\n(with d parameters each) we get (see footnote 4) a total model size complexity and an inference time complexity\nof $O(d|E|) = O\\left(d \\frac{b^2}{\\log b} \\log K\\right)$ (see Appendix D for further details). Moreover, many extreme\nclassification datasets are sparse \u2013 the average number of non-zero features in a sample is $d_e \\ll d$.\nThe inference time complexity thus decreases to $O\\left(d_e \\frac{b^2}{\\log b} \\log K\\right)$.\n\nThis is a significant advantage: while inference with loss-based decoding for general matrices requires\n$O(d_e \\ell + K \\ell)$ time, our model performs it in only $O(d_e \\ell + \\ell) = O(d_e \\ell)$.\nSince training requires learning \u2113 binary subproblems, the training time complexity is also sublinear\nin K. 
These subproblems can be learned separately on (cid:96) cores, leading to major speedups.\n\n6.2 Wider graphs induce sparse models\n\nThe high sparsity typical to extreme classi\ufb01cation datasets (e.g. the Dmoz dataset has d = 833, 484\nfeatures, but on average only de = 174 of them are non-zero), is heavily exploited by previous works\nsuch as PD-Sparse [15], PPDSparse [35], and DiSMEC [2], which all learn sparse models.\nIndeed, we \ufb01nd that for sparse datasets, our algorithm typically learns a model with a low percentage\nof non-zero weights. Moreover, the percentage of non-zero decreases signi\ufb01cantly as the slice width\nb is increased (see Appendix E.6). This allows us to employ a simple post-pruning of the learned\nweights. For some threshold value \u03bb, we set to zero all learned weights in [\u2212\u03bb, \u03bb], yielding a sparse\nmodel. Similar approaches were taken by [2, 19, 15] either explicitly or implicitly.\nIn Section 7.3 we show that the above scheme successfully yields highly sparse models.\n\n7 Experiments\n\nWe test our algorithms on 5 extreme multiclass datasets previously used in [15], having approximately\n102, 103, and 104 classes (see Table 1 in Appendix E.1). We use AROW [10] to train the binary\nfunctions {fj}(cid:96)\nj=1 of W-LTLS. Its online updates are based on the squared hinge loss LSH (z) (cid:44)\n(max (0, 1 \u2212 z))2. For each dataset, we build wide graphs with multiple slice widths. For each\ncon\ufb01guration (dataset and graph) we perform \ufb01ve runs using random sample shuf\ufb02ing on every epoch,\nand a random path assignment (as explained in Section 4.1, unlike the greedy policy used in [19]),\nand report averages over these \ufb01ve runs. Unlike [19], we train the (cid:96) binary learners independently\nrather than in a joint (structured) manner. 
This allows parallel independent training, as common for\ntraining binary learners for ECOC, with no need to perform full multiclass inference during training.\n\n7.1 Loss-based decoding\n\nWe run W-LTLS with different loss functions for loss-based decoding: the exponential loss, the\nsquared loss (used by LTLS, see Lemma 2), the log loss, the hinge loss, and the squared hinge loss.\nThe results appear in Figure 4. We observe that decoding with the exponential loss works the best\non all \ufb01ve datasets. For the two largest datasets (Dmoz and LSHTC1) we report signi\ufb01cant accuracy\nimprovement when using the exponential loss for decoding in graphs with large slice widths (b),\nover the squared loss used implicitly by LTLS. Indeed, for these larger values of b, the subproblems\nare easier (see Appendix E.2 for detailed analysis). This should result in larger prediction margins\n|fj (x)|, as we indeed observe empirically (shown in Appendix E.4). The various loss functions L (z)\ndiffer signi\ufb01cantly for z (cid:28) 0, potentially explaining why we \ufb01nd larger accuracy differences as b\nincreases when decoding with different loss functions.\n\n6\n\n\fFigure 4: First row: Multiclass test accuracy as a function of the model size (MBytes) for loss-based\ndecoding with different loss functions. Second row: Relative increase in multiclass test accuracy\ncompared to decoding with the squared loss used implicitly in LTLS. The secondary x-axes (top axes,\nblue) indicate the slice widths (b) used for the W-LTLS trellis graphs.\n\nFigure 5: First row: Multiclass test accuracy vs model size. Second row: Multiclass test accuracy vs\nprediction time. A 95% con\ufb01dence interval is shown for the results of W-LTLS.\n\n7.2 Multiclass test accuracy\n\nWe compare the multiclass test accuracy of W-LTLS (using the exponential loss for decoding) to the\nsame baselines presented in [19]. 
Namely we compare to LTLS [19], LOMTree [7] (results quoted\nfrom [19]), FastXML [29] (run with the default parameters on the same computer as our model),\nand OVR (binary learners trained using AROW). For convenience, the results are also presented in a\ntabular form in Appendix E.5.\n\n7.2.1 Accuracy vs Model size\n\nThe \ufb01rst row of Figure 5 (best seen in color) summarizes the multiclass accuracies vs model size.\nAmong the four competitors, LTLS enjoys the smallest model size, LOMTree and FastXML have\nlarger model sizes, and OVR is the largest. LTLS achieves lower accuracies than LOMTree on two\ndatasets, and higher ones on the other two. OVR enjoys the best accuracy, yet with a price of model\nsize. For example, in Dmoz, LTLS achieves 23% accuracy vs 35.5% of OVR, though the model size\nof the latter is \u00d7200 larger than of the former.\nIn all \ufb01ve datasets, an increase in the slice width of W-LTLS (and consequently in the model size)\ntranslates almost always to an increase in accuracy. Our model is often better or competitive with the\nother algorithms that have logarithmic inference time complexity (LTLS, LOMTree, FastXML), and\nalso competitive with OVR in terms of accuracy, while we still enjoy much smaller model sizes.\nFor the smallest model sizes of W-LTLS (corresponding to b = 2), our trellis graph falls back to the\none of LTLS. The accuracies gaps between these two models may be explained by the different binary\nlearners the experiments were run with \u2013 LTLS used averaged Perceptron as the binary learner whilst\nwe used AROW. Also, LTLS was trained in a structured manner with a greedy path assignment policy\nwhile we trained every binary function independently with a random path assignment policy (see\nSection 6.1). In our runs we observed that independent training achieves accuracy competitive with\nto structured online training, while usually converging much faster. 
It is interesting to note that for\n\n4 Clearly, when b \u2248 \u221a\n\nstudy shows that high accuracy can be achieved using much smaller values of b.\n\nK our method cannot be regarded as sublinear in K anymore. However, our empirical\n\n7\n\n2324 919293949596Test accuracy (%)sectorexpSquaredLogHingeSq. Hinge245710272829210 858789919395aloi_bin23571015202221202122Model size (MB)024681012imageNet25101520302829210211212213 252831343740Dmoz23510203040502829210211212213 81114172023LSHTC1expSquaredLogHingeSq. Hinge23510203040452324 0.20.10.00.10.2Accuracy delta (%)expSquaredLogHingeSq. Hinge245710272829210 0.400.250.100.050.2023571015202221202122Model size (MB)1.300.650.000.651.3025101520302829210211212213 3.01.50.01.53.023510203040502829210211212213 63036expSquaredLogHingeSq. Hinge2351020304045232425 818487909396Test accuracy (%)sectorW-LTLSLTLSLOMTreeFastXMLOVR245710272829210211 818487909396aloi_bin2357101520222022242628210Model size (MB)03691215imageNet251020302829210211212213214215 202428323640Dmoz23510203040502829210211212213214215216 81114172023LSHTC1W-LTLSLTLSLOMTreeFastXMLOVR2351020304522 818487909396Test accuracy (%)W-LTLSLTLSLOMTreeFastXML245710202122 818487909396Test accuracy (%)235710152024252627Prediction time (sec)03691215Test accuracy (%)2510152030232425262728 202428323640Test accuracy (%)2351020304050202122232425 81114172023Test accuracy (%)W-LTLSLTLSLOMTreeFastXML2351020304045\fFigure 6: Multiclass test accuracy vs model size for sparse models. Lines between two W-LTLS\nplots connect the same models before and after the pruning. The secondary x-axes (top axes, blue)\nindicate the slice widths (b) used for the (unpruned) W-LTLS trellis graphs.\n\nthe imageNet dataset the LTLS model cannot \ufb01t the data, i.e the training error is close to 1 and the\ntest accuracy is close to 0. The reason is that the binary subproblems are very hard, as was also noted\nby [19]. 
By increasing the slice width (b), the W-LTLS model mitigates this under\ufb01tting problem,\nstill with logarithmic time and space complexity.\nWe also observe in the \ufb01rst row of Figure 5 that there is a point where the multiclass test accuracy\nof W-LTLS starts to saturate (except for imageNet). Our experiments show that this point can\nbe found by looking at the training error and its bound only. We thus have an effective way to\nchoose the optimal model size for the dataset and space/time budget at hand by performing model\nselection (width of the graph in our case) using the training error bound only (see detailed analysis\nAppendix E.2 and Appendix E.3).\n\n7.2.2 Accuracy vs Prediction time\n\nIn the second row of Figure 5 we compare prediction (inference) time of W-LTLS to other methods.\nLTLS enjoys the fastest prediction time, but suffers from low accuracy. LOMTree runs slower than\nLTLS, but sometimes achieves better accuracy. Despite being implemented in Python, W-LTLS is\ncompetitive with FastXML, which is implemented in C++.\n\n7.3 Exploiting the sparsity of the datasets\n\nWe now demonstrate that the post-pruning proposed in Section 6.2, which zeroes the weights in\n[\u2212\u03bb, \u03bb], is highly bene\ufb01cial. Since imageNet is not sparse at all, we do not consider it in this section.\nWe tune the threshold \u03bb so that the degradation in the multiclass validation accuracy is at most 1%\n(tuning the threshold is done after the cumbersome learning of the weights, and does not require\nmuch time).\nIn Figure 6 we plot the multiclass test accuracy versus model size for the non-sparse W-LTLS, as well\nas the sparse W-LTLS after pruning the weights as explained above. We compare ourselves to the\naforementioned sparse competitors: DiSMEC, PD-Sparse, and PPDSparse (all results quoted from\n[35]). 
Since the aforementioned FastXML [29] also exploits sparsity to reduce the size of the learned\ntrees, we consider it here as well (we run the code supplied by the authors for various numbers of\ntrees). For convenience, all the results are also presented in a tabular form in Appendix E.6.\nWe observe that our method can induce very sparse binary learners with a small degradation in\naccuracy. In addition, as expected, the wider the graphs (larger b), the more beneficial the pruning.\nInterestingly, while the number of parameters increases as the graphs become wider, the actual storage\nspace for the pruned sparse models may even decrease. This phenomenon is observed for the sector\nand aloi.bin datasets.\nFinally, we note that although PD-Sparse [15] and DiSMEC [2] perform better on some model size\nregions of the datasets, their worst-case space requirement during training is linear in the number of\nclasses K, whereas our approach guarantees (adjustable) logarithmic space for training.\n\n[Plots of Figure 6. Panels: sector, aloi_bin, Dmoz, LSHTC1. Axes: model size (MB) against test accuracy (%). Legend entries: W-LTLS, Sp. W-LTLS, FastXML, DiSMEC, PDSparse, PPDSparse.]\n\n8 Related work\n\nExtreme classification was studied extensively in the past decade. It faces unique challenges, amongst\nwhich is the model size of its designated learning algorithms. An extremely large model size often
Also, when the number\nof classes K is extremely large, the inference time complexity should be sublinear in K for the\nclassifier to be useful.\nThe Error Correcting Output Coding (ECOC) (see Section 3) approach seems promising for extreme\nclassification, as it potentially allows a very compact representation of the label space with K\ncodewords of length \u2113 = O (log K). Indeed, many works concentrated on utilizing ECOC for\nextreme classification. Some formulate dedicated optimization problems to find ECOC matrices\nsuitable for extreme classification [8] and others focus on learning better binary learners [24].\nHowever, very little attention has been given to the decoding time complexity. In the multiclass\nregime where only one class is the correct class, many of these works are forced to use exact (i.e. not\napproximate) decoding algorithms which often require O (K\u2113) time [21] in the worst case. Norouzi\net al. [27] proposed a fast exact nearest-neighbor search algorithm in the Hamming space, which\nfor coding matrices suitable for extreme classification can achieve o (K) time complexity, but not\nO (log K). These algorithms are often limited to binary (dense) matrices and hard decoding. Some\napproaches [22] utilize graphics processing units in order to find the nearest neighbor in Euclidean\nspace, which can be useful for soft decoding, but might be too demanding for weaker devices. In our\nwork we keep the time complexity of any loss-based decoding logarithmic in K.\nMoreover, most existing ECOC methods employ coding matrices with higher minimum distance\n\u03c1, but with balanced binary subproblems. 
In Section 6 we explain how our ability to induce\nless balanced subproblems is beneficial both for the learnability of these subproblems, and for the\npost-pruning of learned weights to create sparse models.\nIt is also worth mentioning that many of the ECOC-based works (like randomized or learned codes\n[8, 37]) require storing the entire coding matrix even during inference time. Hence, the additional\nspace complexity needed only for decoding during inference is O (K log K), rather than O (K) as in\nLTLS and W-LTLS which do not directly use the coding matrix for decoding the binary predictions\nand only require a mapping from code to label (e.g. a binary tree).\nNaturally, hierarchical classification approaches are very popular for extreme classification tasks.\nMany of these approaches employ tree based models [3, 29, 28, 18, 20, 7, 12, 4, 26, 13]. Such\nmodels can be seen as decision trees allowing inference time complexity linear in the tree height,\nthat is O (log K) if the tree is (approximately) balanced. A few models even achieve logarithmic\ntraining time, e.g. [20]. Despite having a sublinear time complexity, these models require storing\nO (K) classifiers.\nAnother line of research focused on label-embedding methods [5, 31, 33, 36]. These methods try to\nexploit label correlations and project the labels onto a low-dimensional space, reducing training and\nprediction time. However, the low-rank assumption usually leads to an accuracy degradation.\nLinear methods were also the focus of some recent works [2, 15, 35]. They learn a linear classifier\nper label and incorporate sparsity assumptions or perform distributed computations. However, the\ntraining and prediction complexities of these methods do not scale gracefully to datasets with a\nvery large number of labels. Using a similar post-pruning approach and independent (i.e. 
not joint)\nlearning of the subproblems, W-LTLS is also capable of exploiting sparsity and learning in parallel.\n\n9 Conclusions and Future work\n\nWe propose a new efficient loss-based decoding algorithm that works for any loss function. Motivated\nby a general error bound for loss-based decoding [1], we show how to build on the log-time log-space\n(LTLS) framework [19] and employ a more general type of trellis graph architecture. Our method\noffers a tradeoff between multiclass accuracy, model size and prediction time, and achieves better\nmulticlass accuracies under logarithmic time and space guarantees.\nMany intriguing directions remain unexplored, suggesting a variety of possible future work. One\ncould try to improve the restrictively low minimum code distance of W-LTLS discussed in Section 4.2.\nRegularization terms could also be introduced, to try and further improve the learned sparse models.\nMoreover, it may be interesting to consider weighting every entry of the coding matrix (in the spirit of\nEscalera et al. [16]) in the context of trellis graphs. Finally, many ideas in this paper can be extended\nto other types of graphs and graph algorithms.\n\nAcknowledgements\n\nWe would like to thank Eyal Bairey for the fruitful discussions. This research was supported in part\nby The Israel Science Foundation, grant No. 2030/16.\n\nReferences\n[1] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A\nunifying approach for margin classifiers. Journal of Machine Learning Research, 1:113\u2013141,\n2000.\n\n[2] Rohit Babbar and Bernhard Sch\u00f6lkopf. DiSMEC: Distributed sparse machines for extreme\nmulti-label classification. In Proceedings of the Tenth ACM International Conference on Web\nSearch and Data Mining, WSDM \u201917, pages 721\u2013729, New York, NY, USA, 2017. ACM.\n\n[3] Samy Bengio, Jason Weston, and David Grangier. Label embedding trees for large multi-class\ntasks. 
Advances in Neural Information Processing Systems, 23(1):163\u2013171, 2010.\n\n[4] Alina Beygelzimer, John Langford, Yuri Lifshits, Gregory Sorkin, and Alex Strehl. Conditional\nprobability tree estimation analysis and algorithms. In Proceedings of the Twenty-Fifth\nConference on Uncertainty in Artificial Intelligence, UAI \u201909, pages 51\u201358, Arlington, Virginia,\nUnited States, 2009. AUAI Press.\n\n[5] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse\nlocal embeddings for extreme multi-label classification. In Advances in Neural Information\nProcessing Systems 28: Annual Conference on Neural Information Processing Systems 2015,\nDecember 7-12, 2015, Montreal, Quebec, Canada, pages 730\u2013738, 2015.\n\n[6] Erin J. Bredensteiner and Kristin P. Bennett. Multicategory classification by support vector\nmachines. Computational Optimization and Applications, 12:53\u201379, 1999.\n\n[7] Anna Choromanska and John Langford. Logarithmic time online multiclass prediction. In\nProceedings of the 28th International Conference on Neural Information Processing Systems -\nVolume 1, NIPS\u201915, pages 55\u201363, Cambridge, MA, USA, 2015. MIT Press.\n\n[8] Moustapha Cisse, Thierry Artieres, and Patrick Gallinari. Learning compact class codes for\nfast inference in large multi class classification. Lecture Notes in Computer Science (including\nsubseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7523\nLNAI(PART 1):506\u2013520, 2012.\n\n[9] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273\u2013297,\n1995.\n\n[10] Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight vectors.\nIn Advances in Neural Information Processing Systems 22, pages 414\u2013422. Curran Associates,\nInc., 2009.\n\n[11] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based\nvector machines. 
Journal of Machine Learning Research, 2:265\u2013292, 2001.\n\n[12] Hal Daum\u00e9, III, Nikos Karampatziakis, John Langford, and Paul Mineiro. Logarithmic time\none-against-some. In Proceedings of the 34th International Conference on Machine\nLearning, volume 70 of Proceedings of Machine Learning Research, pages 923\u2013932, International\nConvention Centre, Sydney, Australia, 06\u201311 Aug 2017. PMLR.\n\n[13] Krzysztof Dembczy\u0144ski, Wojciech Kot\u0142owski, Willem Waegeman, R\u00f3bert Busa-Fekete, and\nEyke H\u00fcllermeier. Consistency of probabilistic classifier trees. In Machine Learning and\nKnowledge Discovery in Databases, pages 511\u2013526, Cham, 2016. Springer International\nPublishing.\n\n[14] Thomas G. Dietterich and Ghulum Bakiri. Solving Multiclass Learning Problems via\nError-Correcting Output Codes. Journal of Artificial Intelligence Research, 2:263\u2013286, 1995.\n\n[15] Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai Zhong, and Inderjit S. Dhillon.\nPD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel\nClassification. Proceedings of The 33rd International Conference on Machine Learning, 48:3069\u20133077,\n2016.\n\n[16] Sergio Escalera, Oriol Pujol, and Petia Radeva. Loss-Weighted Decoding for Error-Correcting\nOutput Coding. VISAPP (2), pages 117\u2013122, 2008.\n\n[17] Johannes F\u00fcrnkranz. Round robin classification. Journal of Machine Learning Research, 2:721\u2013747,\nMarch 2002.\n\n[18] Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for\nrecommendation, tagging, ranking & other missing label applications. In Proceedings of the\n22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,\nKDD \u201916, pages 935\u2013944, New York, NY, USA, 2016. ACM.\n\n[19] Kalina Jasinska and Nikos Karampatziakis. 
Log-time and log-space extreme classification.\narXiv preprint arXiv:1611.01964, 2016.\n\n[20] Yacine Jernite, Anna Choromanska, and David Sontag. Simultaneous learning of trees and\nrepresentations for extreme classification and density estimation. In Proceedings of the 34th\nInternational Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11\nAugust 2017, pages 1665\u20131674, 2017.\n\n[21] Ashraf M. Kibriya and Eibe Frank. An Empirical Comparison of Exact Nearest Neighbour\nAlgorithms. Knowledge Discovery in Databases: PKDD 2007, pages 140\u2013151, 2007.\n\n[22] Shengren Li and Nina Amenta. Brute-force k-nearest neighbors search on the GPU. In Similarity\nSearch and Applications - 8th International Conference, SISAP 2015, Glasgow, UK, October\n12-14, 2015, Proceedings, pages 259\u2013270, 2015.\n\n[23] Shu Lin and Daniel J. Costello. Error Control Coding, Second Edition. Prentice-Hall, Inc.,\nUpper Saddle River, NJ, USA, 2004.\n\n[24] Mingxia Liu, Daoqiang Zhang, Songcan Chen, and Hui Xue. Joint binary classifier learning\nfor ECOC-based multi-class classification. IEEE Transactions on Pattern Analysis and Machine\nIntelligence, 38(11):2335\u20132341, Nov. 2016.\n\n[25] Chris Mesterharm. A multi-class linear learning algorithm related to winnow. In Advances in\nNeural Information Processing Systems 13, 1999.\n\n[26] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model.\nIn Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics,\npages 246\u2013252. Society for Artificial Intelligence and Statistics, 2005.\n\n[27] Mohammad Norouzi, Ali Punjani, and David J. Fleet. Fast exact search in Hamming space\nwith multi-index hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence,\n36(6):1107\u20131119, 2014.\n\n[28] Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. 
Parabel:\nPartitioned label trees for extreme classification with application to dynamic search advertising.\nIn Proceedings of the 2018 World Wide Web Conference, WWW \u201918, 2018.\n\n[29] Yashoteja Prabhu and Manik Varma. FastXML: A fast, accurate and stable tree-classifier\nfor extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International\nConference on Knowledge Discovery and Data Mining, KDD \u201914, pages 263\u2013272, 2014.\n\n[30] Robert E. Schapire. Explaining AdaBoost. In Empirical Inference - Festschrift in Honor of\nVladimir N. Vapnik, pages 37\u201352, 2013.\n\n[31] Yukihiro Tagami. AnnexML: Approximate nearest neighbor search for extreme multi-label\nclassification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge\nDiscovery and Data Mining, KDD \u201917, pages 455\u2013464, New York, NY, USA, 2017. ACM.\n\n[32] Andrew J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum\ndecoding algorithm. IEEE Trans. Information Theory, 13(2):260\u2013269, 1967.\n\n[33] Jason Weston, Samy Bengio, and Nicolas Usunier. WSABIE: Scaling up to large vocabulary\nimage annotation. IJCAI International Joint Conference on Artificial Intelligence, pages 2764\u2013\n2770, 2011.\n\n[34] Jason Weston and Chris Watkins. Support vector machines for multi-class pattern recognition.\nIn ESANN, volume 99, pages 219\u2013224, 1999.\n\n[35] Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep Ravikumar, Inderjit Dhillon, and Eric Xing.\nPPDSparse: A parallel primal-dual sparse method for extreme classification. In Proceedings of\nthe 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,\nKDD \u201917, pages 545\u2013553, New York, NY, USA, 2017. ACM.\n\n[36] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label\nlearning with missing labels. 
In International Conference on Machine Learning, pages 593\u2013601,\n2014.\n\n[37] Bin Zhao and Eric P. Xing. Sparse Output Coding for Scalable Visual Recognition. International\nJournal of Computer Vision, 119(1):60\u201375, 2013.\n", "award": [], "sourceid": 3599, "authors": [{"given_name": "Itay", "family_name": "Evron", "institution": "Technion"}, {"given_name": "Edward", "family_name": "Moroshko", "institution": "Technion"}, {"given_name": "Koby", "family_name": "Crammer", "institution": "Technion"}]}