{"title": "Aggressive Sampling for Multi-class to Binary Reduction with Applications to Text Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 4159, "page_last": 4168, "abstract": "We address the problem of multi-class classification in the case where the number of classes is very large. We propose a double sampling strategy on top of a multi-class to binary reduction strategy, which transforms the original multi-class problem into a binary classification problem over pairs of examples. The aim of the sampling strategy is to overcome the curse of long-tailed class distributions exhibited in majority of large-scale multi-class classification problems and to reduce the number of pairs of examples in the expanded data. We show that this strategy does not alter the consistency of the empirical risk minimization principle defined over the double sample reduction. Experiments are carried out on DMOZ and Wikipedia collections with 10,000 to 100,000 classes where we show the efficiency of the proposed approach in terms of training and prediction time, memory consumption, and predictive performance with respect to state-of-the-art approaches.", "full_text": "Aggressive Sampling for Multi-class to Binary Reduction with Applications to Text Classification\n\nBikash Joshi\nUniv. Grenoble Alps, LIG, Grenoble, France\nbikash.joshi@imag.fr\n\nMassih-Reza Amini\nUniv. Grenoble Alps, LIG, Grenoble, France\nmassih-reza.amini@imag.fr\n\nIoannis Partalas\nExpedia EWE, Geneva, Switzerland\nipartalas@expedia.com\n\nFranck Iutzeler\nUniv. Grenoble Alps, LJK, Grenoble, France\nfranck.iutzeler@imag.fr\n\nYury Maximov\nLos Alamos National Laboratory and Skolkovo IST, USA and Moscow, Russia\nyury@lanl.gov\n\nAbstract\n\nWe address the problem of multi-class classification in the case where the number of classes is very large. 
We propose a double sampling strategy on top of a multi-class to binary reduction strategy, which transforms the original multi-class problem into a binary classification problem over pairs of examples. The aim of the sampling strategy is to overcome the curse of long-tailed class distributions exhibited in the majority of large-scale multi-class classification problems and to reduce the number of pairs of examples in the expanded data. We show that this strategy does not alter the consistency of the empirical risk minimization principle defined over the double sample reduction. Experiments are carried out on DMOZ and Wikipedia collections with 10,000 to 100,000 classes where we show the efficiency of the proposed approach in terms of training and prediction time, memory consumption, and predictive performance with respect to state-of-the-art approaches.\n\n1 Introduction\n\nLarge-scale multi-class or extreme classification involves problems with an extremely large number of classes, as found in text repositories such as Wikipedia, Yahoo! Directory (www.dir.yahoo.com), or Directory Mozilla DMOZ (www.dmoz.org); it has recently evolved as a popular branch of machine learning with many applications in tagging, recommendation and ranking. The most common and popular baseline in this case is the one-versus-all approach (OVA) [18], where one independent binary classifier is learned per class. Despite its simplicity, this approach suffers from two main limitations. First, it becomes computationally intractable when the number of classes grows large, which also slows down prediction. Second, it suffers from the class imbalance problem by construction.\n\nRecently, two main approaches have been studied to cope with these limitations. 
The first one, broadly divided into tree-based and embedding-based methods, has been proposed with the aim of reducing the effective space of labels in order to control the complexity of the learning problem. Tree-based methods [4, 3, 6, 7, 9, 21, 5, 15] rely on binary tree structures where each leaf corresponds to a class and inference is performed by traversing the tree from top to bottom, a binary classifier being used at each node to determine the child node to develop. These methods have logarithmic time complexity, with the drawback that it is challenging to find a balanced tree structure which can partition the class labels. Further, even though different heuristics have been developed to address the imbalance problem, these methods have to make several decisions prior to reaching a final category, which leads to error propagation and thus a decrease in accuracy. On the other hand, label embedding approaches [11, 5, 19] first project the label-matrix into a low-dimensional linear subspace and then use an OVA classifier. However, the low-rank assumption on the label-matrix is generally violated in the extreme multi-class classification setting, and these methods generally lead to high prediction error.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe second type of approach aims at reducing the original multi-class problem to a binary one by first expanding the original training set using a projection of pairs of observations and classes into a low-dimensional dyadic space, and then learning a single classifier to separate pairs made of examples and their true classes from pairs made of examples and other classes [1, 28, 16]. 
Although prediction in the new representation space is relatively fast, the construction of the dyadic training observations is generally time consuming and dominates the training and prediction times.\n\nContributions. In this paper, we propose a scalable multi-class classification method based on an aggressive double sampling of the dyadic output prediction problem. Instead of computing all possible dyadic examples, our proposed approach consists first in drawing a new training set of much smaller size from the original one by oversampling the smallest classes and sub-sampling the few big classes, in order to avoid the curse of long-tailed class distributions common in the majority of large-scale multi-class classification problems [2]. The second goal is to reduce the number of constructed dyadic examples. Our reduction strategy brings inter-dependency between the pairs containing the same observation and its true class in the original training set. Thus, we derive new generalization bounds using local fractional Rademacher complexity showing that even with a shift in the original class distribution and the inter-dependency between the pairs of examples, the empirical risk minimization principle over the transformation of the sampled training set remains consistent. We validate our approach by conducting a series of experiments on subsets of DMOZ and the Wikipedia collections with up to 100,000 target categories.\n\n2 A doubly-sampled multi-class to binary reduction strategy\n\nWe address the problem of mono-label multi-class classification defined on the joint space $\\mathcal{X} \\times \\mathcal{Y}$, where $\\mathcal{X} \\subseteq \\mathbb{R}^d$ is the input space and $\\mathcal{Y} = \\{1, \\ldots, K\\} \\doteq [K]$ is the output space, made of $K$ classes. Elements of $\\mathcal{X} \\times \\mathcal{Y}$ are denoted as $x^y = (x, y)$. 
Furthermore, we assume the training set $S = (x_i^{y_i})_{i=1}^m$ is made of $m$ i.i.d. example/class pairs distributed according to a fixed but unknown probability distribution $\\mathcal{D}$, and we consider a class of predictor functions $\\mathcal{G} = \\{g : \\mathcal{X} \\times \\mathcal{Y} \\to \\mathbb{R}\\}$. We define the instantaneous loss for predictor $g \\in \\mathcal{G}$ on example $x^y$ as:\n\n$$e(g, x^y) = \\frac{1}{K-1} \\sum_{y' \\in \\mathcal{Y} \\setminus \\{y\\}} \\mathbb{1}_{g(x^y) \\le g(x^{y'})}, \\tag{1}$$\n\nwhere $\\mathbb{1}_{\\pi}$ is the indicator function equal to 1 if the predicate $\\pi$ is true and 0 otherwise. Compared to the classical multi-class error, $e'(g, x^y) = \\mathbb{1}_{y \\ne \\operatorname{argmax}_{y' \\in \\mathcal{Y}} g(x^{y'})}$, the loss of (1) estimates the average number of classes, given any input data, that get a higher score from $g$ than the correct class. The loss (1) is hence a ranking criterion, and the multi-class SVM of [29] and AdaBoost.MR [24] optimize convex surrogate functions of this loss. It is also used in label ranking [12]. Our objective is to find a function $g \\in \\mathcal{G}$ with a small expected risk $R(g) = \\mathbb{E}_{x^y \\sim \\mathcal{D}}[e(g, x^y)]$, by minimizing the empirical error defined as the average number of training examples $x_i^{y_i} \\in S$ which, on average, are scored lower than $x_i^{y'}$, for $y' \\in \\mathcal{Y} \\setminus \\{y_i\\}$:\n\n$$\\tilde{R}_m(g, S) = \\frac{1}{m} \\sum_{i=1}^m e(g, x_i^{y_i}) = \\frac{1}{m(K-1)} \\sum_{i=1}^m \\sum_{y' \\in \\mathcal{Y} \\setminus \\{y_i\\}} \\mathbb{1}_{g(x_i^{y_i}) - g(x_i^{y'}) \\le 0}. \\tag{2}$$\n\n2.1 Binary reduction based on dyadic representations of examples and classes\n\nIn this work, we consider prediction functions of the form $g = f \\circ \\phi$, where $\\phi : \\mathcal{X} \\times \\mathcal{Y} \\to \\mathbb{R}^p$ is a projection of the input and the output space into a joint feature space of dimension $p$; and $f \\in \\mathcal{F} = \\{f : \\mathbb{R}^p \\to \\mathbb{R}\\}$ is a function that measures the adequacy between an observation $x$ and a class $y$ using their corresponding representation $\\phi(x^y)$. 
The projection function $\\phi$ is application-dependent and it can either be learned [28] or defined using some heuristics [27, 16].\n\nFurther, consider the following dyadic transformation\n\n$$T(S) = \\left( \\begin{cases} \\left(z_j = \\left(\\phi(x_i^k), \\phi(x_i^{y_i})\\right), \\tilde{y}_j = -1\\right) & \\text{if } k < y_i \\\\ \\left(z_j = \\left(\\phi(x_i^{y_i}), \\phi(x_i^k)\\right), \\tilde{y}_j = +1\\right) & \\text{elsewhere} \\end{cases} \\right)_{j \\doteq (i-1)(K-1)+k}, \\tag{3}$$\n\nwhere $j = (i-1)(K-1)+k$ with $i \\in [m]$, $k \\in [K-1]$; that expands a $K$-class labeled set $S$ of size $m$ into a binary labeled set $T(S)$ of size $N = m(K-1)$ (e.g. Figure 1 over a toy problem). With the class of functions\n\n$$\\mathcal{H} = \\{h : \\mathbb{R}^p \\times \\mathbb{R}^p \\to \\mathbb{R}; (\\phi(x^y), \\phi(x^{y'})) \\mapsto f(\\phi(x^y)) - f(\\phi(x^{y'})), f \\in \\mathcal{F}\\}, \\tag{4}$$\n\nthe empirical loss (Eq. (2)) can be rewritten as:\n\n$$\\tilde{R}_{T(S)}(h) = \\frac{1}{N} \\sum_{j=1}^N \\mathbb{1}_{\\tilde{y}_j h(z_j) \\le 0}. \\tag{5}$$\n\nHence, the minimization of Eq. (5) over the transformation $T(S)$ of a training set $S$ defines a binary classification over the pairs of dyadic examples. However, this binary problem takes as examples dependent random variables, as for each original example $x^y \\in S$, the $K-1$ pairs in $\\{(\\phi(x^y), \\phi(x^{y'})); \\tilde{y}\\} \\in T(S)$ all depend on $x^y$. In [16] this problem is studied by bounding the generalization error associated to (5) using the fractional Rademacher complexity proposed in [25]. In this work, we derive new generalization bounds based on the Local Rademacher Complexities introduced in [22], which involve second-order (i.e. variance) information inducing faster convergence rates (Theorem 1).\n\nFigure 1: A toy example depicting the transformation $T$ (Eq. (3)) applied to a training set $S$ of size $m = 4$ and $K = 4$.\n$S = (x_1^{y_1}, x_2^{y_2}, x_3^{y_3}, x_4^{y_4})$ is mapped by $T$ to:\n$(z_1 = (\\phi(x_1^{y_1}), \\phi(x_1^{y_2})), +1)$, $(z_2 = (\\phi(x_1^{y_1}), \\phi(x_1^{y_3})), +1)$, $(z_3 = (\\phi(x_1^{y_1}), \\phi(x_1^{y_4})), +1)$\n$(z_4 = (\\phi(x_2^{y_1}), \\phi(x_2^{y_2})), -1)$, $(z_5 = (\\phi(x_2^{y_2}), \\phi(x_2^{y_3})), +1)$, $(z_6 = (\\phi(x_2^{y_2}), \\phi(x_2^{y_4})), +1)$\n$(z_7 = (\\phi(x_3^{y_1}), \\phi(x_3^{y_3})), -1)$, $(z_8 = (\\phi(x_3^{y_2}), \\phi(x_3^{y_3})), -1)$, $(z_9 = (\\phi(x_3^{y_3}), \\phi(x_3^{y_4})), +1)$\n$(z_{10} = (\\phi(x_4^{y_1}), \\phi(x_4^{y_4})), -1)$, $(z_{11} = (\\phi(x_4^{y_2}), \\phi(x_4^{y_4})), -1)$, $(z_{12} = (\\phi(x_4^{y_3}), \\phi(x_4^{y_4})), -1)$\n\nOur analysis relies on the notion of graph covering introduced in [14] and defined as:\n\nDefinition 1 (Exact proper fractional cover of G, [14]). Let $G = (\\mathcal{V}, \\mathcal{E})$ be a graph. $\\mathcal{C} = \\{(\\mathcal{C}_k, \\omega_k)\\}_{k \\in [J]}$, for some positive integer $J$, with $\\mathcal{C}_k \\subseteq \\mathcal{V}$ and $\\omega_k \\in [0, 1]$ is an exact proper fractional cover of $G$, if: i) it is proper: $\\forall k$, $\\mathcal{C}_k$ is an independent set, i.e., there are no connections between vertices in $\\mathcal{C}_k$; ii) it is an exact fractional cover of $G$: $\\forall v \\in \\mathcal{V}$, $\\sum_{k : v \\in \\mathcal{C}_k} \\omega_k = 1$.\n\nThe weight $W(\\mathcal{C})$ of $\\mathcal{C}$ is given by $W(\\mathcal{C}) \\doteq \\sum_{k \\in [J]} \\omega_k$, and the minimum weight $\\chi^*(G) = \\min_{\\mathcal{C} \\in \\mathcal{K}(G)} W(\\mathcal{C})$ over the set $\\mathcal{K}(G)$ of all exact proper fractional covers of $G$ is the fractional chromatic number of $G$.\n\nFrom this statement, [14] extended Hoeffding\u2019s inequality and proposed large deviation bounds for sums of dependent random variables, which was the precursor of new generalisation bounds, including a Talagrand-type inequality for empirical processes in the dependent case presented in [22].\n\nFigure 2: The dependency graph $G = \\{1, \\ldots, 12\\}$ corresponding to the toy problem of Figure 1, where dependent nodes are connected by edges drawn as blue double lines. The exact proper fractional cover $\\mathcal{C}_1$, $\\mathcal{C}_2$ and $\\mathcal{C}_3$ is shown in dashed. The fractional chromatic number is in this case $\\chi^*(G) = K - 1 = 3$.\n\nWith the classes of functions $\\mathcal{G}$ and $\\mathcal{H}$ introduced previously, consider the parameterized family $\\mathcal{H}_r$ which, for $r > 0$, is defined as:\n\n$$\\mathcal{H}_r = \\{h : h \\in \\mathcal{H}, \\mathbb{V}[h] \\doteq \\mathbb{V}_{z, \\tilde{y}}[\\mathbb{1}_{\\tilde{y} h(z)}] \\le r\\},$$\n\nwhere $\\mathbb{V}$ denotes the variance. The fractional Rademacher complexity introduced in [25] underlies our analysis:\n\n$$\\mathfrak{R}_{T(S)}(\\mathcal{H}) \\doteq \\frac{2}{N} \\mathbb{E}_{\\xi} \\sum_{k \\in [K-1]} \\omega_k \\mathbb{E}_{\\mathcal{C}_k} \\sup_{h \\in \\mathcal{H}} \\sum_{\\alpha \\in \\mathcal{C}_k, z_\\alpha \\in T(S)} \\xi_\\alpha h(z_\\alpha),$$\n\nwith $(\\xi_i)_{i=1}^N$ a sequence of independent Rademacher variables verifying $P(\\xi_n = 1) = P(\\xi_n = -1) = \\frac{1}{2}$. Unless specified otherwise, we assume below that all $\\omega_k = 1$. Our first result, which bounds the generalization error of a function $h \\in \\mathcal{H}$, $R(h) = \\mathbb{E}_{T(S)}[\\tilde{R}_{T(S)}(h)]$, with respect to its empirical error $\\tilde{R}_{T(S)}(h)$ over a transformed training set, $T(S)$, and the fractional Rademacher complexity, $\\mathfrak{R}_{T(S)}(\\mathcal{H})$, is stated below.\n\nTheorem 1. Let $S = (x_i^{y_i})_{i=1}^m \\in (\\mathcal{X} \\times \\mathcal{Y})^m$ be a dataset of $m$ examples drawn i.i.d. according to a probability distribution $\\mathcal{D}$ over $\\mathcal{X} \\times \\mathcal{Y}$ and $T(S) = ((z_i, \\tilde{y}_i))_{i=1}^N$ the transformed set obtained as in Eq. (3). 
Then for any $1 > \\delta > 0$ and 0/1 loss $\\ell : \\{-1, +1\\} \\times \\mathbb{R} \\to [0, 1]$, with probability at least $(1 - \\delta)$ the following generalization bound holds for all $h \\in \\mathcal{H}_r$:\n\n$$R(h) \\le \\tilde{R}_{T(S)}(h) + \\mathfrak{R}_{T(S)}(\\ell \\circ \\mathcal{H}_r) + \\frac{5}{2} \\left( \\sqrt{\\mathfrak{R}_{T(S)}(\\ell \\circ \\mathcal{H}_r)} + \\sqrt{\\frac{r}{2}} \\right) \\sqrt{\\frac{\\log \\frac{1}{\\delta}}{m}} + \\frac{25}{48} \\frac{\\log \\frac{1}{\\delta}}{m}.$$\n\nThe proof is provided in the supplementary material, and it relies on the idea of splitting up the sum (5) into several parts, each part being a sum of independent variables.\n\n2.2 Aggressive Double Sampling\n\nEven though the previous multi-class to binary transformation $T$ with a proper projection function $\\phi$ allows us to redefine the learning problem in a dyadic feature space of dimension $p \\ll d$, the increased number of examples can lead to a large computational overhead. In order to cope with this problem, we propose a (\u03c0, \u03ba)-double subsampling of $T(S)$, which first aims to balance the presence of classes by constructing a new training set $S_\\pi$ from $S$ with probabilities $\\pi = (\\pi_k)_{k=1}^K$. The idea here is to overcome the curse of long-tailed class distributions exhibited in the majority of large-scale multi-class classification problems [2] by oversampling the smallest classes and sub-sampling the few big classes in $S$. The hyperparameters $\\pi$ are formally defined as $\\forall k, \\pi_k = P(x^y \\in S_\\pi \\mid x^y \\in S)$. In practice we set them inversely proportional to the size of each class in the original training set; $\\forall k, \\pi_k \\propto 1/\\mu_k$ where $\\mu_k$ is the proportion of class $k$ in $S$. The second aim is to reduce the number of dyadic examples, controlled by the hyperparameter $\\kappa$. The pseudo-code of this aggressive double sampling procedure, referred to as (\u03c0, \u03ba)-DS, is depicted below and it is composed of two main steps.\n\n1. For each class $k \\in \\{1, \\ldots, K\\}$, draw randomly a set $S_{\\pi_k}$ of examples from $S$ of that class with probability $\\pi_k$, and let $S_\\pi = \\bigcup_{k=1}^K S_{\\pi_k}$;\n2. For each example $x^y$ in $S_\\pi$, draw uniformly $\\kappa$ adversarial classes in $\\mathcal{Y} \\setminus \\{y\\}$.\n\nAlgorithm: (\u03c0, \u03ba)-DS\nInput: Labeled training set $S = (x_i^{y_i})_{i=1}^m$\ninitialization: $S_\\pi \\leftarrow \\emptyset$; $T_\\kappa(S_\\pi) \\leftarrow \\emptyset$;\nfor k = 1..K do\n    Draw randomly a set $S_{\\pi_k}$ of examples of class $k$ from $S$ with probability $\\pi_k$;\n    $S_\\pi \\leftarrow S_\\pi \\cup S_{\\pi_k}$;\nforall $x^y \\in S_\\pi$ do\n    Draw uniformly a set $\\mathcal{Y}_{x^y}$ of $\\kappa$ classes from $\\mathcal{Y} \\setminus \\{y\\}$  \u25b7 $\\kappa \\ll K$;\n    forall $k \\in \\mathcal{Y}_{x^y}$ do\n        if k < y then\n            $T_\\kappa(S_\\pi) \\leftarrow T_\\kappa(S_\\pi) \\cup (z = (\\phi(x^k), \\phi(x^y)), \\tilde{y} = -1)$;\n        else\n            $T_\\kappa(S_\\pi) \\leftarrow T_\\kappa(S_\\pi) \\cup (z = (\\phi(x^y), \\phi(x^k)), \\tilde{y} = +1)$;\nreturn $T_\\kappa(S_\\pi)$\n\nAfter this double sampling, we apply the transformation $T$ defined as in Eq. (3), leading to a set $T_\\kappa(S_\\pi)$ of size $\\kappa |S_\\pi| \\ll N$.\n\nIn Section 3, we will show that this procedure practically leads to dramatic improvements in terms of memory consumption, computational complexity, and a higher multi-class prediction accuracy when the number of classes is very large. The empirical loss over the transformation of the new subsampled training set $S_\\pi$ of size $M$, outputted by the (\u03c0, \u03ba)-DS algorithm, is:\n\n$$\\tilde{R}_{T_\\kappa(S_\\pi)}(h) = \\frac{1}{\\kappa M} \\sum_{(\\tilde{y}_\\alpha, z_\\alpha) \\in T_\\kappa(S_\\pi)} \\mathbb{1}_{\\tilde{y}_\\alpha h(z_\\alpha) \\le 0} = \\frac{1}{\\kappa M} \\sum_{x^y \\in S_\\pi} \\sum_{y' \\in \\mathcal{Y}_{x^y}} \\mathbb{1}_{g(x^y) - g(x^{y'}) \\le 0}, \\tag{6}$$\n\nwhich is essentially the same empirical risk as the one defined in Eq. (2) but taken with respect to the training set outputted by the (\u03c0, \u03ba)-DS algorithm. Our main result is the following theorem which bounds the generalization error of a function $h \\in \\mathcal{H}$ learned by minimizing $\\tilde{R}_{T_\\kappa(S_\\pi)}$.\n\nTheorem 2. Let $S = (x_i^{y_i})_{i=1}^m \\in (\\mathcal{X} \\times \\mathcal{Y})^m$ be a training set of size $m$ drawn i.i.d. according to a probability distribution $\\mathcal{D}$ over $\\mathcal{X} \\times \\mathcal{Y}$, and $T(S) = ((z_i, \\tilde{y}_i))_{i=1}^N$ the transformed set obtained with the transformation function $T$ defined as in Eq. (3). Let $S_\\pi \\subseteq S$, $|S_\\pi| = M$, be a training set outputted by the algorithm (\u03c0, \u03ba)-DS and $T(S_\\pi) \\subseteq T(S)$ its corresponding transformation. Then for any $1 > \\delta > 0$ with probability at least $(1 - \\delta)$ the following risk bound holds for all $h \\in \\mathcal{H}$:\n\n$$R(h) \\le \\alpha \\tilde{R}_{T_\\kappa(S_\\pi)}(h) + \\alpha \\mathfrak{R}_{T_\\kappa(S_\\pi)}(\\ell \\circ \\mathcal{H}) + \\alpha \\sqrt{\\frac{(K-1) \\log \\frac{2}{\\delta}}{2 M \\kappa}} + \\sqrt{\\frac{2 \\alpha \\log \\frac{4K}{\\delta}}{\\beta (m-1)}} + \\frac{7 \\beta \\log \\frac{4K}{\\delta}}{3 (m-1)}.$$\n\nFurthermore, for all functions in the class $\\mathcal{H}_r$, we have the following generalization bound that holds with probability at least $(1 - \\delta)$:\n\n$$R(h) \\le \\alpha \\tilde{R}_{T_\\kappa(S_\\pi)}(h) + \\alpha \\mathfrak{R}_{T_\\kappa(S_\\pi)}(\\ell \\circ \\mathcal{H}_r) + \\sqrt{\\frac{2 \\alpha \\log \\frac{4K}{\\delta}}{\\beta (m-1)}} + \\frac{7 \\beta \\log \\frac{4K}{\\delta}}{3 (m-1)} + \\frac{5 \\alpha}{2} \\left( \\sqrt{\\mathfrak{R}_{T_\\kappa(S_\\pi)}(\\ell \\circ \\mathcal{H}_r)} + \\sqrt{\\frac{r}{2}} \\right) \\sqrt{\\frac{(K-1) \\log \\frac{2}{\\delta}}{M \\kappa}} + \\frac{25 \\alpha}{48} \\frac{\\log \\frac{2}{\\delta}}{M},$$\n\nwhere $\\ell : \\{-1, +1\\} \\times \\mathbb{R} \\to [0, 1]$ is the instantaneous 0/1 loss, $\\alpha = \\max_{y : 1 \\le y \\le K} \\eta_y / \\pi_y$, $\\beta = \\max_{y : 1 \\le y \\le K} 1/\\pi_y$ and $\\eta_y > 0$ is the proportion of class $y$ in $S$.\n\nThe proof is provided in 
the supplementary material. This theorem hence paves the way for the consistency of the empirical risk minimization principle [26, Th. 2.1, p. 38] defined over the double sample reduction strategy we propose.\n\n2.3 Prediction with Candidate Selection\n\nThe prediction is carried out in the dyadic feature space, by first considering the pairs constituted by a test observation and all the classes, and then choosing the class that leads to the highest score by the learned classifier. In the large scale scenario, computing the feature representations for all classes may require a huge amount of time. To overcome this problem we sample over classes by choosing just those that are the nearest to a test example, based on its distance to the class centroids. Here we propose to consider class centroids as the average of vectors within that class. Note that class centroids are computed once in the preliminary projection of training examples and classes in the dyadic feature space and thus represent no additional computation at this stage. The algorithm below presents the pseudocode of this candidate based selection strategy (see footnote 1).\n\nAlgorithm: Prediction with Candidate Selection\nInput: Unlabeled test set $\\mathcal{T}$; Learned function $f^* : \\mathbb{R}^p \\to \\mathbb{R}$;\ninitialization: $\\Omega \\leftarrow \\emptyset$;\nforall $x \\in \\mathcal{T}$ do\n    Select $\\mathcal{Y}_x \\subseteq \\mathcal{Y}$ candidate set of $q$ nearest-centroid classes;\n    $\\Omega \\leftarrow \\Omega \\cup \\operatorname{argmax}_{k \\in \\mathcal{Y}_x} f^*(\\phi(x^k))$;\nreturn predicted classes $\\Omega$\n\n3 Experiments\n\nIn this section, we provide an empirical evaluation of the proposed reduction approach with the (\u03c0, \u03ba)-DS sampling strategy for large-scale multi-class classification of document collections. First, we present the mapping $\\phi : \\mathcal{X} \\times \\mathcal{Y} \\to \\mathbb{R}^p$. 
Then, we provide a statistical and computational comparison of our method with state-of-the-art large-scale approaches on popular datasets.\n\n3.1 A joint example/class representation for text classification\n\nThe particularity of text classification is that documents are represented in a vector space induced by the vocabulary of the corresponding collection [23]. Hence each class can be considered as a mega-document, constituted by the concatenation of all of the documents in the training set belonging to it, and a simple feature mapping of examples and classes can be defined over their common words. Here we used $p = 10$ features inspired from learning to rank [17] by treating a class and a document as a document and a query, respectively (Table 1). All features except feature 9, which is the distance of an example $x$ to the centroid of all examples of a particular class $y$, are classical. In addition to its predictive interest, the latter is also used in prediction for performing candidate preselection. Note that for other large-scale multi-class classification applications like recommendation with an extremely large number of offer categories or image classification, the same kind of mapping can either be learned or defined using their characteristics [27, 28].\n\nFootnote 1: The number of classes pre-selected can be tuned to offer a prediction time/accuracy tradeoff if the prediction time is more critical.\n\nFeatures in the joint example/class representation $\\phi(x^y)$:\n1. $\\sum_{t \\in y \\cap x} \\log(1 + y_t)$\n2. $\\sum_{t \\in y \\cap x} \\log\\left(1 + \\frac{l_S}{F_t}\\right)$\n3. $\\sum_{t \\in y \\cap x} I_t$\n4. $\\sum_{t \\in y \\cap x} \\frac{y_t}{|y|} \\cdot I_t$\n5. $\\sum_{t \\in y \\cap x} \\log\\left(1 + \\frac{y_t}{|y|}\\right)$\n6. $\\sum_{t \\in y \\cap x} \\log\\left(1 + \\frac{y_t}{|y|} \\cdot I_t\\right)$\n7. $\\sum_{t \\in y \\cap x} \\log\\left(1 + \\frac{y_t}{|y|} \\cdot \\frac{l_S}{F_t}\\right)$\n8. $\\sum_{t \\in y \\cap x} 1$\n9. $d(x^y, \\mathrm{centroid}(y))$\n10. $\\mathrm{BM25} = \\sum_{t \\in y \\cap x} I_t \\cdot \\frac{2 \\times y_t}{y_t + (0.25 + 0.75 \\cdot \\mathrm{len}(y)/\\mathrm{avg}(\\mathrm{len}(y)))}$\n\nTable 1: Joint example/class representation for text classification, where $t \\in y \\cap x$ are terms that are present in both the class $y$'s mega-document and document $x$. $\\mathcal{V}$ represents the set of distinct terms within $S$, $x_t$ is the frequency of term $t$ in $x$, $y_t = \\sum_{x \\in y} x_t$, $|y| = \\sum_{t \\in \\mathcal{V}} y_t$, $F_t = \\sum_{x \\in S} x_t$, and $l_S = \\sum_{t \\in \\mathcal{V}} F_t$. Finally, $I_t$ is the inverse document frequency of term $t$, len(y) is the number of terms of documents in class $y$, and avg(len(y)) is the average of document lengths for all the classes.\n\n3.2 Experimental Setup\n\nDatasets. We evaluate the proposed method using popular datasets from the Large Scale Hierarchical Text Classification challenge (LSHTC) 1 and 2 [20]. These datasets are provided in a pre-processed format using stop-word removal and stemming. Various characteristics of these datasets, including the statistics of the train, test and heldout sets, are listed in Table 2. Since the datasets used in the LSHTC2 challenge were in multi-label format, we converted them to multi-class format by replicating the instances belonging to different class labels. Also, for the largest dataset (WIKI-large) used in the LSHTC2 challenge, we used samples with 50,000 and 100,000 classes. 
The smaller dataset of the LSHTC2 challenge is named WIKI-Small, whereas the two 50K and 100K samples of the large dataset are named WIKI-50K and WIKI-100K in our result section.\n\nDatasets | # of classes, K | Train Size | Test Size | Heldout Size | Dimension, d\nLSHTC1 | 12294 | 126871 | 31718 | 5000 | 409774\nDMOZ | 27875 | 381149 | 95288 | 34506 | 594158\nWIKI-Small | 36504 | 796617 | 199155 | 5000 | 380078\nWIKI-50K | 50000 | 1102754 | 276939 | 5000 | 951558\nWIKI-100K | 100000 | 2195530 | 550133 | 5000 | 1271710\n\nTable 2: Characteristics of the datasets used in our experiments\n\nBaselines. We compare the proposed approach (see footnote 2), denoted as the sampling strategy by (\u03c0, \u03ba)-DS, with popular baselines listed below:\n\n\u2022 OVA: LibLinear [10] implementation of one-vs-all SVM.\n\u2022 M-SVM: LibLinear implementation of multi-class SVM proposed in [8].\n\u2022 RecallTree [9]: A recent tree based multi-class classifier implemented in Vowpal Wabbit.\n\u2022 FastXML [21]: An extreme multi-class classification method which performs partitioning in the feature space for faster prediction.\n\u2022 PfastReXML [13]: Tree ensemble based extreme classifier for multi-class and multilabel problems.\n\u2022 PD-Sparse [30]: A recent approach which uses multi-class loss with $\\ell_1$-regularization.\n\nFootnote 2: Source code and datasets can be found in the following repository https://github.com/bikash617/Aggressive-Sampling-for-Multi-class-to-BinaryReduction\n\nReferring to the work [30], we did not consider other recent methods SLEEC [5] and LEML [31] in our experiments, since they have been shown to be consistently outperformed by the above mentioned state-of-the-art approaches.\n\nData | Measure | OVA | M-SVM | RecallTree | FastXML | PfastReXML | PD-Sparse | (\u03c0, \u03ba)-DS\nLSHTC1 (m = 163589, d = 409774, K = 12294) | train time | 23056s | 48313s | 701s | 8564s | 3912s | 5105s | 321s\nLSHTC1 | predict time | 328s | 314s | 21s | 339s | 164s | 67s | 544s\nLSHTC1 | total memory | 40.3G | 40.3G | 122M | 470M | 471M | 10.5G | 2G\nLSHTC1 | Accuracy | 44.1% | 36.4% | 18.1% | 39.3% | 39.8% | 45.7% | 37.4%\nLSHTC1 | MaF1 | 27.4% | 18.8% | 3.8% | 21.3% | 22.4% | 27.7% | 26.5%\nDMOZ (m = 510943, d = 594158, K = 27875) | train time | 180361s | 212356s | 2212s | 14334s | 15492s | 63286s | 1060s\nDMOZ | predict time | 2797s | 3981s | 47s | 424s | 505s | 482s | 2122s\nDMOZ | total memory | 131.9G | 131.9G | 256M | 1339M | 1242M | 28.1G | 5.3G\nDMOZ | Accuracy | 37.7% | 32.2% | 16.9% | 33.4% | 33.7% | 40.8% | 27.8%\nDMOZ | MaF1 | 22.2% | 14.3% | 1.75% | 15.1% | 15.9% | 22.7% | 20.5%\nWIKI-Small (m = 1000772, d = 380078, K = 36504) | train time | 212438s | >4d | 1610s | 10646s | 21702s | 16309s | 1290s\nWIKI-Small | predict time | 2270s | NA | 24s | 453s | 871s | 382s | 2577s\nWIKI-Small | total memory | 109.1G | 109.1G | 178M | 949M | 947M | 12.4G | 3.6G\nWIKI-Small | Accuracy | 15.6% | NA | 7.9% | 11.1% | 12.1% | 15.6% | 21.5%\nWIKI-Small | MaF1 | 8.8% | NA | <1% | 4.6% | 5.63% | 9.91% | 13.3%\nWIKI-50K (m = 1384693, d = 951558, K = 50000) | train time | NA | NA | 4188s | 30459s | 48739s | 41091s | 3723s\nWIKI-50K | predict time | NA | NA | 45s | 1110s | 2461s | 790s | 4083s\nWIKI-50K | total memory | 330G | 330G | 226M | 1327M | 1781M | 35G | 5G\nWIKI-50K | Accuracy | NA | NA | 17.9% | 25.8% | 27.3% | 33.8% | 33.4%\nWIKI-50K | MaF1 | NA | NA | 5.5% | 14.6% | 16.3% | 23.4% | 24.5%\nWIKI-100K (m = 2750663, d = 1271710, K = 100000) | train time | NA | NA | 8593s | 42359s | 73371s | 155633s | 9264s\nWIKI-100K | predict time | NA | NA | 90s | 1687s | 3210s | 3121s | 20324s\nWIKI-100K | total memory | 1017G | 1017G | 370M | 2622M | 2834M | 40.3G | 9.8G\nWIKI-100K | Accuracy | NA | NA | 8.4% | 15% | 16.1% | 22.2% | 25%\nWIKI-100K | MaF1 | NA | NA | 1.4% | 8% | 9% | 15.1% | 17.8%\n\nTable 3: Comparison of the result of various baselines in terms of time, memory, accuracy, and macro F1-measure\n\nPlatform and Parameters. In all of our experiments, we used a machine with an Intel Xeon 2.60GHz processor with 256 GB of RAM. 
Each of these methods requires tuning of various hyper-parameters that influence its performance. For each method, we tuned the hyperparameters over a heldout set and used the combination which gave the best predictive performance. The list of hyperparameters used for the results we obtained is reported in the supplementary material (Appendix B).\n\nEvaluation Measures. Different approaches are evaluated over the test sets using accuracy and the macro F1 measure (MaF1), which is the harmonic average of macro precision and macro recall; higher MaF1 thus corresponds to better performance. As opposed to accuracy, the macro F1 measure is not affected by the class imbalance problem inherent to multi-class classification, and is commonly used as a robust measure for comparing predictive performance of classification methods.\n\n4 Results\n\nThe parameters of the datasets along with the results for compared methods are shown in Table 3. The results are provided in terms of train and predict times, total memory usage, and predictive performance measured with accuracy and macro F1-measure (MaF1). For better visualization and comparison, we plot the same results as bar plots in Fig. 3, keeping only the best five methods while comparing the total runtime and memory usage. First, we observe that the tree based approaches (FastXML, PfastReXML and RecallTree) have worse predictive performance compared to the other methods. This is due to the fact that the prediction error made at the top level of the tree cannot be corrected at lower levels, also known as the cascading effect. Even though they have lower runtime and memory usage, they suffer from this side effect.\n\nFor large scale collections (WIKI-Small, WIKI-50K and WIKI-100K), the solvers with competitive predictive performance are OVA, M-SVM, PD-Sparse and (\u03c0, \u03ba)-DS. 
However, standard OVA and M-SVM have a complexity that grows linearly with K; thus the total runtime and memory usage explode on those datasets, making them impractical. For instance, on the Wiki large dataset sample of 100K classes, the memory consumption of both approaches exceeds one terabyte and they take several days to complete the training. Furthermore, on this dataset and the second largest Wikipedia collection (WIKI-50K and WIKI-100K) the proposed approach is highly competitive in terms of time, total memory and both performance measures compared to all the other approaches.\n\nFigure 3: Comparisons in Total (Train and Test) Time (min.), Total Memory usage (GB), and MaF1 of the five best performing methods on LSHTC1, DMOZ, WIKI-Small, WIKI-50K and WIKI-100K.\n\nThese results suggest that the method least affected by long-tailed class distributions is (\u03c0, \u03ba)-DS, mainly for two reasons: first, the sampling tends to make the training set balanced and second, the reduced binary dataset contains a similar number of positive and negative examples. Hence, for the proposed approach, there is an improvement in both accuracy and MaF1 measures. The recent PD-Sparse method also enjoys a competitive predictive performance but it requires storing intermediary weight vectors during optimization, which prevents it from scaling well. The PD-Sparse solver provides an option for hashing leading to lower memory usage during training, which we used in the experiments; however, the memory usage is still significantly high for large datasets and at the same time this option slows down the training process considerably. 
Overall, among the methods\nwith competitive predictive performance, (\u03c0, \u03ba)-DS seems to offer the best runtime and memory\nusage; its runtime is even competitive with most of the tree-based methods, so that it provides the best\ncompromise among the compared methods across the three criteria of time, memory and predictive performance.\n\n5 Conclusion\n\nWe presented a new method for reducing a multiclass classification problem to binary classification.\nWe employ a similarity-based feature representation for classes and examples and a double sampling\nstochastic scheme for the reduction process. Even though the sampling scheme shifts the distribution\nof classes and the reduction of the original problem to a binary classification problem introduces\ninter-dependency between the dyadic examples, we provide generalization error bounds suggesting\nthat the Empirical Risk Minimization principle over the transformation of the sampled training set\nremains consistent. Furthermore, the characteristics of the algorithm contribute to its excellent\nperformance in terms of memory usage and total runtime and make the proposed approach highly\nsuitable for scenarios with a large number of classes.\n\nAcknowledgments\n\nThis work has been partially supported by the LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01)\nfunded by the French program Investissement d\u2019avenir, and by the U.S. Department of Energy\u2019s\nOffice of Electricity as part of the DOE Grid Modernization Initiative.\n\n[Figure 3 bar plots. Panels: LSHTC1, DMOZ, WIKI-Small, WIKI-50K, WIKI-100K. Axes: Time (min.), Total Memory (GB), Accuracy (%), MaF (%). Legend: RecallTree, FastXML, PfastReXML, PD-Sparse, Proposed (\u03c0, \u03ba)-DS.]\n\nReferences\n\n[1] Naoki Abe, Bianca Zadrozny, and John Langford. An iterative method for multi-class cost-sensitive learning. 
In Proceedings of the 10th ACM SIGKDD, KDD \u201904, pages 3\u201311, 2004.\n\n[2] Rohit Babbar, Cornelia Metzig, Ioannis Partalas, Eric Gaussier, and Massih R. Amini. On power law distributions in large-scale taxonomies. SIGKDD Explorations, 16(1), 2014.\n\n[3] Samy Bengio, Jason Weston, and David Grangier. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems, pages 163\u2013171, 2010.\n\n[4] Alina Beygelzimer, John Langford, and Pradeep Ravikumar. Error-correcting tournaments. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, ALT\u201909, pages 247\u2013262, 2009.\n\n[5] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, pages 730\u2013738, 2015.\n\n[6] Anna Choromanska, Alekh Agarwal, and John Langford. Extreme multi class classification. In NIPS Workshop: eXtreme Classification, submitted, 2013.\n\n[7] Anna Choromanska and John Langford. Logarithmic time online multiclass prediction. CoRR, abs/1406.1822, 2014.\n\n[8] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265\u2013292, 2002.\n\n[9] Hal Daume III, Nikos Karampatziakis, John Langford, and Paul Mineiro. Logarithmic time one-against-some. arXiv preprint arXiv:1606.04988, 2016.\n\n[10] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. J. Mach. Learn. Res., 9:1871\u20131874, 2008.\n\n[11] Daniel J Hsu, Sham M Kakade, John Langford, and Tong Zhang. Multi-label prediction via compressed sensing. In Advances in Neural Information Processing Systems 22 (NIPS), pages 772\u2013780, 2009.\n\n[12] Eyke H\u00fcllermeier and Johannes F\u00fcrnkranz. 
On minimizing the position error in label ranking. In Machine Learning: ECML 2007, pages 583\u2013590. Springer, 2007.\n\n[13] Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935\u2013944. ACM, 2016.\n\n[14] S. Janson. Large deviations for sums of partly dependent random variables. Random Structures and Algorithms, 24(3):234\u2013248, 2004.\n\n[15] Kalina Jasinska and Nikos Karampatziakis. Log-time and log-space extreme classification. arXiv preprint arXiv:1611.01964, 2016.\n\n[16] Bikash Joshi, Massih-Reza Amini, Ioannis Partalas, Liva Ralaivola, Nicolas Usunier, and \u00c9ric Gaussier. On binary reduction of large-scale multiclass classification problems. In Advances in Intelligent Data Analysis XIV - 14th International Symposium, IDA 2015, pages 132\u2013144, 2015.\n\n[17] Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong, and Hang Li. Letor: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of SIGIR 2007 workshop on learning to rank for information retrieval, pages 3\u201310, 2007.\n\n[18] Ana Carolina Lorena, Andr\u00e9 C. Carvalho, and Jo\u00e3o M. Gama. A review on the combination of binary classifiers in multiclass problems. Artif. Intell. Rev., 30(1-4):19\u201337, 2008.\n\n[19] Paul Mineiro and Nikos Karampatziakis. Fast label embeddings via randomized linear algebra. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part I, pages 37\u201351, 2015.\n\n[20] I. Partalas, A. Kosmopoulos, N. Baskiotis, T. Artieres, G. Paliouras, E. Gaussier, I. Androutsopoulos, M.-R. Amini, and P. Galinari. LSHTC: A Benchmark for Large-Scale Text Classification. 
ArXiv e-prints, March 2015.\n\n[21] Yashoteja Prabhu and Manik Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 263\u2013272. ACM, 2014.\n\n[22] Liva Ralaivola and Massih-Reza Amini. Entropy-based concentration inequalities for dependent variables. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 2436\u20132444, 2015.\n\n[23] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613\u2013620, November 1975.\n\n[24] Robert E Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine learning, 37(3):297\u2013336, 1999.\n\n[25] Nicolas Usunier, Massih-Reza Amini, and Patrick Gallinari. Generalization error bounds for classifiers trained with interdependent data. In Advances in Neural Information Processing Systems 18 (NIPS), pages 1369\u20131376, 2005.\n\n[26] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.\n\n[27] Maksims Volkovs and Richard S. Zemel. Collaborative ranking with 17 parameters. In Advances in Neural Information Processing Systems 25, pages 2294\u20132302, 2012.\n\n[28] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2011.\n\n[29] Jason Weston and Chris Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, 1998.\n\n[30] Ian EH Yen, Xiangru Huang, Kai Zhong, Pradeep Ravikumar, and Inderjit S Dhillon. Pd-sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. 
In Proceedings of the 33rd International Conference on Machine Learning, 2016.\n\n[31] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In International Conference on Machine Learning, pages 593\u2013601, 2014.\n", "award": [], "sourceid": 2192, "authors": [{"given_name": "Bikash", "family_name": "Joshi", "institution": "University of Grenoble Alpes"}, {"given_name": "Massih R.", "family_name": "Amini", "institution": "University Grenoble Alps"}, {"given_name": "Ioannis", "family_name": "Partalas", "institution": "Expedia LPS Geneva"}, {"given_name": "Franck", "family_name": "Iutzeler", "institution": "Univ. Grenoble Alpes"}, {"given_name": "Yury", "family_name": "Maximov", "institution": "Los Alamos National Laboratory and Skolkovo Institute of Science and Technology"}]}