{"title": "AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 5820, "page_last": 5830, "abstract": "Extreme multi-label text classification (XMTC) is an important problem in the \nera of {\\it big data}, for tagging a given text with the most relevant multiple \nlabels from an extremely large-scale label set. XMTC can be found in many \napplications, such as item categorization, web page tagging, and news \nannotation.\nTraditionally most methods used bag-of-words (BOW) as inputs, ignoring word \ncontext as well as deep semantic information. Recent attempts to overcome the \nproblems of BOW by deep learning still suffer from 1) failing to capture the \nimportant subtext for each label and 2) lack of scalability against the huge \nnumber of labels.\nWe propose a new label tree-based deep learning model for XMTC, called \nAttentionXML, with two unique features: 1) a multi-label attention mechanism \nwith raw text as input, which allows to capture the most relevant part of text \nto each label; and 2) a shallow and wide probabilistic label tree (PLT), which \nallows to handle millions of labels, especially for \"tail labels\".\nWe empirically compared the performance of AttentionXML with those of eight \nstate-of-the-art methods over six benchmark datasets, including Amazon-3M with \naround 3 million labels. AttentionXML outperformed all competing methods \nunder all experimental settings.\nExperimental results also show that AttentionXML achieved the best performance \nagainst tail labels among label tree-based methods. 
The code and datasets are available at \url{http://github.com/yourh/AttentionXML}.", "full_text": "AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification

Ronghui You1, Zihan Zhang1, Ziye Wang2, Suyang Dai1, Hiroshi Mamitsuka5,6, Shanfeng Zhu1,3,4,*

1 Shanghai Key Lab of Intelligent Information Processing, School of Computer Science,
2 Centre for Computational Systems Biology, School of Mathematical Sciences,
3 Shanghai Institute of Artificial Intelligence Algorithms and ISTBI,
4 Key Lab of Computational Neuroscience and Brain-Inspired Intelligence (MOE), Fudan University, Shanghai, China;
5 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan;
6 Department of Computer Science, Aalto University, Espoo and Helsinki, Finland

{rhyou18,zhangzh17,zywang17,sydai16}@fudan.edu.cn
mami@kuicr.kyoto-u.ac.jp, zhusf@fudan.edu.cn

Abstract

Extreme multi-label text classification (XMTC) is an important problem in the era of big data: tagging a given text with the most relevant labels from an extremely large-scale label set. XMTC arises in many applications, such as item categorization, web page tagging, and news annotation. Traditionally, most methods used bag-of-words (BOW) inputs, ignoring word context and deep semantic information. Recent attempts to overcome the problems of BOW with deep learning still suffer from 1) failing to capture the important subtext for each label and 2) a lack of scalability to huge numbers of labels. 
We propose a new label tree-based deep learning model for XMTC, called AttentionXML, with two unique features: 1) a multi-label attention mechanism with raw text as input, which captures the most relevant part of the text for each label; and 2) a shallow and wide probabilistic label tree (PLT), which makes it possible to handle millions of labels, especially \"tail labels\". We empirically compared AttentionXML with eight state-of-the-art methods over six benchmark datasets, including Amazon-3M with around 3 million labels. AttentionXML outperformed all competing methods under all experimental settings. Experimental results also show that AttentionXML achieved the best performance on tail labels among label tree-based methods. The code and datasets are available at http://github.com/yourh/AttentionXML .

1 Introduction

Extreme multi-label text classification (XMTC) is a natural language processing (NLP) task: tagging each given text with its most relevant labels from an extremely large-scale label set. XMTC predicts multiple labels for a text, which differs from multi-class classification, where each instance has only one associated label. Recently, XMTC has become increasingly important due to the fast growth of data scale. Hundreds of thousands, even millions, of labels and samples can be found in various domains, such as item categorization in e-commerce, web page tagging, and news annotation, to name a few. XMTC poses great computational challenges for developing effective and efficient classifiers with limited computing resources, given the extremely large number of samples/labels and the large number of \"tail labels\" with very few positive samples.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Many methods have been proposed for addressing the challenges of XMTC. 
They can be categorized into four types: 1) 1-vs-All [3,4,30,31], 2) embedding-based [7,27], 3) instance tree-based [10,25] or label tree-based [11,13,24,28] and 4) deep learning-based methods [17] (see Appendix for more descriptions of these methods). The methods most related to our work are deep learning-based and label tree-based methods. A pioneering deep learning-based method is XML-CNN [17], which uses a convolutional neural network (CNN) and dynamic pooling to learn the text representation. XML-CNN, however, cannot capture the parts of the input text most relevant to each label, because the same text representation is used for all labels. Another type of deep learning-based method is sequence-to-sequence (Seq2Seq) learning, such as MLC2Seq [21], SGM [29] and SU4MLC [15]. These Seq2Seq methods use a recurrent neural network (RNN) to encode a given raw text and an attentive RNN as a decoder to generate predicted labels sequentially. However, the underlying assumption of these models is unreasonable, since in reality there is no order among labels in multi-label classification. In addition, the extensive computation required by existing deep learning-based methods makes it impractical to handle datasets with millions of labels.
To handle such extreme-scale datasets, label tree-based methods use a probabilistic label tree (PLT) [11] to partition labels, where each leaf in the PLT corresponds to an original label and each internal node corresponds to a pseudo-label (meta-label). By maximizing a lower bound approximation of the log likelihood, each linear binary classifier for a tree node can then be trained independently with only a small number of relevant samples [24]. Parabel [24] is a state-of-the-art label tree-based method using bag-of-words (BOW) features. 
This method constructs a balanced binary label tree by recursively partitioning nodes into two balanced clusters until the cluster size (the number of labels in each cluster) is less than a given value (e.g., 100). This produces a \"deep\" tree (with a large tree depth) for an extreme-scale dataset, which deteriorates performance due to an inaccurate approximation of the likelihood and errors accumulated and propagated along the tree. In addition, with balanced clustering and a large cluster size, many tail labels are grouped into clusters together with dissimilar labels, which reduces classification performance on tail labels. Another PLT-based method, ExtremeText [28], which is based on fastText [12], uses dense features instead of BOW. Note that ExtremeText ignores word order and thus context information, and it underperforms Parabel.
We propose a label tree-based deep learning model, AttentionXML, to address the current challenges of XMTC. AttentionXML uses raw text as its input, which carries richer semantic context information than BOW features. AttentionXML is designed to achieve high accuracy by using a BiLSTM (bidirectional long short-term memory) to capture long-distance dependencies among words, and a multi-label attention mechanism to capture the parts of the text most relevant to each label. Most state-of-the-art methods, such as DiSMEC [3] and Parabel [24], use only one representation for all labels, including many dissimilar (unrelated) tail labels. It is difficult for a single representation to serve so many dissimilar labels. With multi-label attention, AttentionXML represents a given text differently for each label, which is especially helpful for the many tail labels. In addition, by using a shallow and wide PLT with top-down level-wise model training, AttentionXML can handle extreme-scale datasets. 
Most recently, Bonsai [13] also uses shallow and diverse PLTs, removing the balance constraint in tree construction, which improves performance over Parabel. Bonsai, however, has high space complexity because of its linear classifiers, requiring, for example, 1TB of memory for extreme-scale datasets. Note that our idea was conceived independently of Bonsai, and we apply it in a deep learning-based method using deep semantic features rather than the BOW features used in Bonsai. Experimental results over six benchmark datasets, including Amazon-3M [19] with around 3 million labels and 2 million samples, show that AttentionXML outperformed other state-of-the-art methods with competitive time and space costs. The experimental results also show that AttentionXML is the best label tree-based method on tail labels.

2 AttentionXML

2.1 Overview

The main steps of AttentionXML are: (1) building a shallow and wide PLT (Figs. 1a and 1b); and (2) for each level d (d > 0) of the constructed PLT, training an attention-aware deep model AttentionXML_d with a BiLSTM and multi-label attention (Fig. 1c). Pseudocode for constructing the PLT and for training and prediction of AttentionXML is given in the Appendix.

Figure 1: Label tree-based deep model AttentionXML for XMTC. (a) An example of a PLT used in AttentionXML. (b) An example of the PLT building process with K = M = 8 = 2^3 and H = 3 for L = 8000. The numbers from left to right show the number of nodes at each level from top to bottom. The numbers in red show the nodes of T_h that are removed to obtain T_{h+1}. (c) Overview of the attention-aware deep model in AttentionXML with text (length T̂) as its input and predicted scores ẑ as its output. 
The x̂_i ∈ R^D̂ is the embedding of the i-th word (where D̂ is the embedding dimension), α ∈ R^{T̂×L} are the attention coefficients, and Ŵ1 and Ŵ2 are the parameters of the fully connected layer and the output layer.

2.2 Building a Shallow and Wide PLT

A PLT [10] is a tree with L leaves, where each leaf corresponds to an original label. Given a sample x, we assign a label z_n ∈ {0, 1} to each node n, indicating whether the subtree rooted at node n contains a leaf (original label) relevant to this sample. A PLT estimates the conditional probability P(z_n | z_{Pa(n)} = 1, x) at each node n. The marginal probability P(z_n = 1 | x) for each node n follows from the chain rule of probability:

P(z_n = 1 | x) = \prod_{i \in Path(n)} P(z_i = 1 | z_{Pa(i)} = 1, x)    (1)

where Pa(n) is the parent of node n and Path(n) is the set of nodes on the path from node n to the root (excluding the root).
As mentioned in the Introduction, a large tree height H (excluding the root and leaves) and a large cluster size M harm performance. In AttentionXML we therefore build a shallow (small H) and wide (small M) PLT T_H. First, we build an initial PLT, T_0, by the top-down hierarchical clustering used in Parabel [24], with a small cluster size M. In more detail, we represent each label by the normalized sum of the BOW features of the texts annotated with this label. The labels are then recursively partitioned into two smaller clusters, which correspond to internal tree nodes, by balanced k-means (k=2) until the number of labels in a cluster is smaller than M [24]. T_0 is then compressed into a shallow and wide PLT, i.e., T_H, which is a K(= 2^c)-way tree of height H. This compress operation is similar to the pruning strategy in some hierarchical multi-class classification methods [1, 2]. We first choose all parents of leaves as S_0 and then apply the compress operation H times, resulting in T_H. 
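Eq. (1) simply walks the tree from a node up to the root, multiplying conditional probabilities along the path. A minimal sketch (the tree encoding and the probability values are illustrative, not from the paper):

```python
def marginal_probability(node, parent, cond_prob):
    """P(z_n = 1 | x) by Eq. (1): product of P(z_i = 1 | z_Pa(i) = 1, x)
    over all nodes i on the path from n up to the root (root excluded)."""
    p = 1.0
    while node is not None:        # None marks "parent is the root"
        p *= cond_prob[node]
        node = parent[node]
    return p

# Toy PLT: node 0 is a child of the root; nodes 1 and 2 are children of 0.
parent = {0: None, 1: 0, 2: 0}
cond_prob = {0: 0.8, 1: 0.5, 2: 0.25}   # P(z_n = 1 | z_Pa(n) = 1, x)
print(marginal_probability(1, parent, cond_prob))  # 0.8 * 0.5 = 0.4
```

Because each factor conditions on the parent being relevant, pruning any node with a low score also prunes its entire subtree, which is what makes the level-wise training and beam search below efficient.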
The compress operation has three steps: in the h-th compress operation over T_{h-1}, we (1) choose the c-th ancestors of the nodes in S_{h-1} (for h < H) or the root (for h = H) as S_h, (2) remove the nodes between S_{h-1} and S_h, and (3) reset the nodes in S_h as the parents of the corresponding nodes in S_{h-1}. This finally results in a shallow and wide tree T_H. In practice we use M = K, so that each internal node except the root has no more than K children. Fig. 1b shows an example of building a PLT. More examples can be found in the Appendix.

2.3 Learning AttentionXML

Given a built PLT, training a deep model for nodes at a deeper level is more difficult, because those nodes have fewer positive examples. Training a single deep model for the nodes of all levels together is hard to optimize and harms performance, while offering only a marginal speedup. Thus we train AttentionXML level-wise as follows:

1. AttentionXML trains a single deep model for each level of the given PLT in a top-down manner. Note that labeling each level of the PLT is still a multi-label classification problem. For the nodes of the first level (children of the root), AttentionXML (named AttentionXML_1 for the first level) can be trained on these nodes directly.
2. AttentionXML_d for the d-th level (d > 1) of the given PLT is trained only on candidates g(x) for each sample x. 
Specifically, we sort the nodes of the (d − 1)-th level first by z_n (positives before negatives) and then by the scores predicted by AttentionXML_{d−1}, in descending order. We keep the top C nodes of the (d − 1)-th level and choose their children as g(x). This acts as a form of additional negative sampling and yields a more precise approximation of the log likelihood than using only nodes with positive parents.

3. During prediction, for the i-th sample, the predicted score ŷ_ij for the j-th label can be computed easily by the probability chain rule. For prediction efficiency, we use beam search [13, 24]: at the d-th (d > 1) level, we only predict scores for nodes whose parents are among the top C scored nodes of the (d − 1)-th level.

A deep model without a PLT can be regarded as a special case of AttentionXML, with a PLT consisting of only the root and L leaves.

2.4 Attention-Aware Deep Model

The attention-aware deep model in AttentionXML consists of five layers: 1) Word Representation Layer, 2) Bidirectional LSTM Layer, 3) Multi-label Attention Layer, 4) Fully Connected Layer and 5) Output Layer. Fig. 1c shows a schematic of the attention-aware deep model in AttentionXML.

2.4.1 Word Representation Layer

The input of AttentionXML is raw tokenized text of length T̂. Each word is represented by a deep semantic dense vector, called a word embedding [22]. In our experiments, we use pre-trained 300-dimensional GloVe [22] word embeddings as our initial word representation.

2.4.2 Bidirectional LSTM Layer

An RNN is a type of neural network with a memory state for processing sequence inputs. Traditional RNNs suffer from vanishing and exploding gradients during training [6]. Long short-term memory (LSTM) [8] was proposed to solve this problem. We use a bidirectional LSTM (BiLSTM) to capture both left- and right-side context (Fig. 
1c), where at each time step t the output ĥ_t is obtained by concatenating the forward output \overrightarrow{h}_t and the backward output \overleftarrow{h}_t.

2.4.3 Multi-Label Attention

Recently, attention mechanisms in neural networks have been used successfully in many NLP tasks, such as machine translation, machine comprehension, relation extraction and speech recognition [5, 18]. In XMTC, the most relevant context can be different for each label. AttentionXML computes a (linear) combination of the context vectors ĥ_i for each label through a multi-label attention mechanism, inspired by [16], to capture the various informative parts of a text. That is, the output of the multi-label attention layer m̂_j ∈ R^{2N̂} for the j-th label is obtained as:

m̂_j = \sum_{i=1}^{T̂} α_{ij} ĥ_i,    α_{ij} = \frac{\exp(ĥ_i ŵ_j)}{\sum_{t=1}^{T̂} \exp(ĥ_t ŵ_j)}    (2)

where α_{ij} is the normalized coefficient of ĥ_i and ŵ_j ∈ R^{2N̂} is the so-called attention parameter vector. Note that ŵ_j is different for each label.

Table 1: Datasets we used in our experiments.

Dataset          Ntrain     Ntest    D          L          L̄      L̂       W̄train    W̄test
EUR-Lex          15,449     3,865    186,104    3,956      5.30   20.79   1248.58   1230.40
Wiki10-31K       14,146     6,616    101,938    30,938     18.64  8.52    2484.30   2425.45
AmazonCat-13K    1,186,239  306,782  203,882    13,330     5.04   448.57  246.61    245.98
Amazon-670K      490,449    153,025  135,909    670,091    5.45   3.99    247.33    241.22
Wiki-500K        1,779,881  769,421  2,381,304  501,008    4.75   16.86   808.66    808.56
Amazon-3M        1,717,899  742,507  337,067    2,812,281  36.04  22.02   104.08    104.18

Ntrain: #training instances, Ntest: #test instances, D: #features, L: #labels, L̄: the average #labels per instance, L̂: the average #instances per label, W̄train: the average #words per training instance and W̄test: the average #words per test instance. 
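As a concrete illustration of the multi-label attention in Eq. (2), the computation can be written in plain NumPy (the dimensions are toy values; in the real model this runs inside the network and, with a PLT, only over candidate labels):

```python
import numpy as np

def multi_label_attention(H, W):
    """Eq. (2): H is (T, 2N) with one BiLSTM output h_i per time step;
    W is (2N, L) with one attention vector w_j per label.
    Returns M of shape (L, 2N): one attended representation m_j per label."""
    logits = H @ W                                # (T, L): scores h_i . w_j
    logits -= logits.max(axis=0, keepdims=True)   # for numerical stability
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=0, keepdims=True)     # softmax over time steps i
    return alpha.T @ H                            # m_j = sum_i alpha_ij h_i

rng = np.random.default_rng(0)
H = rng.standard_normal((7, 6))   # T = 7 time steps, 2N = 6
W = rng.standard_normal((6, 3))   # L = 3 labels
M = multi_label_attention(H, W)
print(M.shape)  # (3, 6)
```

Since each m̂_j is a convex combination of the BiLSTM outputs with label-specific weights, every label gets its own view of the text, which is the property the paper argues helps tail labels.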
The training/test split follows the original data source.

2.4.4 Fully Connected and Output Layer

AttentionXML has one (or two) fully connected layers and one output layer. The same parameter values are used across all labels in the fully connected (and output) layers, to emphasize the differences in attention among labels. Sharing the fully connected layer parameters among all labels also greatly reduces the number of parameters, which avoids overfitting and keeps the model small.

2.4.5 Loss Function

AttentionXML uses the binary cross-entropy loss, as in XML-CNN [17]. Since the number of labels per instance varies, we do not normalize the predicted probabilities, as is done in multi-class classification.

2.5 Initialization of AttentionXML Parameters

We initialize the parameters of AttentionXML_d (d > 1) with the parameters of the trained AttentionXML_{d−1}, except for the attention layers. This initialization helps the models of deeper levels converge quickly, improving the final accuracy.

2.6 Complexity Analysis

Without a PLT, the deep model can hardly handle extreme-scale datasets, because of the high time and space complexity of the multi-label attention mechanism: O(BLN̂T̂) time and O(BL(N̂ + T̂)) space per batch iteration, where B is the batch size. For a large number of labels (L > 100k), the time cost is huge, and the whole model cannot fit in the limited memory of GPUs. The time complexity of AttentionXML with a PLT is much smaller, although we need to train H + 1 different deep models: the label set size of AttentionXML_1 is only L/K^H, which is much smaller than L, and the number of candidate labels of AttentionXML_d (d > 1) is only C × K, which is again much smaller than L. 
Thus our efficient label tree-based AttentionXML can run even with limited GPU memory.

3 Experimental Results

3.1 Dataset

We used the six most common XMTC benchmark datasets (Table 1): three large-scale datasets (L ranging from 4K to 30K): EUR-Lex1 [20], Wiki10-31K2 [32], and AmazonCat-13K2 [19]; and three extreme-scale datasets (L ranging from 500K to 3M): Amazon-670K2 [19], Wiki-500K2 and Amazon-3M2 [19]. Note that both Wiki-500K and Amazon-3M have around two million training samples.

1http://www.ke.tu-darmstadt.de/resources/eurlex/eurlex.html
2http://manikvarma.org/downloads/XC/XMLRepository.html

Table 2: Hyperparameters we used in our experiments, practical computation time and model size.

Datasets        E   B    N̂    N̂fc      H  M=K  C    Train (hours)  Test (ms/sample)  Model Size (GB)
EUR-Lex         30  40   256  256      -  -    -    0.51           2.07              0.20
Wiki10-31K      30  40   256  256      -  -    -    1.27           4.53              0.62
AmazonCat-13K   10  200  512  512,256  -  -    -    13.11          1.63              0.63
Amazon-670K     10  200  512  512,256  3  8    160  13.90          5.27              5.52
Wiki-500K       5   200  512  512,256  1  64   15   19.55          2.46              3.11
Amazon-3M       5   200  512  512,256  3  8    160  31.67          5.92              16.14

E: the number of epochs; B: the batch size; N̂: the hidden unit size of the LSTM; N̂fc: the hidden unit sizes of the fully connected layers; H: the height of the PLT (excluding the root and leaves); M: the maximum cluster size; K: the parameter of the compress process, with M = K = 2^c; C: the number of parents of candidate nodes.

3.2 Evaluation Measures

We chose P@k (Precision at k) [10] as our evaluation metric for performance comparison, since P@k is widely used for evaluating XMTC methods:

P@k = \frac{1}{k} \sum_{l=1}^{k} y_{rank(l)}    (3)

where y ∈ {0, 1}^L is the ground-truth binary vector, and rank(l) is the index of the l-th highest predicted label. 
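Eq. (3) takes only a few lines to compute (an illustrative sketch; `y_score` stands for a model's predicted label scores):

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """P@k (Eq. 3): the fraction of the k highest-scored labels that are
    relevant according to the ground-truth binary vector y_true."""
    top = np.argsort(-y_score)[:k]   # indices of the k highest scores
    return y_true[top].sum() / k

y_true = np.array([1, 0, 1, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.7, 0.2, 0.6, 0.1])
print(precision_at_k(y_true, y_score, 1))  # 1.0 (top label 0 is relevant)
print(precision_at_k(y_true, y_score, 3))  # 2/3 (top labels 0, 1, 2; two relevant)
```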
Another common evaluation metric is N@k (normalized Discounted Cumulative Gain at k). Note that P@1 is equivalent to N@1. We also evaluated performance by N@k and confirmed that N@k showed the same trend as P@k; we thus omit the N@k results from the main text due to space limitations (see Appendix).

3.3 Competing Methods and Experimental Settings

We compared AttentionXML with the state-of-the-art and most representative XMTC methods (as implemented by the original authors): AnnexML3 (embedding), DiSMEC4 (1-vs-All), MLC2Seq5 (deep learning), XML-CNN2 (deep learning), PfastreXML2 (instance tree), Parabel2 (label tree), XT6 (ExtremeText; label tree) and Bonsai7 (label tree).
For each dataset, we used the most frequent words in the training set as a limited-size vocabulary (at most 500,000 words). Word embeddings were fine-tuned during training, except on EUR-Lex and Wiki10-31K. We truncated each text after 500 words for efficient training and prediction. We used dropout [26] to avoid overfitting, with a drop rate of 0.2 after the embedding layer and 0.5 after the BiLSTM. Our model was trained by Adam [14] with a learning rate of 1e-3. We also used SWA (stochastic weight averaging) [9] with a constant learning rate to enhance performance. Similar to Parabel [23] and Bonsai [13], we used an ensemble of three PLTs in AttentionXML. We also examined the performance of AttentionXML with only one PLT (without ensemble), called AttentionXML-1. On the three large-scale datasets, we used AttentionXML with a PLT consisting of only the root and L leaves (which can also be considered the deep model without a PLT). The other hyperparameters used in our experiments are shown in Table 2.

3.4 Performance Comparison

Table 3 shows the performance results of AttentionXML and the other competing methods by P@k over all six benchmark datasets. 
Following previous work on XMTC, we focus on top predictions by varying k over 1, 3 and 5 in P@k, resulting in 18 (= three k × six datasets) values of P@k for each method.

3https://s.yimg.jp/dl/docs/research_lab/annexml-0.0.1.zip
4https://sites.google.com/site/rohitbabbar/dismec
5https://github.com/JinseokNam/mlc2seq.git
6https://github.com/mwydmuch/extremeText
7https://github.com/xmc-aalto/bonsai

Table 3: Performance comparisons of AttentionXML and other competing methods over six benchmarks. Results with stars are taken directly from the Extreme Classification Repository. P@1 = N@1; "-" means the method could not be run on that dataset.

EUR-Lex
Method          P@1    P@3    P@5
AnnexML         79.66  64.94  53.52
DiSMEC          83.21  70.39  58.73
PfastreXML      73.14  60.16  50.54
Parabel         82.12  68.91  57.89
XT              79.17  66.80  56.09
Bonsai          82.30  69.55  58.35
MLC2Seq         62.77  59.06  51.32
XML-CNN         75.32  60.14  49.21
AttentionXML-1  85.49  73.08  61.10
AttentionXML    87.12  73.99  61.92

Wiki10-31K
Method          P@1    P@3    P@5
AnnexML         86.46  74.28  64.20
DiSMEC          84.13  74.72  65.94
PfastreXML*     83.57  68.61  59.10
Parabel         84.19  72.46  63.37
XT              83.66  73.28  64.51
Bonsai          84.52  73.76  64.69
MLC2Seq         80.79  58.59  54.66
XML-CNN         81.41  66.23  56.11
AttentionXML-1  87.05  77.78  68.78
AttentionXML    87.47  78.48  69.37

AmazonCat-13K
Method          P@1    P@3    P@5
AnnexML         93.54  78.36  63.30
DiSMEC          93.81  79.08  64.06
PfastreXML*     91.75  77.97  63.68
Parabel         93.02  79.14  64.51
XT              92.50  78.12  63.51
Bonsai          92.98  79.13  64.46
MLC2Seq         94.29  69.45  57.55
XML-CNN         93.26  77.06  61.40
AttentionXML-1  95.65  81.93  66.90
AttentionXML    95.92  82.41  67.31

Amazon-670K
Method          P@1    P@3    P@5
AnnexML         42.09  36.61  32.75
DiSMEC          44.78  39.72  36.17
PfastreXML*     36.84  34.23  32.09
Parabel         44.91  39.77  35.98
XT              42.54  37.93  34.63
Bonsai          45.58  40.39  36.60
MLC2Seq         -      -      -
XML-CNN         33.41  30.00  27.42
AttentionXML-1  45.66  40.67  36.94
AttentionXML    47.58  42.61  38.92

Wiki-500K
Method          P@1    P@3    P@5
AnnexML         64.22  43.15  32.79
DiSMEC          70.21  50.57  39.68
PfastreXML      56.25  37.32  28.16
Parabel         68.70  49.57  38.64
XT              65.17  46.32  36.15
Bonsai          69.26  49.80  38.83
MLC2Seq         -      -      -
XML-CNN         -      -      -
AttentionXML-1  75.07  56.49  44.41
AttentionXML    76.95  58.42  46.14

Amazon-3M
Method          P@1    P@3    P@5
AnnexML         49.30  45.55  43.11
DiSMEC*         47.34  44.96  42.80
PfastreXML*     43.83  41.81  40.09
Parabel         47.42  44.66  42.55
XT              42.20  39.28  37.24
Bonsai          48.45  45.65  43.49
MLC2Seq         -      -      -
XML-CNN         -      -      -
AttentionXML-1  49.08  46.04  43.88
AttentionXML    50.86  48.04  45.83

1) AttentionXML (with an ensemble of three PLTs) outperformed all eight competing methods by P@k. For example, for P@5, across all datasets AttentionXML is at least 4% higher than the second best method (Parabel on AmazonCat-13K). On Wiki-500K, AttentionXML is even more than 17% higher than the second best method (DiSMEC). AttentionXML also outperformed AttentionXML-1 (without ensemble), especially on the three extreme-scale datasets. This is because on the extreme-scale datasets the ensemble combines models with different PLTs, which reduces variance, whereas on the large-scale datasets all ensemble members use the same PLT (consisting of only the root and leaves). Note that AttentionXML-1 is much more efficient than AttentionXML, because it trains only one model without ensembling.
2) AttentionXML-1 outperformed all eight competing methods by P@k in all but one case. The performance improvement was especially notable on EUR-Lex, Wiki10-31K and Wiki-500K, which have longer texts than the other datasets (see Table 1). For example, for P@5, AttentionXML-1 achieved 44.41, 68.78 and 61.10 on Wiki-500K, Wiki10-31K and EUR-Lex, around 12%, 4% and 4% higher than the second best, DiSMEC, with 39.68, 65.94 and 58.73, respectively. 
This result highlights that longer texts carry more context information, from which multi-label attention can pick out the most relevant parts and extract the most important information for each label.

Figure 2: PSP@k of label tree-based methods.

3) Parabel, a method using PLTs, can be seen as combining the advantages of tree-based (PfastreXML) and 1-vs-All (DiSMEC) methods. It outperformed PfastreXML and achieved performance similar to DiSMEC (which is, however, much less efficient). ExtremeText (XT) is an online learning method with PLTs (similar to Parabel) that uses dense instead of sparse representations; it achieved slightly lower performance than Parabel. Bonsai, another method using PLTs, outperformed Parabel on all datasets except AmazonCat-13K. In addition, Bonsai achieved better performance than DiSMEC on Amazon-670K and Amazon-3M. This indicates that the shallow and diverse PLTs in Bonsai improve its performance. However, Bonsai needs much more memory than Parabel, for example, 1TB for the extreme-scale datasets. Note that AttentionXML-1, with only one shallow and wide PLT, still significantly outperformed both Parabel and Bonsai on all extreme-scale datasets, especially Wiki-500K.
4) MLC2Seq, a deep learning-based method, obtained the worst performance on the three large-scale datasets, probably because of its unreasonable label-order assumption. XML-CNN, another deep learning-based method, with its simple dynamic pooling, was much worse than the other competing methods except MLC2Seq. Note that both MLC2Seq and XML-CNN are unable to deal with datasets with millions of labels.
5) AttentionXML was the best of all competing methods on the three extreme-scale datasets (Amazon-670K, Wiki-500K and Amazon-3M). 
Although the improvement of AttentionXML-1 over the second and third best methods (Bonsai and DiSMEC) is rather small, AttentionXML-1 is much faster than DiSMEC and uses much less memory than Bonsai. The improvement of AttentionXML with an ensemble of three PLTs over Bonsai and DiSMEC is more significant, while it is still faster than DiSMEC and uses much less memory than Bonsai.
6) AnnexML, the state-of-the-art embedding-based method, reached the second best P@1 on Amazon-3M and Wiki10-31K, but not on the other datasets. The average number of labels per sample in Amazon-3M (36.04) and Wiki10-31K (18.64) is several times larger than in the other datasets (only around 5), which suggests that samples in these datasets are well annotated. In this case, embedding-based methods may acquire more complete information from the nearest samples by using kNN (k-nearest neighbors) and thus gain relatively good performance on such datasets.

3.5 Performance on Tail Labels

We examined the performance on tail labels by PSP@k (propensity scored precision at k) [10]:

PSP@k = \frac{1}{k} \sum_{l=1}^{k} \frac{y_{rank(l)}}{p_{rank(l)}}    (4)

where p_{rank(l)} is the propensity score [10] of label rank(l). Fig. 2 shows the results of the three label tree-based methods (Parabel, Bonsai and AttentionXML) on the three extreme-scale datasets. Due to space limitations, we report the PSP@k results of AttentionXML and all compared methods, including ProXML [4] (a state-of-the-art method on PSP@k), on all six benchmarks in the Appendix.
AttentionXML outperformed both Parabel and Bonsai in PSP@k on all datasets. AttentionXML uses a shallow and wide PLT, unlike Parabel; this result therefore indicates that the shallow and wide PLT in AttentionXML helps improve performance on tail labels. 
Additionally, multi-label attention in AttentionXML would also be effective for tail labels, because it captures the most important parts of the text for each label, while Bonsai uses the same BOW features for all labels.

Table 4: P@5 of XML-CNN, BiLSTM and AttentionXML (all without ensemble)

Dataset         XML-CNN   BiLSTM   AttentionXML     AttentionXML
                                   (BiLSTM+Att)     (BiLSTM+Att+SWA)
EUR-Lex         49.21     53.12    59.61            61.10
Wiki10-31K      56.21     59.55    66.51            68.78
AmazonCat-13K   61.40     63.57    66.29            66.90

Table 5: Performance of AttentionXML with varying numbers of trees.

           Amazon-670K            Wiki-500K              Amazon-3M
Trees   P@1    P@3    P@5     P@1    P@3    P@5     P@1    P@3    P@5
1       45.66  40.67  36.94   75.07  56.49  44.41   49.08  46.04  43.88
2       46.86  41.95  38.27   76.44  57.92  45.68   50.34  47.45  45.26
3       47.58  42.61  38.92   76.95  58.42  46.14   50.86  48.04  45.83
4       48.03  43.05  39.32   77.21  58.72  46.40   51.66  48.39  46.23

3.6 Ablation Analysis

To examine the impact of BiLSTM and multi-label attention, we also ran a model consisting of a BiLSTM, a max-pooling layer (instead of the attention layer of AttentionXML), and the fully connected layers (from XML-CNN). Table 4 shows the P@5 results on three large-scale datasets. BiLSTM outperformed XML-CNN on all three datasets, probably because it captures long-distance dependencies among words. AttentionXML (BiLSTM+Att) further outperformed both XML-CNN and BiLSTM, especially on EUR-Lex and Wiki10-31K, which have long texts. Compared with a simple dynamic pooling, multi-label attention clearly extracts the most important information for each label from long texts more easily. In addition, Table 4 shows that SWA has a favorable effect on prediction accuracy.

3.7 Impact of the Number of PLTs in AttentionXML

We examined the performance of ensembles with different numbers of PLTs in AttentionXML.
Table 5 shows the performance of AttentionXML with different numbers of label trees. Using more trees considerably improves prediction accuracy; however, more trees also require much more time for both training and prediction. It is thus a trade-off between performance and time cost.

3.8 Computation Time and Model Size

AttentionXML runs on 8 Nvidia GTX 1080Ti GPUs. Table 2 shows the computation time for training (hours) and testing (milliseconds per sample), as well as the model size (GB), of AttentionXML with only one PLT for each dataset. For an ensemble of several trees, AttentionXML can be trained and used for prediction on a single machine sequentially, or in a distributed setting simultaneously and efficiently (without any network communication).

4 Conclusion

We have proposed a new label tree-based deep learning model, AttentionXML, for XMTC, with two distinguishing features: a multi-label attention mechanism, which captures the parts of text most relevant to each label, and a shallow and wide PLT, which handles millions of labels efficiently and effectively. We examined the predictive performance of AttentionXML against eight state-of-the-art methods over six benchmark datasets, including three extreme-scale datasets. AttentionXML outperformed all other competing methods over all six datasets, particularly those with long texts. Furthermore, AttentionXML showed a clear performance advantage in predicting tail labels.

Acknowledgments

S. Z. is supported by National Natural Science Foundation of China (No. 61572139 and No. 61872094) and Shanghai Municipal Science and Technology Major Project (No. 2017SHZDZX01). R. Y., Z. Z., Z. W., S. Y. are supported by the 111 Project (No. B18015), the key project of Shanghai Science & Technology (No. 16JC1420402), Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01) and ZJLab. H.M.
has been supported in part by JST ACCEL [grant\nnumber JPMJAC1503], MEXT Kakenhi [grant numbers 16H02868 and 19H04169], FiDiPro by\nTekes (currently Business Finland) and AIPSE by Academy of Finland.\n\nReferences\n\n[1] R. Babbar, I. Partalas, E. Gaussier, and M. R. Amini. On \ufb02at versus hierarchical classi\ufb01cation\nin large-scale taxonomies. In Advances in neural information processing systems, pages 1824\u2013\n1832, 2013.\n\n[2] R. Babbar, I. Partalas, E. Gaussier, M.-R. Amini, and C. Amblard. Learning taxonomy adaptation\nin large-scale classi\ufb01cation. The Journal of Machine Learning Research, 17(1):3350\u20133386,\n2016.\n\n[3] R. Babbar and B. Sch\u00f6lkopf. DiSMEC: distributed sparse machines for extreme multi-label\nclassi\ufb01cation. In Proceedings of the Tenth ACM International Conference on Web Search and\nData Mining, pages 721\u2013729. ACM, 2017.\n\n[4] R. Babbar and B. Sch\u00f6lkopf. Data scarcity, robustness and extreme multi-label classi\ufb01cation.\n\nMachine Learning, pages 1\u201323, 2019.\n\n[5] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align\n\nand translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[6] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent\n\nis dif\ufb01cult. IEEE transactions on neural networks, 5(2):157\u2013166, 1994.\n\n[7] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain. Sparse local embeddings for extreme multi-\nlabel classi\ufb01cation. In Advances in Neural Information Processing Systems, pages 730\u2013738,\n2015.\n\n[8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u2013\n\n1780, 1997.\n\n[9] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson. Averaging weights leads\n\nto wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.\n\n[10] H. Jain, Y. Prabhu, and M. Varma. 
Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935–944. ACM, 2016.

[11] K. Jasinska, K. Dembczynski, R. Busa-Fekete, K. Pfannschmidt, T. Klerx, and E. Hullermeier. Extreme f-measure maximization using sparse probability estimates. In International Conference on Machine Learning, pages 1435–1444, 2016.

[12] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, 2017.

[13] S. Khandagale, H. Xiao, and R. Babbar. Bonsai: diverse and shallow trees for extreme multi-label classification. arXiv preprint arXiv:1904.08249, 2019.

[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[15] J. Lin, Q. Su, P. Yang, S. Ma, and X. Sun. Semantic-unit-based dilated convolution for multi-label text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4554–4564, 2018.

[16] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.

[17] J. Liu, W.-C. Chang, Y. Wu, and Y. Yang. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124. ACM, 2017.

[18] T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation.
In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, 2015.

[19] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 165–172. ACM, 2013.

[20] E. L. Mencia and J. Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 50–65. Springer, 2008.

[21] J. Nam, E. L. Mencía, H. J. Kim, and J. Fürnkranz. Maximizing subset accuracy with recurrent neural networks in multi-label classification. In Advances in Neural Information Processing Systems, pages 5413–5423, 2017.

[22] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[23] Y. Prabhu, A. Kag, S. Gopinath, K. Dahiya, S. Harsola, R. Agrawal, and M. Varma. Extreme multi-label learning with label features for warm-start tagging, ranking & recommendation. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 441–449. ACM, 2018.

[24] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 993–1002. International World Wide Web Conferences Steering Committee, 2018.

[25] Y. Prabhu and M. Varma. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 263–272.
ACM, 2014.\n\n[26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple\nway to prevent neural networks from over\ufb01tting. The Journal of Machine Learning Research,\n15(1):1929\u20131958, 2014.\n\n[27] Y. Tagami. AnnexML: approximate nearest neighbor search for extreme multi-label classi\ufb01-\ncation. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge\nDiscovery and Data Mining, pages 455\u2013464. ACM, 2017.\n\n[28] M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, and K. Dembczynski. A no-regret\ngeneralization of hierarchical softmax to extreme multi-label classi\ufb01cation. In Advances in\nNeural Information Processing Systems, pages 6355\u20136366, 2018.\n\n[29] P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang. SGM: sequence generation model for multi-\nlabel classi\ufb01cation. In Proceedings of the 27th International Conference on Computational\nLinguistics, pages 3915\u20133926, 2018.\n\n[30] I. E. Yen, X. Huang, W. Dai, P. Ravikumar, I. Dhillon, and E. Xing. PPDsparse: a parallel\nprimal-dual sparse method for extreme classi\ufb01cation. In Proceedings of the 23rd ACM SIGKDD\nInternational Conference on Knowledge Discovery and Data Mining, pages 545\u2013553. ACM,\n2017.\n\n[31] I. E.-H. Yen, X. Huang, P. Ravikumar, K. Zhong, and I. Dhillon. PD-Sparse: a primal and dual\nsparse approach to extreme multiclass and multilabel classi\ufb01cation. In International Conference\non Machine Learning, pages 3069\u20133077, 2016.\n\n[32] A. 
Zubiaga. Enhancing navigation on Wikipedia with social tags. arXiv preprint arXiv:1202.5469, 2012.