NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 3111
Title: AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification

Reviewer 1

========= I have read the author rebuttal and the other reviews. I maintain my rating; however, as I pointed out in my review, I would ask the authors to add more clarity to the method section. =========

This paper presents a label-tree and deep-learning based method for solving extreme classification problems. The proposed method essentially learns a Parabel-like label tree (but shallow and wide) and then, instead of using linear classifiers, uses an attention-based neural network to send training points to the respective nodes.

In terms of clarity, the paper could certainly be improved, particularly the method section. From the current text it is very hard to figure out what exactly the proposed method is doing. Nevertheless, the proposed method outperforms a very hard-to-beat baseline and hence would be an important contribution to the extreme classification community.

Another area where I felt the paper could be improved is the explanation of why the method works so well. Is it because the linear classifiers used in Parabel were not powerful enough to classify the points correctly, or is it that they simply used bag-of-words features and ignored the semantic meaning of the sentence? Finally, it is also not clear to me why the proposed method performs well on tail labels.

Reviewer 2

Originality: This is a very interesting algorithmic contribution. The introduced method gets state-of-the-art results with reasonable computational resources. I reviewed a former version of this paper for another conference and have to admit that the new version is significantly improved, mainly because the authors have succeeded in decreasing the computational cost of the attention-based deep network by using probabilistic label trees.

Quality: The method is sound and the empirical analysis is of high quality. The paper does not have any theoretical contribution, but none is necessary for this kind of work.

Clarity: The paper is in general clearly written. However, the authors could do a better job of describing the methods:
- The tree building method seems to be very simple, but I am not sure whether I have understood all the details. Pseudocode would help a lot.
- Similarly, it is not clear enough how the underlying idea of probabilistic label trees has finally been implemented (there is neither pseudocode in the paper nor code attached to the submission). It seems that learning follows a kind of beam search, but this is not clearly stated. Let me underline that this is not necessary for probabilistic label trees. It should be enough to use a given training example only in those nodes for which the parent node is positive (this is what the conditioning on z_{Pa(i)} = 1 in (1) says). The solution used by the authors looks like a kind of additional negative sampling. A careful discussion should be given here.
- It is also not entirely clear from Subsection 2.3 how training and prediction are performed. Are the models for each level trained sequentially from the top to the bottom levels, or are they trained all at once? How does batch training work in this case? Is the C parameter the same for training and for inference using beam search?
- The notation used by the authors is not systematically introduced (some symbols are defined in captions of figures). This makes the paper unpleasant to read.

On the other hand, the details of the conducted experiments (results, parameters, hardware, ablation analysis) are well described. It is only surprising that the authors did not include the results of extremeText. This is a shallow network based on PLTs that significantly outperforms XML-CNN. Admittedly, its results are inferior to those of Parabel or Bonsai, but it is an online method that is methodologically more similar to the algorithm introduced by the authors.

Minor comments concerning clarity are given below:
- word(where => word (where
- regraded => regarded
- Do the outputs of AttentionXML \hat{y} correspond to the node variables z_n?
- XML-CNN uses the binary cross-entropy loss (not the cross-entropy loss function as suggested by the authors), which is theoretically well justified as it leads to estimation of marginal probabilities (see, for example, "On label dependence and loss minimization in multi-label classification", MLJ 2012, and the extremeText paper).

Significance: AttentionXML achieves better results in terms of precision@k than other state-of-the-art algorithms (including 1-vs-all approaches like DiSMEC, which are very hard to beat) on popular XMLC benchmark datasets. It is worth underlining that the method improves the results significantly not only on precision@1, but also on precision@5. While the proposed approach is still quite costly in training and has many additional hyper-parameters that seem to affect the results significantly, one should not ignore these outstanding results. The presentation of the paper should be improved, but the contribution deserves publication at NeurIPS.
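
To make the point about conditioning concrete, the following is a minimal sketch of the generic probabilistic label tree scheme described in the Clarity comments. It is not the authors' implementation. The toy tree layout, the function names (`positive_nodes`, `training_nodes`, `beam_search`), the callable `cond_prob`, and the parameter `beam_width` (standing in for the paper's C) are all hypothetical and chosen only for illustration. The sketch shows (i) which nodes receive a given training example when only nodes with a positive parent are used, and (ii) level-wise beam search at prediction time, where a node's score is the product of the conditional probabilities along its path from the root.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy representation: node id -> list of child ids.
# Nodes that do not appear as keys are leaves, i.e. labels.
Tree = Dict[int, List[int]]


def positive_nodes(tree: Tree, root: int, positive_leaves: Set[int]) -> Set[int]:
    """Nodes with z_n = 1: a leaf is positive if it is a relevant label,
    an internal node is positive if at least one child is positive."""
    positives: Set[int] = set()

    def visit(node: int) -> bool:
        children = tree.get(node, [])
        if not children:
            is_pos = node in positive_leaves
        else:
            # Build the full list first so every child is visited (no short-circuit).
            is_pos = any([visit(child) for child in children])
        if is_pos:
            positives.add(node)
        return is_pos

    visit(root)
    return positives


def training_nodes(tree: Tree, root: int, positive_leaves: Set[int]) -> Dict[int, int]:
    """Return {node: binary target} for the nodes that see this example.
    Only children of the root or of a positive parent are included."""
    pos = positive_nodes(tree, root, positive_leaves)
    targets: Dict[int, int] = {}
    for parent, children in tree.items():
        if parent == root or parent in pos:
            for child in children:
                targets[child] = int(child in pos)
    return targets


def beam_search(
    tree: Tree,
    root: int,
    cond_prob: Callable[[int], float],
    beam_width: int,
) -> List[Tuple[int, float]]:
    """Level-wise beam search keeping at most `beam_width` nodes per level.
    `cond_prob(n)` is assumed to return P(z_n = 1 | z_parent(n) = 1, x)."""
    frontier: List[Tuple[int, float]] = [(root, 1.0)]
    leaves: List[Tuple[int, float]] = []
    while frontier:
        candidates: List[Tuple[int, float]] = []
        for node, prob in frontier:
            children = tree.get(node, [])
            if not children:
                leaves.append((node, prob))  # reached a label
                continue
            for child in children:
                candidates.append((child, prob * cond_prob(child)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        frontier = candidates[:beam_width]
    return sorted(leaves, key=lambda item: item[1], reverse=True)


# Toy example: root 0, internal nodes 1 and 2, labels 3, 4, 5, 6.
toy_tree: Tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
toy_probs = {1: 0.9, 2: 0.2, 3: 0.8, 4: 0.1, 5: 0.7, 6: 0.6}

targets = training_nodes(toy_tree, 0, {3})
ranked = beam_search(toy_tree, 0, lambda n: toy_probs[n], beam_width=2)
```

In this toy example with label 3 as the only relevant label, nodes 5 and 6 never see the training example because their parent (node 2) is negative. Training on the nodes reached by a beam search instead would add some of those nodes as extra negatives, which is the "additional negative sampling" reading mentioned above.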

Reviewer 3

====================== Thanks to the authors for the rebuttal; I have updated my score for the paper. It would be helpful to include the additional experiments done as part of the rebuttal in the final version, along with the references. ================

The paper presents a deep learning method based on the attention mechanism for the extreme multi-label text classification (XMTC) problem. The proposed solution scales to datasets with up to 3M labels, and is claimed to achieve state-of-the-art results on the precision@k and ndcg@k metrics.

Originality - The paper has two main parts: (i) using the attention mechanism for XMTC and (ii) a shallow tree for scaling up to large datasets. Even though using the attention mechanism is new for this domain, the idea of shallow trees is somewhat related to a recently proposed method (Bonsai, [10] in the paper). However, one of the main concerns is the following: (a) The process of making the tree shallow by node compression is related to similar ideas in hierarchical classification ([1,2] below). The references to these works are missing and should be included.

Quality - The paper shows that state-of-the-art performance is achieved by using shallow trees along with the attention mechanism. However: (a) It is not clear how the attention mechanism works standalone, i.e., how it performs on small datasets where there are no scalability issues and one does not need the shallow trees. It is, therefore, important to evaluate the effect of attention separately, without the tree structure, and then study the impact of using it in a shallow tree architecture. Does the accuracy improve when using the shallow tree? How does it work with a deep tree? (b) In the comparison on propensity metrics, another state-of-the-art method, ProXML [3], is missing. On some of the datasets, the performance of ProXML is relatively better on these metrics. This comparison should therefore be included in the paper.

Clarity - Even though the paper is clear in terms of writing, it would be really beneficial if the authors also provided the code.

Significance - Since the results of the paper are significantly better than most state-of-the-art methods, it would be of interest to the community.

[1] On Flat versus Hierarchical Classification in Large-Scale Taxonomies, NIPS 2013
[2] Learning taxonomy adaptation in large-scale classification, JMLR 2016
[3] Data scarcity, robustness and extreme multi-label classification, Machine Learning Journal, 2019