NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:2742
Title:Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces

Reviewer 1

In the prior literature, they cited the low dimensional embedding methods is the reason of the poor performance of the embedding based methods. In this paper, the author proposed that the final score vector for the labels actually generated by highly non-linear transformation such as thresholding the scores. Thus it is not clear if the low-rank structure of the score vectors directly cause the low-rank on the label vectors. Furthermore, the author uses a simple neural network to mimic the low-dimensional embedding can attain near-perfect training accuracy but generalize poorly and suggesting that overfitting is the root cause of the poor performance of the embedding based methods. This is the first contribution of the paper which breaks the glass ceiling of embedding based methods. In order to address the overfitting problem, it highlights the need of regularization techniques. We know that most feature vectors associated with only very few true labels, thus the label embedding matrix should have near-orthogonal structure, which is exactly what the Spread-out regularizer did. But the drawback is that it ignores the frequently co-occur labels correlation, thus it is not desirable. In order to address the lack of modeling co-occurrences of labels bring by spread-out regularizer, the author estimates the degree of occurrences between labels from the training data. The author built a co-occurrence label matrix and encode the conditional frequencies between labels into the loss function, which is the novel framework of GLaS Regularization. This method defines the degree of label correlation as the arithmetic mean of conditional frequencies between labels. In order to get rid of the expensive computation, the author chooses to compute stochastic loss on a specific batch. The GLaS regularizer’s name actually comes from the sum of graph Laplacian regularizer and the spread-out regularizer. Graph Laplacian regularizer tends to assign similar embeddings to labels that co-occur frequently, but the drawback is that it doesn’t penalize the similar embeddings of the non-related labels. This drawback can be solved by spread-out regularizer which encourages all label embeddings to be orthogonal regardless of any correlation. Thus, summing the graph Laplacian regularizer and the spread-out regularizer, the author proposes their regularizer which captures the advantages of both regularizer. Through the experiments on several widely used extreme (millions of labels) multi-label classification datasets, the proposed regularizer method matches or outperforms with all previous work in terms of precision on most dataset. The main contribution of this paper is to point out the bottleneck of the inferior performance of embedding-based method is not limited to the low-dimensional embedding problem, instead, the author proves the poor performance is because of the overfitting issue, and proposes their novel regularizer which is the sum of graph Laplacian regularizer and the spread-out regularizer and evaluate the effectiveness on several widely used datasets. From this point of view, research can focus on finding the other better regularizer and it is feasible to use low-dimensional embedding based methods even with simple linear classifiers to reach competitive result.

Reviewer 2

After the rebuttal: - Thanks for the plot. That is useful to see and in favor of the paper. - Note that representability based on universal approximation theorem does not imply learnability. Finding those representative weights, can still be NP hard. ------------- This paper addresses the problem of extreme multi-label classification. The paper proves that the embedding model is expressive. It then provides an example of experiment on data where the embedding-based model performed poorly as the result of overfitting. Once authors diagnose the problem as overfitting, they propose regularizers to constrain the model and improve the situation. Strengths: - The paper addresses an important and relevant research question: is extreme multi-label classification learnable by embedding models? - The expressiveness theory (Section 2.2) is useful. Weaknesses: - Quality: Results of Section 2.1, which builds the main motivation of the paper, is demonstrated on a very limited settings and examples. It does not convince the reader that overfitting is the general reason for potential poor performance of the models under study. - Soundness: While expressiveness is useful, it does not mean that the optimal weights are learnable. The paper seem to not pay attention to this issue. - Clarity: Related work could be improved. Some related works are mainly named but their differences are not described enough. - Organization could be improved. Currently the paper is dependent on appendix (eg the algorithms). Also the contents of tables are too small. Overall, I do not think the quality of the paper is high enough and I vote for it to be rejected.

Reviewer 3

==============Post rebuttal ========= Thanks to the authors for their feedback. Somethings still remain unclear : 1. Label embedding methods for large output spaces can be seen as deep networks with one hidden layer. It seems that the main message of the paper is that training these as deep nets with corresponding optimization schemes, rather than methods from conventional label embedding methods such as LEML and SLEEC leads to better results. Instead, the paper's main message conveys something different - namely, over-fitting of label embedding is the main cause for label-embedding methods, and it is fixed by regularization. Instead, on these datasets, a linear classifier also overfits, and it is not much of a surprise that a label embedding method/one hidden layer network overfits also. 2. It is somewhat difficult to understand the 200% increase in the propensity scored metrics(PSP@k in Table 4) over other sota baselines, while only below par or similar performance on vanilla P@k (Table 3). For instance, compare the proposed method vs SLEEC for various datasets. Does it mean that the method is working very poorly on head labels that it deriving all its performance from the tail-labels. In which case, it should be verified that of the model is indeed is doing poorly on head labels, and needs to be investigated why that is happening. 3. There should be much more clarity on the training time in terms of number of cores, and number of hours/days it took on diffrenent datasets, since this is a standard practice in XMC community, which makes it comparable to other methods. ================= The paper presents a simple deep learning method with one hidden layer for learning with large output spaces. The results show that the proposed method can achieve comparable performance to state-of-the-art linear methods such as DiSMEC/PPDSparse and tree-based methods such as Parabel. On a high level, I have the following concerns about the paper : 1. The paper suggests that they have found that overfitting is the main cause for bad performance of label-embedding methds, and that this is fixed by Glas regularizer. However, that does not seem to be the case, as from Table 1 and Table 2, the results without the Glas regularization are almost in the same ball-park as state-of-the-art methods, and the regularization merely helps by couple of % points. It seems that the main idea is that take a simple deep network (fully connected) with one hidden layer(acting as embedding) it gives close to sota performance, and one can increase the performance somewhat by regularization 2. It should be discussed why LEML (and SLEEC) which is based on similar scheme architecture but does not train it as a deep network works so poorly while the proposed method works better. Is it due to different optimization algorithm scheme. 3. Over-fitting is not new for these datasets, and it is likely that even linear classifiers such as DiSMEC overfits most of these datasets. For instance, one can get close to 99% training set accuracy with DiSMEC on Amazon13k data. So, it seems that showing over-fitting with embedding scheme is not very surprising. 4. The results for propensity score metrics (Table 4) seem rather surprising, since these metrics typically correlate very well with vanilla metrics (Table 3). Even though the vanilla metrics are very close to most sota methods, the prop scored metrics are almost double in some cases, which is really surprising. Since the proposed method does nothing special for tail-labels, and 70-90% are tail labels, the vanilla metric should also much higher, in that case. In this respect, it would be useful to be provided with the code to verify the claims, or the authors could check if their is correct, and if so, then investigate the reason for such large discrepancy. 5. The proof of Theorem 2.1 is not quite clear. In line 474 it seems that the proof if for some x that exists, and it becomes 477 it becomes for all x. The transformation is not immediate, and unclear. 6. The paper does not say anything about the computing infrastructure used to train the models for big datasets. It is important for fair comparison since most other works provide some information oin this context. 7. Missing references 1. The paper does not compare with AttentionXML [1], another deep learning approach which seems to be giving quite good results. 2. There is no comparison with ProXML [2] which is a sota method for propensity scored metrics, even though the code is available. Moreover, it also mentions ideas related to using graph laplacian for label co-occurrence. [1] AttentionXML: Extreme Multi-Label Text Classification with Multi-Label Attention Based Recurrent Neural Networks, [2] Data scarcity, robustness and extreme multi-label classification,