Review for NeurIPS paper: Multi-label classification: do Hamming loss and subset accuracy really conflict with each other?

NeurIPS 2020

Multi-label classification: do Hamming loss and subset accuracy really conflict with each other?

Review 1

Summary and Contributions: This paper deals with theoretical issues in multi-label classification. Through Rademacher complexity and a vector contraction inequality, it derives and discusses generalisation error bounds for Hamming Loss, Subset Accuracy, and Ranking Loss in the case of Subset Loss optimisation.

Strengths: This paper provides appears to explain why, in the case of small label space, HL optimisation leads to better performances than SA.

Weaknesses: The relevance of the described results for the learning community at large is quite limited.

Correctness: This paper is mathematically sound. The theoretical findings are corroborated by experiments.

Clarity: This paper is very well written.

Relation to Prior Work: Definitely

Reproducibility: Yes

Additional Feedback: N.A.

Review 2

Summary and Contributions: The main idea is to use the Lipschitz continuity of loss functions with respect to the l2 norm (since the loss function takes a vector input) to study generalization bounds for multi-label learning.

Strengths: -

Weaknesses: This idea is well-known in the setting of multi-class classification, e.g., [1], [2]. Although this paper considers a different learning setting, i.e., multi-label learning, the theoretical analysis is exactly the same. In my opinion, the paper trivially extends the known results of multi-class classification to multi-label learning. The authors do not introduce any new technique and do not derive any surprising results. [1] Lei Y, Dogan Ü, Zhou DX, Kloft M. Data-dependent generalization bounds for multi-class classification. IEEE Transactions on Information Theory. 2019 Jan 21;65(5):2995-3021. [2] A. Maurer, “A vector-contraction inequality for rademacher complexities. International Conference on Algorithmic Learning Theory. Springer, 2016, pp. 3–17. ---- I slightly upgraded my evaluation; my concern persits: the theoretical analysis is straight forward.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback: -

Review 3

Summary and Contributions: The paper discusses generalization bounds for multi-label classification, using three common loss functions: Hamming loss, rank loss and subset zero-one loss. As main contribution, the authors claim that optimizing Hamming loss can be a good alternative for optimizing subset zero-one loss. This is also illustrated in a small experimental study using classical multi-label benchmarks.

Strengths: - the work is novel - the claims are based on theoretical results

Weaknesses: - some very restrictive assumptions are made - hence, the conclusions are not fully supported by the results

Correctness: The derivations are correct.

Clarity: Yes, it is very well written.

Relation to Prior Work: Yes, it is.

Reproducibility: Yes

Additional Feedback: The paper is very well written. Even though it is theoretical in nature, it is quite accessible for a broad audience. The performed analysis is novel, and the results are somewhat surprising. I do have concerns w.r.t. to the take-home message of the paper, because this message is contradicting the theoretical and empirical results obtained in [11]. In the introduction the authors mention the following to explain this contradiction “The analysis in [11] has inevitable limitations because the hypothesis space is uncontrained and the conditional distribution P(y|x) is known. These assumptions have a large gap with real learning algorithms where a constrained hypothesis space is used and P(y|x) is unknown”. It is not entirely clear to me what the authors want to say here, but perhaps they allude to the fact that in practice it is not so easy to estimate P(y|x). This is of course true, but at least one could try to estimate P(y|x) as good as possible. Several methods exist for doing this: structured SVMs, conditional random fields, probabilistic classifier chains, and deep versions of those methods. In contrast, the authors make a very strong assumption in Eqn. 7, by assuming a linear model per label, as in binary relevance. It has been shown in [11] that such a hypothesis is in general not capable of minimizing subset zero-one loss in a Bayes-optimal way. The hypothesis space in Eqn. 7 simply models P(y|x) as a product of marginals, thereby assuming conditional independence of labels. So, this hypothesis space is very restrictive. Under these restrictions, the results obtained by the authors are not so surprising. When assuming conditional independence between labels, the risk minimizers for Hamming loss and subset zero-one loss coincide, as shown in [11]. Nonetheless, the results of this paper are novel, and they are complementary to what already exists in the literature. I only have the feeling that the authors are not telling a very fair story. I think that this paper can be accepted, provided that the above comments are taken into account in a revised version. --------------------------------------------------------- Review update after author rebuttal: -------------------------------------------------------- There are indeed two competing frameworks for minimizing complex loss functions. Either you optimize a surrogate loss during training (known as direct utility maximization), or you fit a probabilistic model during training, followed by an inference phase for every test instance (known as the decision-theoretic approach). In the rebuttal the authors claim that their results are not comparable to [11], because they consider the first framework, whereas [11] considers the second framework. I don't agree. Perhaps it is a bit easier to analyze the difference between Hamming and subset zero-one loss in the decision-theoretic framework, but I don't see a reason why the conclusions should be different for the utility maximization framework. [11] has shown that two aspects are of crucial importance to minimize a complex loss such as the subset zero-one loss: 1. The algorithm should target that loss, e.g., by minimizing a surrogate of that loss. 2. The hypothesis space should be expressive enough. For the subset zero-one loss, [11] has shown that binary relevance leads to a hypothesis that is not expressive enough. I don't see why this would be different in the utility maximization framework. This could probably be easily illustrated in experiments on synthetic data, where complex dependencies between labels can be simulated. For example, a structured SVM that models dependencies between labels and minimizes subset zero-one loss is expected to outperform binary relevance on subset zero-one loss, while the latter might be better for Hamming loss. Thus, as stated in my initial review, the assumption made in Eqn. 7 leads to my opinion to a too restrictive hypothesis space, and hence wrong conclusions. I am not aware of any papers that prove my claim for subset zero-one loss in the utility maximization framework, but for the F-measure this has been illustrated in the work of Petterson and Caetano (NIPS 2010 and NIPS 2011).

Review 4

Summary and Contributions: For algorithms that optimise (surrogate) hamming, subset, and rank losses, the paper introduces (1) several inequalities associating the different (actual and surrogate) losses wrt a hypothesis (this allows to provide insight regarding the phenomenon of optimising for hamming loss achieving better subset-loss performance than directly optimising for subset loss) (2) generalisation bounds for each of the losses in terms of the others, further providing insights regarding their respective hardness and performance related to the different objectives, and (3) a set of experiments to validate the key applicable observations, namely that optimising for (surrogate) hamming (subset) loss performs better than the other when evaluating over hamming (subset) loss, respectively (with the observe exception mentioned in (1) above - only when the number of classes is small).

Strengths: The main contributions are the introduction of generalisation bounds for the different losses, and the inequalities between the different losses. The bounds are provided for kernel-based hypothesis classes, leverage recent Rademacher complexity results, and may serve to get similar bounds for other related losses and other hypothesis classes going forward.

Weaknesses: The proposed algorithms do not optimise the actual loss rather a surrogate, therefore the actual ordering among the optimised hypotheses depends on the tightness of the bounds and therefore any related conclusion should be qualified appropriately.

Correctness: The claims in the abstract (and in the rest of the paper) regarding the objective optimised by the algorithms (e.g., lines 6 and 7) should be qualified - the surrogate is optimised rather than the actual HL (or SA) loss.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Post rebuttal: After reading all reviews I decreased my score a notch. The key theoretical contribution/technique is not clear enough nor designated within the main body of the paper.