Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Since 2017, there has been considerable effort to improve confidence modeling with classifiers, with two major goals: rejecting when uncertain, and detecting out-of-distribution examples. In a field whose work has been mostly empirical and focused on DNNs, this line of work stands out by being mostly theoretical, taking its seeds from work on boosting with abstention. There seem to be two main contributions in this work, obtained through excellent theoretical derivations. However, their significance may be limited, as the authors make no effort to connect them to the deep learning literature:
1) Negative result: in some multiclass settings with rejection, it is pointless to train a separate rejector. Solutions that converge towards the Bayes-optimal solution require the rejector to be a function of the Bayes-optimal scoring function, i.e., it should not be trained separately.
2) New bound: the excess loss bound for CE (Theorem 8), which clearly states that one can train with the cross-entropy loss (the usual loss for softmax DNNs) without taking the rejection cost c into account. This is a very nice result, which confirms theoretically what most people were already doing empirically.
I am not sure whether this is what the authors mean by "novel rejection criteria" in the abstract, but the main lesson I take from this paper is that it confirms what I was already doing as a DNN practitioner. This paper would gain greatly in significance with better connections to the deep learning literature.
1) The scope of the negative result is unclear. The example given is over losses (MPC/APC with the binary exponential loss) that I have never seen used and can only imagine in the boosting literature. There have been recent attempts to train separate rejectors in DNN settings.
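As a minimal sketch of the practice the reviewer says the bound justifies (train with plain cross-entropy, then apply a Chow-style confidence threshold only at test time; the function names and toy logits below are illustrative, not from the paper):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_with_rejection(logits, c):
    """Chow-style rule on a model trained with plain cross-entropy:
    predict the argmax class, but abstain (return -1) when the top
    softmax probability falls below 1 - c, where c is the cost of
    abstaining. Note c never enters training, only this threshold."""
    probs = softmax(logits)
    top = probs.max(axis=-1)
    preds = probs.argmax(axis=-1)
    return np.where(top >= 1.0 - c, preds, -1)

logits = np.array([[4.0, 0.0, 0.0],   # confident: top prob ~0.96
                   [0.2, 0.1, 0.0]])  # near-uniform: top prob ~0.37
out = predict_with_rejection(logits, c=0.2)  # threshold 1 - c = 0.8
```

With c = 0.2 the first example is classified as class 0 and the second is rejected, since its top probability is below 0.8.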
See for instance "Learning Confidence for Out-of-Distribution Detection in Neural Networks" (https://arxiv.org/pdf/1802.04865.pdf), where the authors spend enormous effort taming a very unstable training procedure over the loss in their Eq. (5) with hacks, in particular when determining a hyperparameter lambda. Could the conditions expressed in Eq. (6) be applied to their work?
2) The new bound that justifies training with softmax and cross-entropy should be better publicized. Which leaves us facing the same mystery: why does SGD fail so badly to converge towards the Bayes-optimal solution, producing over-confident outputs (see https://arxiv.org/abs/1706.04599, "On Calibration of Modern Neural Networks")?
One key part of the paper that would greatly benefit from more intuitive explanations is Section 3. The results are first presented without justification, and I could not understand the explanations given on lines 144-158.
While the paper is well written, the English could be improved, and some explanations are at times unclear or confusing. A few detailed comments by line:
74 "It is well known": the Bayes-optimal rejector is not trivial, and some do not agree with it. It should be traced to Chow. The fact that the threshold should be (1-c) is not "well known".
128 typo: "seperation"
145 Is the "objective function" W or dW/dr?
171 MPS -> MPC
174 State that MPC and APC are the same with the exponential loss! Actually, this part is very hard to read: the distinction between MPC and APC does not add anything to the understanding of the paper and could be moved to the appendix.
227 Do you mean "This enables us to derive the same bound in a considerably simpler way" or "This enables us to derive a more general bound in a simple way"?
277 "can no more" ... "unlike" is a strange construction.
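Regarding the comment on line 74, the (1-c) threshold does follow from a one-line comparison of conditional risks (a sketch in my notation, writing $\eta_y(x)$ for the class-posterior probabilities): the Bayes-optimal rule abstains exactly when abstaining is cheaper than predicting the most likely class,

```latex
\underbrace{1 - \max_y \eta_y(x)}_{\text{conditional risk of predicting}}
\;>\;
\underbrace{c}_{\text{cost of abstaining}}
\quad\Longleftrightarrow\quad
\max_y \eta_y(x) \;<\; 1 - c .
```

The reviewer's point stands, though: this argument is due to Chow and should be cited as such rather than called "well known".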
After Rebuttal: The clarification of the negative result is still not sufficient. The authors keep saying "difficult"; what does that mean? If Theorem 4 is only a condition for checking calibrated surrogates, it actually drops in value. The authors should make a serious effort to state a mathematically coherent impossibility result and not just say it is "difficult". I also disagree that the excess risk derivations are complicated: a simple application of Pinsker's inequality transforms the excess log loss into an l1-norm distance, which can easily be converted into an excess abstain-loss risk.
Before Rebuttal:
Summary: The authors consider the problem of multiclass classification with a reject option. Assuming the "cost of abstaining" is a constant, the authors analyze various surrogates and determine whether they are "calibrated" with respect to the abstain loss. There are two main paradigms for building an abstaining classifier:
1. Confidence based: build a scoring model, and abstain if the scores are not "confident".
2. Separation based: build a separate rejector and scorer, and use the scorer to classify all instances that have not been rejected by the rejector.
The authors argue that separation-based methods cannot be calibrated, and show that standard confidence-based methods can be made calibrated by using an appropriate rejection threshold.
Review: The main contribution of the paper would be an attempt at showing the impossibility of "separation-based" calibrated surrogates (Theorem 4). Theorem 4 looks correct and indicates some problems with a surrogate that has to be calibrated with the abstain loss, but there is no concrete impossibility statement. Lines 150 to 155 try to make this precise, but it is not particularly clear. This should be made into a theorem or a corollary. I am guessing it should be something along the lines of "if the surrogate is convex in r, then it is not calibrated w.r.t. the abstain loss".
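The Pinsker route the reviewer has in mind can be sketched as follows (my notation: $\eta(x)$ is the true class posterior, $\hat p(x)$ the model's softmax output). The pointwise excess cross-entropy risk is exactly a KL divergence, and Pinsker's inequality converts it into an l1 distance:

```latex
\mathrm{KL}\bigl(\eta(x)\,\|\,\hat p(x)\bigr)
= \mathbb{E}_{y\sim\eta(x)}\!\bigl[-\log \hat p_y(x)\bigr]
- \mathbb{E}_{y\sim\eta(x)}\!\bigl[-\log \eta_y(x)\bigr],
\qquad
\bigl\|\eta(x)-\hat p(x)\bigr\|_1
\le \sqrt{2\,\mathrm{KL}\bigl(\eta(x)\,\|\,\hat p(x)\bigr)} .
```

Since the plug-in decision and the (1-c) rejection threshold depend on the posterior only through its entries, the pointwise excess abstain-loss risk is controlled by this l1 distance, yielding a square-root excess-risk bound.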
If this is what the authors mean, they should state and prove it instead of asking the readers to look at a statement about the derivatives and intuit it. The section on confidence-based surrogates is standard and cannot be considered an original contribution. It is well known that with a proper multiclass loss, the class probabilities can be estimated, and the form of the confidence predictor is exactly the same as the form of the Bayes classifier. There are some excess risk bounds derived in Theorem 7 and Table 1, but these are not particularly original, and such bounds can be derived using previously known techniques (a la Steinwart, Bartlett et al.). The empirical results are not particularly impressive (or original, as the algorithm proposed is simply standard multinomial logistic regression). The APC and MPC methods have been argued to be sub-optimal in theory and are shown to be sub-optimal in practice, which is only mildly satisfying.
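For concreteness, the target 0-1-c abstain loss against which all of these surrogates are measured is trivial to evaluate empirically (a sketch; the REJECT encoding and function name are illustrative, not from the paper):

```python
import numpy as np

REJECT = -1  # sentinel label for "the classifier abstained"

def abstain_loss(y_true, y_pred, c):
    """Mean 0-1-c loss: 0 for a correct prediction, c for an
    abstention (y_pred == REJECT), 1 for a misclassification."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    loss = np.where(y_pred == REJECT, c, (y_pred != y_true).astype(float))
    return loss.mean()

# One correct, one rejected, one wrong, with c = 0.2:
risk = abstain_loss([0, 1, 2], [0, REJECT, 1], c=0.2)  # (0 + 0.2 + 1)/3
```

This is the quantity the paper's tables compare across MPC, APC, and the confidence-based methods.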
This paper studies the problem of multiclass classification with rejection. The authors first survey the confidence-based and separation-based approaches in binary classification, and then extend both kinds of approaches to the multiclass case. For the separation-based approach, the authors derive a necessary condition for rejection calibration. For the confidence-based approach, they discuss the one-versus-all loss and the cross-entropy loss in the multiclass case. The error bounds of the related losses are also analyzed, and some experiments are presented. Although this is not the first paper to discuss the problem of multiclass classification with rejection, its contribution is that it provides substantial theoretical analysis and experimental comparison for the problem, including several theorems and some interesting experimental conclusions. There are many theorems and proofs in the appendix; I checked two of them and they are both correct. The experiments in the paper and the appendix are also convincing. I think this paper is useful for the research area of learning with rejection.