Review for NeurIPS paper: Classification with Valid and Adaptive Coverage

NeurIPS 2020

Classification with Valid and Adaptive Coverage

Review 1

Summary and Contributions: The paper proposes a new conformity score for categorical / unordered response labels that is adaptive to the heterogeneity of difficulty of examples. The key insight of using generalized inverse quantile conformity measure E has the potential to be useful in settings beyond the classification problem discussed in the paper. UPDATE--------------------------------------------------------------------------------------- I thank the authors for their feedback. Both my questions were addressed quite satisfactorily.

Strengths: The novel conformity score Eq. (7) (via Eq. (3)) is an elegant solution to the problem of heterogeneity in classification setting. It has several advantages compared to the existing approaches, such as HCC (which may have quite an uneven coverage across the feature space) or CQC (which requires a three-way split and a quantile regression method on top of a score fitting method). The authors also make a good point that the key idea is applicable beyond the classification setting. For example, it is now easier to extend conformal prediction framework to the set-ups where Y may be multidimensional.

Weaknesses: At the time of this writing, I cannot think of any significant weaknesses that are inherent to the new conformity score.

Correctness: Yes, I took a brief pass at the proofs in the supplement, and they appear to contain no errors. (As the authors remark, the proofs themselves are quite standard in this literature.)

Clarity: The paper is well written and quite easy to follow. Prefacing the introduction of the new conformity score with the discussion of the oracle method in Section 1.1 is a nice touch.

Relation to Prior Work: The prior works are discussed adequately.

Reproducibility: Yes

Additional Feedback: Questions: - Could more details be given about the exact implementation of the competing methods, HCC, CQC and CQC-RF for the experiments in Sections 3 and 4? For example, CQC (and CQC-RF) requires a three-way split. Were any efforts made to tune the size of the splits? - Another coverage guarantee that may be of interest in the classification setting is label-conditional coverage guarantee (as in [24] Vovk et al. (2005), [19] Sadinle et al. (2019)). Can this be achieved with the proposed score?

Review 2

Summary and Contributions: This paper develops specialized versions of some recent developments in predictive inference, eg [1,4,18], for categorical and unordered response. More precisely, the author(s) focus on valid classification coverage using black-box classifier (assumed to treat all samples exchangeably). To this end, they use techniques from conformal prediction and introduce a new conformity score. Their (randomized) prediction sets are built mimicking the oracle prediction set using this new conformity score (7). This new conformity score is the cornerstone of their methodology and an interesting contribution. Although it appears to be a simple modification from hindsight, it is non-trivial from foresight because their "generalized quantile" based method is highly adaptive to the distributional heterogeneity, which is ubiquitous in real-world applications. Using ground-breaking results of [4,18] and a reduction argument (S6) in the supplement, the author(s) prove marginal coverage for split-conformal calibration, CV+ calibration and jackknife+ calibration. The papers presents fine experiments assessing the advantages of the proposed methods over existing alternatives.

Strengths: On the theoretical side, proving valid coverage is a difficult task especially when using CV+ calibration. Recent advances [4,18] might furnish some important tools to overcome this difficulty. This paper manages to reduce the classification problem to the above framework of the recent advances. This simple but not elementary step is the missing key to unveil powerful theoretical results (Theorem 1 and 2). On the empirical evaluation side, the paper gives a quite comprehensive study on the benefits of their methods.

Weaknesses: One potential weakness might be the positioning of this paper in regards of existing works. This paper might be seen as a specialized version of the papers [1,4,18] since most of the theoretical hardness and ideas are dealt by these references. But one might also argue that the conformity score and the reduction step (S6) are important key notion/step brought by this submission. Furthermore, their numerical experiments show that their proposed methods enjoy very nice statistical properties. (see also "relation to prior work" below)

Correctness: Claims and method are correct. On the methodology side, the author(s) use the notion of Worst (Conditional) Slice to evaluate the conditional coverage. This method is fine and well presented in the supplement.

Clarity: The paper is well written.

Relation to Prior Work: A clear discussion on previous works is given in the paper. In particular, a recent paper [2] offers similar guarantees applying quantile regression of hold-out scores of black-box classifiers. The experiments of the present paper tends to suggest that the author(s)'s methods might be superior. More fundamentally, the author(s) argue that their method is more adaptive to the distributional heterogeneity (end of Section 2.4), and I agree. [I agree with the authors' answer on this point]

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: * a new technique for obtaining calibrated probabilistic predictions for classification * theoretical justification, showing the method is near-optimal * experimental validation

Strengths: + neat idea! (use the smallest set of labels that contains exactly the required probability mass using a simple randomization trick) + theoretical guarantee of near optimality + excellent empirical performance + relatively well-written

Weaknesses: – will still require some more explaining to be accessible by the wider NeurIPS community – experiments are based on relatively simple and "boring" benchmark data sets (MNIST; CIFAR, ...), but on the other hand, at least they are nice and clean

Correctness: as far as I can tell

Clarity: mostly (but as I said, explaining the intuitions behind conformal prediction in more detail will help)

Relation to Prior Work: yes. I don't see a problem with the parallel work [2] not being carefully discussed since the [2] is so recent (1st version April 21, 2020 arXiv) that the authors haven't had a chance to study it properly

Reproducibility: Yes

Additional Feedback: ***I have read the author rebuttal. I have no further remarks based on it. I will keep my score unchanged.*** - is it really valid to say that the independence assumption (in i.i.d.) is unnecessary? that would sound like any identically distributed data are exchangeable, which isn't the case - the idea in the generalized inverse S is quite intuitive and can be understood with a few readings of the paragraph following its definition, but it's very hard to decipher while reading the definition itself. it'd be helpful to give the intuition first so that the reader can better understand the definition. - p. 2, l. 49 (about discarding zero probability labels): I don't see how zero probability elements can enter the set C^oracle in the first place - similarly to S, the intuitive definition of E, Eq. (7), would be helpful along the lines of "the smallest probability \tau s.t. $y$ is a member of the generalized inverse S" - C^sc, Eq. (8), is defined inside Algorithm 1. it took me a while to find it there. will be helpful to point this out to the reader in the text