Reviews: Calibration tests in multi-class classification: A unifying framework

Reviewer 1

Originality
------------
This work offers a novel unifying view of several calibration errors (ECE, MCE, MMCE). The authors relate their work well to the literature by showing how the proposed framework generalizes these measures. From this framework, new estimators with interesting theoretical and practical properties are derived. The proposed estimators are all consistent, and one of them can be computed in linear time. The estimators are also interpretable in the sense that they can be used for hypothesis testing.

Quality
---------
To the best of my knowledge the paper is technically sound, and all claims come with quality proofs. The authors back their error bounds with numerical studies. However, they could have included experiments on real neural networks (On Calibration of Modern Neural Networks, Guo et al., 2017).

Clarity
--------
The paper is well written and organized. The supplementary material contains all the background on operator-valued kernels required to understand the proofs.

Significance
---------------
Calibration of classification models is a very important topic for industries dealing with critical applications. Moreover, the present work brings a new theoretical unifying view of calibration errors.

I have read the author response and changed my score from 7 to 8.
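For readers less familiar with the baseline measures named in this review, the following is a minimal sketch of the usual binned ECE for top-label confidence, in the style popularized by Guo et al. (2017). It is not taken from the paper under review; the function name, the equal-width binning, and the default of 10 bins are assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE of the top-label confidence (illustrative sketch).

    probs  : (n, m) array of predicted class probabilities
    labels : (n,) array of integer class labels in {0, ..., m-1}
    n_bins : number of equal-width confidence bins (an assumed default)
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)        # confidence of the predicted class
    predictions = probs.argmax(axis=1)     # predicted class
    correct = (predictions == labels).astype(float)

    # Assign each confidence to one of n_bins equal-width bins on [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # |accuracy - mean confidence| in the bin, weighted by bin mass.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

Note that this quantity only looks at the most confident prediction, whereas the framework under review targets calibration of the whole predicted probability vector.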

Reviewer 2

***** Post-rebuttal update: after reading the authors' feedback, I confirm my positive evaluation of this paper. *****

I would like to thank the authors for their submission.

Summary

The paper presents a novel unified theoretical framework and new measures for the calibration properties of multi-class classifiers, which generalize commonly used ones. Estimators for the proposed measures, based on vector-valued RKHSs, are then proposed. The statistical properties of these estimators are characterized theoretically (including proofs), and statistical tests associated with the estimators are presented. Finally, the properties of the proposed estimators are exhaustively validated in supporting simulated experiments.

Originality

The proposed ideas are novel in the context of calibrated multi-class classification. The proposed methods (e.g. the definition of KCE and the estimators) draw tools from matrix-valued kernel methods and kernel two-sample tests, which are appropriately referenced. To my knowledge, this area of research has rather scattered coverage in the literature. The paper presents potentially high-impact novel contributions and provides a much-needed rigorous unifying view of the topic. To my knowledge, other relevant works in the area are correctly referenced and the differences are clearly stated.

Quality

I have found the paper to be technically rigorous and sound. Definitions and statements are all clear, and quantities are properly introduced. Claims are well supported by both theoretical results, including full proofs, and exhaustive, well-designed experiments on synthetic data.

Clarity

The motivation, context, literature review, problem statement, theoretical claims and experiments are all well organized and delivered with excellent clarity. The paper is smooth, polished and pleasant to read.

Significance

As stated in the "Contributions" section, I deem the overall significance of this work as high, from both the theoretical and practical perspectives. Developing rigorous tools for characterizing the quality of predictive models' confidence predictions is an important priority for the field, and I have little doubt that this paper represents an important step forward in this direction.

Typos and minor comments:
- Capitalize first letters in title and section headings
- L15: patients.Since --> patients. Since
- L17: increasing the training data set is... --> increasing the training data set's size is...
- L22: Thus,
- L24: uncertainty, this
- L45: complementary
- L46: miscalibrated,
- L56: Recently,
- L58: ..., 0.3) since --> ..., 0.3), since
- L70: detail,
- Eq. (1): is the conditioning on $\max_y g_y(X)$ right, or could it just be on $g(X)$?
- L104, L114, L121, ...: Thus,
- L104, L117, L118, ...: eq. --> Eq.
- L159: a general calibration measure of strong calibration --> a general measure of strong calibration
- L180: setting,
- L257, ...: fig. --> Fig.
- L310: Consider citing as: Carmeli, C., De Vito, E., Toigo, A., & Umanità, V. (2010). Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(01), 19-61.
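The question about the conditioning in Eq. (1) comes down to which notion of calibration is meant. The review does not reproduce Eq. (1), so the following is only a sketch of the two standard candidates, assuming $g(X)$ denotes the vector of predicted class probabilities and $g_y(X)$ its $y$-th entry:

```latex
% Top-label (confidence) calibration: condition only on the largest predicted probability.
\mathbb{P}\Bigl(Y = \arg\max_{y} g_y(X) \,\Bigm|\, \max_{y} g_y(X)\Bigr) = \max_{y} g_y(X)
\quad \text{almost surely.}

% Calibration of the full prediction vector: condition on all of g(X).
\mathbb{P}\bigl(Y = y \,\bigm|\, g(X)\bigr) = g_y(X)
\quad \text{for all classes } y, \text{ almost surely.}
```

The second condition implies the first but not the other way around, so conditioning on $g(X)$ alone would turn Eq. (1) into the stronger notion.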

Reviewer 3

Summary: In this paper the authors propose a unifying framework for testing the calibration of multi-class classifiers, i.e. if a classifier gives a probabilistic output, how close these outputs are to the actual probabilities given the classifier (or, framed as a statistical test, whether they match). The actual definition is a bit more subtle, but this is the general idea. To do this they introduce a framework involving an integral probability metric. They propose using a "matrix kernel" RKHS as a convenient way of controlling the function over which they maximize in the probability metric. With this kernel setup they obtain a closed-form expression for the calibration error and some finite-sample and asymptotic bounds describing its behavior. Finally, the authors perform experiments in which they empirically evaluate the distribution of the test statistic, as well as its type I and type II errors, and compare to a classic method.

Overview: Overall the paper reads well and is fairly clear. The theory looks good and I see the kind of theorems and bounds I would expect to find in this kind of paper. The experiments seem good, although I think they could consider more competitor techniques (assuming they exist; I am not terribly familiar with this specific topic). Perhaps it would also be interesting to see how the test performs on a real-world dataset; however, this test statistic does not depend on the value of X, just on the Y values and the predicted output distributions, so perhaps a real-world dataset doesn't really add much to what the authors have done already. The supplementary material is quite extensive and contains quite a bit of good, mathematically correct theory along with more experiments.

Potential Issues: It is unclear to me why "matrix-valued kernels" are necessary for this test. I haven't personally derived anything, but it seems likely that one could do this sort of test using standard kernels. Perhaps the "matrix-valued kernel" arises naturally from the obvious test statistic, but it would be nice to know why this concept needs to be introduced. I am unsure how significant this paper is. The fact that the paper only provides a statistical test is a bit concerning. Again, I am not familiar with this particular problem; perhaps it is difficult and/or important enough that this paper is important. I found l.213-216 a bit too mysterious for my liking. Is it not possible for other estimators to estimate (3)? Does this method allow (3) to become tractable, or are you simply observing that one should keep in mind that the test statistics come from (3)?

Verdict: While I'm not entirely sure of the significance of the topic, the paper is well written and the research seems to be of high quality, so I recommend that it be accepted.

Small errors or potential improvements:
- l.6-8: This sentence is grammatically incorrect, or at least strange. Maybe something like "We present a new method based on matrix-valued kernels which offers consistent and unbiased..."
- l.15: Missing a space after the period.
- l.18: "would be" should be "is".
- l.35: Maybe this is nomenclature for the topic, but I'm not totally sure what "subjective" means here.
- l.38-47: I found this paragraph confusing while reading. I think it could be improved by using a concrete mathematical expression or two. I didn't really understand the problem until I got to (1) and (2).
- l.52: "were" should be "have been".
- l.133: "matrix-valued kernel" should be set with \emph.
- l.139-148: It's difficult for me to figure out what here is established theory and what is material that you have developed and have just left underived. For example, is "universal kernel" totally analogous to the standard definition or is it a bit different? Also, how does Micchelli and Pontil (2005) relate to this? Did they actually introduce the matrix-valued kernel, or just a vector-valued one which you use as the basis for the matrix one?
- l.161: It's not clear to me why the CE has to be infinity if it's not 0. If F contains only bounded functions, then it seems to me that the maximum over F should remain bounded. I wonder if you are already adopting some of the intuition from the RKHS function class (e.g. assuming F is a vector space).
- l.219: "greater or equal than" should be "greater than or equal to".
- I find the P[...] notation a bit strange; I think it is definitely less standard than P(...).
- l.225: "weak" should be "loose".
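To make concrete the observation above that the test statistic depends only on the labels and the predicted distributions, and to show one way a matrix-valued kernel can enter, here is a minimal sketch of a pairwise (U-statistic style) kernel calibration estimate. It is not the authors' implementation: the function name, the Gaussian scalar kernel, the bandwidth, and the choice of a matrix-valued kernel of the form k(p, q) times the identity are all assumptions made for illustration.

```python
import numpy as np

def pairwise_kernel_calibration_estimate(probs, labels, bandwidth=1.0):
    """Average of h_ij over all pairs i != j (illustrative sketch).

    Assumes a matrix-valued kernel K(p, q) = k(p, q) * I with a Gaussian
    scalar kernel k, so that
        h_ij = (e_{y_i} - p_i)^T K(p_i, p_j) (e_{y_j} - p_j)
             = k(p_i, p_j) * <e_{y_i} - p_i, e_{y_j} - p_j>.
    probs  : (n, m) predicted probability vectors p_i
    labels : (n,) integer labels y_i in {0, ..., m-1}
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n, m = probs.shape

    # Residuals between one-hot labels and predictions; the inputs X never appear.
    residuals = np.eye(m)[labels] - probs

    # Gaussian scalar kernel evaluated on all pairs of prediction vectors.
    sq_dists = ((probs[:, None, :] - probs[None, :, :]) ** 2).sum(axis=-1)
    scalar_kernel = np.exp(-sq_dists / (2.0 * bandwidth ** 2))

    # h_ij for all pairs; the diagonal (i == j) is excluded from the average.
    h = scalar_kernel * (residuals @ residuals.T)
    return (h.sum() - np.trace(h)) / (n * (n - 1))
```

With the identity-scaled kernel assumed here, the matrix-valued structure reduces to a scalar kernel multiplied by an inner product of residuals; other matrix-valued kernels would couple the classes differently. Because the diagonal terms are dropped, the estimate can be negative on small samples.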

Paper ID:	6628