{"title": "Accurate Layerwise Interpretable Competence Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 13981, "page_last": 13991, "abstract": "Estimating machine learning performance \u201cin the wild\u201d is both an important and\nunsolved problem. In this paper, we seek to examine, understand, and predict the\npointwise competence of classification models. Our contributions are twofold:\nFirst, we establish a statistically rigorous definition of competence that generalizes\nthe common notion of classifier confidence; second, we present the ALICE\n(Accurate Layerwise Interpretable Competence Estimation) Score, a pointwise\ncompetence estimator for any classifier. By considering distributional, data, and\nmodel uncertainty, ALICE empirically shows accurate competence estimation in\ncommon failure situations such as class-imbalanced datasets, out-of-distribution\ndatasets, and poorly trained models.\n\nOur contributions allow us to accurately predict the competence of any classification model given any input and error function. We compare our score with state-of-the-art confidence estimators such as model confidence and Trust Score, and show significant improvements in competence prediction over these methods on datasets such as DIGITS, CIFAR10, and CIFAR100.", "full_text": "Accurate Layerwise Interpretable Competence\n\nEstimation\n\nVickram Rajendran, William LeVine\n\nThe Johns Hopkins University Applied Physics Laboratory\n\nLaurel, MD 20723\n\n{vickram.rajendran, william.levine}@jhuapl.edu\n\nAbstract\n\nEstimating machine learning performance \u201cin the wild\u201d is both an important and\nunsolved problem. In this paper, we seek to examine, understand, and predict the\npointwise competence of classi\ufb01cation models. 
Our contributions are twofold:\nFirst, we establish a statistically rigorous de\ufb01nition of competence that general-\nizes the common notion of classi\ufb01er con\ufb01dence; second, we present the ALICE\n(Accurate Layerwise Interpretable Competence Estimation) Score, a pointwise\ncompetence estimator for any classi\ufb01er. By considering distributional, data, and\nmodel uncertainty, ALICE empirically shows accurate competence estimation in\ncommon failure situations such as class-imbalanced datasets, out-of-distribution\ndatasets, and poorly trained models.\nOur contributions allow us to accurately predict the competence of any classi\ufb01cation\nmodel given any input and error function. We compare our score with state-of-\nthe-art con\ufb01dence estimators such as model con\ufb01dence and Trust Score, and show\nsigni\ufb01cant improvements in competence prediction over these methods on datasets\nsuch as DIGITS, CIFAR10, and CIFAR100.\n\n1\n\nIntroduction\n\nMachine learning algorithms have achieved tremendous success in areas such as classi\ufb01cation [12],\nobject detection [24], and segmentation [1]. However, as these algorithms become more prevalent\nin society it is essential to understand their limitations. In particular, a supervised machine learning\nmodel\u2019s performance on a reserved test point is characterized by the difference between that point\u2019s\nlabel and the model\u2019s prediction on that point. A model is considered performant on that point if\nthis difference is suf\ufb01ciently small; unfortunately, this difference is impossible to compute once the\nmodel is deployed since the point\u2019s true label is unknown.\nThis problem is exacerbated when we consider the difference between real world data and the curated\ndatasets that the models are evaluated on \u2014 often these datasets are signi\ufb01cantly different, and it is\nnot clear whether performance on a held aside test set is indicative of real-world performance. 
It is\nessential to have a predictive measure of performance that does not require ground truth in order to\ndetermine whether or not a machine learning algorithm\u2019s prediction should be trusted \"in the wild\"\n\u2014 a measure of model competence. However, competence is currently not defined in any rigorous\nmanner and is often restricted to the more specific idea of model confidence.\nIn this paper, we define competence to be a generalized form of predictive uncertainty, and so we must\naccount for all of its generating facets. Predictive uncertainty arises from three factors: distributional,\ndata, and model uncertainty. Distributional uncertainty [4] arises from mismatched training and\ntest distributions (i.e. dataset shift [23]). Data uncertainty [4] is inherent in the complex nature\nof the data (e.g. input noise, class overlap, etc.). Finally, model uncertainty measures error in the\napproximation of the true model used to generate the data (e.g. overfitting, underfitting, etc.) [4] \u2014\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fthis generally reduces as the amount of data increases. Accurate predictive uncertainty estimation (and\nthus accurate competence estimation) requires consideration of all three of these factors. Previous\nattempts to explicitly model these three factors require out-of-distribution data, or are not scalable to\nhigh dimensional datasets or deep networks [4] [18]; there are currently very few methods that do\nso in a way that requires no additional data, scales to high dimensional data and large models, and\napplies to any classification model, regardless of architecture, dataset, or performance.\nWe focus on mitigating these issues in the space of classifiers. 
In Section 2 we present several\ndefinitions, including a robust, generalizable definition of model competence that encompasses the\ncommon notion of model confidence. In Section 3 we examine the related work in the areas of\npredictive uncertainty estimation and interpretable machine learning. In Section 4 we show a general\nmetric for evaluating competence estimators. In Section 5 we develop the \"ALICE Score,\" an accurate\nlayerwise interpretable competence estimator. In Section 6 we empirically evaluate the ALICE Score\nin situations involving different types of predictive uncertainty and on various models and datasets.\nWe conclude in Section 7 with implications and ideas for future work.\n\n2 Definitions\nDefinition 1. (Error Function) Let C be the finite \"label space\" of possible labels of the true model\nf used to generate data, and let Y be the associated unit simplex of class probabilities, which we\ncall the \"distributional space\". Let \u02c6Y \u2286 Y be the space of possible outputs of a classifier \u02c6f that\napproximates f. We will denote the classes in C predicted by these models (usually through an\nargmax) as \u02c6fc and fc. An error function E is a function E : Y \u00d7 \u02c6Y \u2192 R\u22650 \u222a {+\u221e}, with the\nproperty that E(y, \u02c6y) = \u221e when y \u2208 \u02c6Y^c \u2229 Y. This property intuitively means that the output of the\nerror function is infinite if the true class is outside of the classifier\u2019s prediction space. Given a point\nx, we call E(f (x), \u02c6f (x)) the error of \u02c6f on x. Common examples of error functions are mean\nsquared error, cross-entropy error, and 0-1 error (the indicator that the classes predicted by \u02c6f and f\nare different). Note that an error function is distinct from a loss function since it is neither required to\nbe differentiable nor continuous.\nDefinition 2. 
(Confidence) The commonly accepted definition of classifier confidence [19] [2] [18]\n[3] is the probability that the model\u2019s predicted class on an input x is the true class of x. Explicitly,\nthis is p(fc(x) = \u02c6fc(x)|x, \u02c6f ). This is also the complement of the predictive uncertainty [3] of a classifier,\nwhich is the probability that the model\u2019s prediction is incorrect [18].\n\nWhile confidence is sufficient in many cases, we would like to have a more general and flexible\ndefinition that can be tuned towards a specific user\u2019s goals. For example, some users may be\ninterested in top-k error, cross-entropy or mean squared error instead of 0-1 error. We can model this\nby rewriting the confidence definition with respect to an error function E:\n\np(fc(x) = \u02c6fc(x)|x, \u02c6f ) = p(E(f (x), \u02c6f (x)) = 0|x, \u02c6f )\n\nwhere E is the 0-1 error. We can now extend E beyond E0-1 to fit an end-user\u2019s goals. We can make\nthis definition even more general by borrowing ideas from the Probably Approximately Correct (PAC)\nLearning framework [30] and allowing users to specify an error tolerance \u03b4. For example, some users\nmay only require that their prediction error fall below a specific \u03b4 for their model to be considered competent.\nOne could imagine that for highly precise problems with low threshold for error, \u03b4 would be quite\nlow, while less stringent use-cases could allow for larger \u03b4\u2019s. The relaxation of the prediction error\nleads to the generalized notion of \u03b4-competence, which we define as p(E(f (x), \u02c6f (x)) < \u03b4|x, \u02c6f ).\nConfidence can be recovered by setting E = E0-1 and \u03b4 \u2208 (0, 1).\nAllowing both \u03b4 and E to vary gives fine control to an end-user about the details of a model\u2019s\nperformance with respect to a specific error function.\nDefinition 3. 
(\u03b4-\u03b5 Competence) The true \u03b4-competence of a model at a given point is the binary\nvariable 1[E(f (x), \u02c6f (x)) < \u03b4] given x, f, \u02c6f, where E is an error function (Definition 1). Note that E becomes\na random variable when f is unknown since E is a deterministic function of the uncertain variable\nf (x) \u2014 this notion of randomness is slightly distinct from treating \u02c6f as a random variable due to finite\ndata. Given that f is unknown, we must estimate the \u03b4-competence, which can now be written as\np(E(f (x), \u02c6f (x)) < \u03b4|x, \u02c6f ). Putting a risk threshold \u03b5 on the value of the \u03b4-competence leads us to the\nfollowing notion: A model is \u03b4-\u03b5 competent with respect to E at x if p(E(f (x), \u02c6f (x)) < \u03b4|x, \u02c6f ) > \u03b5,\nor it is likely to be approximately correct.\n\nThis definition of competence allows a user to set a correctness threshold (\u03b4) on how close the\nprediction and the true output need to be in order to be considered approximately correct, as well as\nset a risk threshold (\u03b5) on the probability that this prediction is approximately correct with respect to\nany error function. These thresholds and error functions allow for a flexible definition of competence\nthat can be adjusted depending on the application. This also follows the definition of trust in [14]\nas \"the attitude that an agent will help achieve an individual\u2019s goals in a situation characterized by\nuncertainty and vulnerability.\"\nSince we neither have access to labels nor have enough information to efficiently compute the true\nprobability distribution p(E(f (x), \u02c6f (x)) < \u03b4|x, \u02c6f ) we seek to estimate this probability. We make\nthis clear with the following definition:\nDefinition 4. 
(Competence Estimator) A competence estimator of a model \u02c6f with respect to\nthe error function E is a function g \u02c6f : X \u00d7 R \u2192 [0, 1], where X is the space of inputs, that is a\nstatistical point estimator of the true variable 1[E(f (x), \u02c6f (x)) < \u03b4] given x, \u02c6f, f. In particular, g \u02c6f (x, \u03b4) =\n\u02c6p(E(f (x), \u02c6f (x)) < \u03b4|x, \u02c6f ).\n\nIn what follows we omit conditioning on \u02c6f in our notation with the note that all subsequent probabilities\nare conditioned on \u02c6f.\n\n3 Related Work\n\nCompetence estimation is closely tied with the well-studied areas of predictive uncertainty and\nconfidence estimation, which can further be divided into Bayesian approaches such as [9] [7] [17], or\nnon-Bayesian approaches including [5], [22], [13]. Bayesian methods attempt to determine some\ndistribution about each of the weights in a network and predict a distribution of outputs using this\ndistribution of weights. Computing the uncertainty of a prediction then becomes computing statistics\nabout the estimated output distribution. These estimates tend to perform well, but tend not to be\nscalable to high dimensional datasets or larger networks. The non-Bayesian methods traditionally fall\nunder ensemble approaches [13], training on out-of-distribution data [18] [22] [29], or dropout [5].\nThese methods tend to apply only to a certain subset of classifiers (such as models with dropout for [5])\nor to require modifications to the models in order to compute uncertainty [19]. Many of these methods\nare built on the unmodified model confidence [5], and thus could be supplementary to our new\ncompetence score. To the best of our knowledge there are no existing Bayesian or non-Bayesian\nmethods that consider competence with respect to error functions other than 0-1 error nor methods\nthat have tunable tolerance parameters.\nAnother related area of research is interpretable machine learning. 
Methods such as prototype\nnetworks [28] or LIME [25] are very useful in explaining why a classifier is making a prediction, and\nwe expect these methods to augment our work. However, competence prediction does not attempt to\nexplain the predictions of a classifier in any way\u2014we simply seek to determine whether or not the\nclassifier is competent on a point, without worrying about why or how the model made that decision.\nIn this sense we are more closely aligned with calibration [6] [29] [31], which adjusts prediction scores to\nmatch class-conditional probabilities, which are interpretable scores; works such as [26]\nare orthogonal to ours. While our goal is not to compute class probabilities, our method similarly\nprovides an interpretable probability score that the model is competent.\nThe closest estimators to our own are [2] and [8]. [2] learns a meta model that ensembles transfer\nclassifiers\u2019 predictions to predict whether or not the overall network has a correct classification.\nConversely, [8] computes the ratio of the distance to the predicted class and the second highest\npredicted class as a Trust Score. While [2] takes into account data uncertainty with transfer classifiers,\nit does not explicitly take into account distributional or model uncertainty. In contrast, [8] considers\nneither model nor data uncertainty explicitly, though it does model distributional uncertainty similarly\nto [13], [15], and [16]. Further, both merely rank examples according to uncertainty measures that\nare not human-interpretable. They also focus on confidence rather than competence, which does not\nallow them to generalize to either more nuanced error functions or varying margins of error.\n\nTo the best of our knowledge, the ALICE Score is the first competence estimator that is scalable to\nlarge models and datasets and is generalizable to all classifiers, error functions, and performance\nlevels. 
Our method takes into account all three aspects of predictive uncertainty in order to accurately\npredict competence on all of the models and datasets that it has encountered, regardless of the stage\nof training. Further, it does not require any out-of-distribution data to train on and can easily be\ninterpreted as a probability of model competence. It also provides tunable parameters of \u03b4, \u03b5, and E,\nallowing for a more flexible version of competence that can fit a variety of users\u2019 needs.\n\n4 Evaluating Competence Estimators\n4.1 Binary \u03b4-\u03b5 Competence Classification\n\nWe consider the task of pointwise binary competence classification. Given f (x) and \u02c6f (x), we can\ndirectly calculate E(f (x), \u02c6f (x)) and thus the model\u2019s true \u03b4-competence on x. Given a competence\nestimator, we can then predict if the model is \u03b4-competent on x, thus creating a binary classification\ntask parametrized by \u03b5. This allows us to use standard binary classification metrics such as Average\nPrecision (AP) across all recall values to evaluate the competence estimator.\nWe note that the true model competence is nondecreasing as \u03b4 increases since we are strictly increasing\nthe support. In particular, we have that the model is truly incompetent with respect to E on all points\nwhen \u03b4 = 0, and the model is truly competent with respect to E on all points as \u03b4 \u2192 \u221e as long as E\nis bounded above. This makes it difficult to pick a single \u03b4 that is representative of the performance\nof the competence estimator on a range of \u03b4\u2019s. To mitigate this issue we report mean AP over a range\nof \u03b4\u2019s, as this averages the estimator\u2019s precision across these error tolerances.\nNote that this metric only evaluates how well each estimator orders the test points based on competence,\nand does not consider the actual value of the score. 
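As a concrete illustration of this ranking-based evaluation (a sketch under our own simplifications, not the authors' code; the toy AP implementation and the input names are illustrative), mean AP over a range of \u03b4's can be computed as:

```python
import numpy as np

def average_precision(labels, scores):
    """AP of a ranking: mean of the precision at each positive example's rank."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    precision_at_k = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return float(np.sum(precision_at_k * labels) / labels.sum())

def mean_ap(errors, scores, deltas):
    """Mean AP of a competence estimator across error tolerances.

    errors: true pointwise errors E(f(x), f_hat(x)), available at evaluation time
    scores: the estimator's outputs, used only as a ranking
    deltas: the range of error tolerances delta to average over
    """
    aps = [average_precision((errors < d).astype(int), scores)
           for d in deltas
           if 0 < (errors < d).sum() < len(errors)]  # AP needs both classes present
    return float(np.mean(aps))
```

An estimator whose ranking perfectly matches the true errors attains a mean AP of 1.0 for every tolerance.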
We use this ranking-based metric since some competence\nestimators (e.g. TrustScore) only seek to rank points based on competence and do not care what\nthe magnitude of the final score is. As a technical detail, this means that we cannot parametrize the\ncomputation of Average Precision by \u03b5 (since some estimators don\u2019t output scores in the range [0, 1]),\nand must instead parametrize each estimator\u2019s AP computation separately by thresholding on that\nestimator\u2019s output.\n\n5 The ALICE Score: \u03b4-\u03b5 competence estimation\n\nWe would like to determine whether or not the model is competent on a point without knowledge of\nground truth, as in a test-set scenario where the user does not have access to the labels of a data point.\nFormally, given a \u03b4 and an input x, we want to estimate p(E(f (x), \u02c6f (x)) < \u03b4|x).\nWe write p(E(f (x), \u02c6f (x)) < \u03b4|x) as p(E < \u03b4|x), where E is the random variable that denotes the\nvalue of the E function given a point x and its label f (x). We begin by marginalizing over the possible\nlabel values f (x) = cj \u2208 Y (where cj is the one-hot label for class j):\n\np(E < \u03b4|x) = \u2211_{cj \u2208 Y} p(E < \u03b4|cj, x)p(cj|x) (1)\n= \u2211_{cj \u2208 \u02c6Y \u2229 Y} p(E < \u03b4|cj, x)p(cj|x) + \u2211_{cj \u2208 \u02c6Y^c \u2229 Y} p(E < \u03b4|cj, x)p(cj|x) (2)\n= \u2211_{cj \u2208 \u02c6Y} p(E < \u03b4|cj, x)p(cj|x) (3)\n\nNote that E(cj, \u02c6f (x)) was defined to be \u221e when cj \u2208 \u02c6Y^c \u2229 Y (Definition 1), thus the rightmost\nsummation in Equation 2 is 0 for all \u03b4. Furthermore, since \u02c6Y \u2286 Y (Definition 1) we have \u02c6Y \u2229 Y = \u02c6Y,\nwhich gives the final equality. 
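As a sanity check on this marginalization (a toy example with hypothetical numbers, not from the paper), Equations 1-3 can be evaluated directly for the 0-1 error:

```python
import numpy as np

# Toy setup: three classes, and the model predicts class 0.
# p_label[j] is a hypothetical p(c_j | x); f_hat_pred is the predicted class index.
p_label = np.array([0.7, 0.2, 0.1])
f_hat_pred = 0
delta = 0.5  # under 0-1 error, any delta in (0, 1) recovers plain confidence

# p(E < delta | c_j, x) for 0-1 error is 1 exactly when c_j matches the prediction
p_err_below = np.array([1.0 if j == f_hat_pred else 0.0 for j in range(3)])

# Equations 1/3: sum over labels of p(E < delta | c_j, x) p(c_j | x);
# under 0-1 error this collapses to the model confidence p(c_0 | x) = 0.7
p_competent = float(np.sum(p_err_below * p_label))
```

This matches the earlier observation that confidence is the special case of \u03b4-competence with E = E0-1.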
To explicitly capture distributional uncertainty, we now marginalize\nover the variable D, which we define as the event that x is in-distribution:\n\np(E < \u03b4|x) = \u2211_{cj \u2208 \u02c6Y} p(E < \u03b4|cj, x)p(cj|x)\n= \u2211_{cj \u2208 \u02c6Y} p(E < \u03b4|cj, x, D)p(cj|x, D)p(D|x) + \u2211_{cj \u2208 \u02c6Y} p(E < \u03b4|cj, x, \u00acD)p(cj|x, \u00acD)p(\u00acD|x) (4)\n\nConsider the rightmost summation in Equation 4. This represents the probability that the model is\ncompetent on the point x assuming that x is out-of-distribution (the event \u00acD). However, this term is intractable to\napproximate due to distributional uncertainty. Given only in-distribution training data, we assume that\nwe cannot know whether the model will be competent on out-of-distribution test points. To mitigate\nthis concern we lower bound the estimation by setting this term to 0 \u2014 this introduces the inductive\nbias that the model is not competent on points that are out-of-distribution. This simplification yields:\n\np(E < \u03b4|x) \u2265 p(D|x) \u2211_{cj \u2208 \u02c6Y} p(E < \u03b4|cj, x)p(cj|x, D) (5)\n\nThis allows our estimate to err on the side of caution as we would rather predict that the model is\nincompetent even if it is truly competent compared to the opposite situation. We approximate each of\nthe terms in Equation 5 in turn.\n\n5.1 Approximating p(D|x)\n\nThis term computes the probability that a point x is in-distribution. We follow a method derived\nfrom the state-of-the-art anomaly detector [16] to compute this term: For each class j we fit a\nclass-conditional Gaussian Gj to the set {x \u2208 Xtrain : \u02c6f (x) = cj} where Xtrain is the training\ndata. Given a test point x we then compute the Mahalanobis distance dj between x and Gj. 
In\norder to turn this distance into a probability, we consider the empirical distribution \u03b2j of possible\nin-distribution distances by computing the distance of each training point to the Gaussian Gj, and\nthen computing the survival function. We take the maximum value of the survival function across all\nj. This intuitively models the probability that the point is in-distribution with respect to any class.\nExplicitly, we have p(D|x) = max_j (1 \u2212 CDF_{\u03b2j}(dj)). Note that this term measures distribution shift,\nwhich closely aligns with distributional uncertainty.\n\n5.2 Approximating p(E < \u03b4|x, cj)\n\nThis term computes the probability that the error at the point x is less than \u03b4 given that the one-hot\nlabel is cj. We directly compute E(cj, \u02c6f (x)), then simply check whether or not this error is less than\n\u03b4. Note that this value is always 1 or 0 since it is the indicator 1[E(cj, \u02c6f (x)) < \u03b4], and that this term\nestimates the difference between the predictions of f and \u02c6f, which aligns with model uncertainty.\n\n5.3 Approximating p(cj|x, D)\n\nThis term computes the probability that a point x is of class j, given that it is in-distribution. To\nestimate this class probability, we fit a transfer classifier at the given layer and use its class-probability\noutput, \u02c6p(cj|x, D). Since the test points are assumed to be in-distribution, we can trust the output\nof the classifier as long as it is calibrated \u2014 that is, for all x with p(cj|x) = p, a proportion p of them belong\nto class j. [21] examines the calibration of various classifiers, and shows that Logistic Regression\n(LR) Classifiers are well calibrated. Random Forests and Bagged Decision Trees are also calibrated\n[21]; however, we find that the choice of calibrated classifier has little effect on the accuracy of our\ncompetence estimator. 
Note that \u2014 with a perfectly calibrated classifier \u2014 this term estimates the\nuncertainty inherent in the data (e.g. a red/blue classifier will always be uncertain on purple inputs\ndue to class overlap), which closely aligns with data uncertainty.\n\n5.4 The ALICE Score\n\nPutting all of these approximations together yields the ALICE Score:\n\np(E(f (x), \u02c6f (x)) < \u03b4|x) \u2248 max_j (1 \u2212 CDF_{\u03b2j}(dj)) \u2211_{cj \u2208 \u02c6Y} 1[E( \u02c6f (x), cj) < \u03b4] \u02c6p(cj|x, D) (6)\n\nNote that the ALICE Score can be written at layer l of a neural network by treating x as the activation\nof layer l in a network and using those activations for the transfer classifiers and the class-conditional\nGaussians.\nWe do not claim that the individual components of the ALICE Score are optimal nor that our estimator\nis optimal \u2014 we merely wish to demonstrate that the ALICE framework of expressing competence\nestimation according to Equation 6 is empirically effective.\n\n6 Experiments and Results\n\n6.1 Experimental Setup\n\nWe conduct a variety of experiments to empirically evaluate ALICE as a competence estimator\nfor classification tasks. We vary the model, training times, dataset, and error function to show the\nrobustness of the ALICE Score to different variables. We compute metrics for competence prediction\nby simply using the score as a ranking and thresholding by recall values to compare with other scores\nthat are neither \u03b5-aware nor calibrated, as discussed in Section 4. The mean Average Precision is\ncomputed across 100 \u03b4\u2019s linearly spaced between the minimum and maximum of the E output (e.g.\nfor cross-entropy we space \u03b4\u2019s between the minimum and the maximum cross-entropy error on a\nvalidation set). 
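Before turning to the experiments, the pipeline of Section 5 can be sketched end to end. This is illustrative only: the Gaussian fit, the empirical survival function, and the stand-in "transfer classifier" probabilities passed in as an array are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def fit_gaussians(feats, preds, n_classes, reg=1e-6):
    """Per predicted class: mean, precision matrix, and the sorted training
    Mahalanobis distances (the empirical distribution beta_j)."""
    model = []
    for j in range(n_classes):
        Xj = feats[preds == j]
        mu = Xj.mean(axis=0)
        prec = np.linalg.inv(np.cov(Xj, rowvar=False) + reg * np.eye(Xj.shape[1]))
        d = np.sqrt(np.einsum('ni,ij,nj->n', Xj - mu, prec, Xj - mu))
        model.append((mu, prec, np.sort(d)))
    return model

def p_in_distribution(x, model):
    """p(D|x): max over classes of the empirical survival function of x's distance."""
    surv = []
    for mu, prec, train_d in model:
        d = np.sqrt((x - mu) @ prec @ (x - mu))
        surv.append(np.mean(train_d > d))  # 1 - empirical CDF of beta_j
    return float(max(surv))

def alice_score(x, model, class_probs, f_hat_out, error_fn, delta):
    """Eq. 6 sketch: p(D|x) * sum_j 1[E(f_hat(x), c_j) < delta] * p_hat(c_j|x, D).

    class_probs stands in for a calibrated transfer classifier's output."""
    n = len(class_probs)
    mass = sum(p for j, p in enumerate(class_probs)
               if error_fn(f_hat_out, np.eye(n)[j]) < delta)
    return p_in_distribution(x, model) * mass
```

On toy data, a point far from every class-conditional Gaussian gets survival 0 and hence score 0, matching the inductive bias that the model is incompetent out-of-distribution.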
For all experiments, we compute ALICE scores on the penultimate layer, as we\nempirically found this layer to provide the best results \u2014 we believe this is due to the penultimate\nlayer having the most well-formed representations before the \ufb01nal predictions. We compare our\nmethod only with Trust Score and model con\ufb01dence (usually the softmax score) since they apply to\nall models and do not require extraneous data. Further experimental details are provided in Appendix\nA.\n\n6.2 Predictive Uncertainty Experiments\n\nSince competence is a generalized form of con\ufb01dence, and con\ufb01dence amalgamates all forms of\npredictive uncertainty, competence estimators must account for these factors as well. We empirically\nshow that ALICE can accurately predict competence when encountering all three types of predictive\nuncertainty \u2014 note that we do not claim that the ALICE framework perfectly disentangles these three\nfacets, merely that each term is essential to account for all forms of predictive uncertainty.\nWe \ufb01rst examine model uncertainty by performing an ablation study on both over\ufb01t and under\ufb01t\nclassical models on DIGITS and VGG16 [27] on CIFAR100 [11]. Details about these models are\nin Appendix A. As expected, ALICE strongly outperforms the other metrics in areas of over and\nunder\ufb01tting and weakly outperforms in regions where the network is trained well (Table 1). Further,\nwe highlight a speci\ufb01c form of model uncertainty in Figure 1 by performing the same ablation study\non the common situation of class-imbalanced datasets. We remove 95% of the training data for the\n\ufb01nal 5 classes of CIFAR10 so that the model is poorly matched to these low-count classes, thus\nintroducing model uncertainty. Figure 1 shows the mean Average Precision (mAP) of competence\nprediction on the unmodi\ufb01ed CIFAR10 test set after fully training VGG16 on the class-imbalanced\nCIFAR10 dataset. 
While all of the metrics perform similarly on the classes of high count, neither\nsoftmax (orange) nor Trust Score (green) was able to accurately predict competence on the low count\nclasses. ALICE (blue), on the other hand, correctly identifies competence on all classes because\nALICE considers model uncertainty. We additionally show that omitting the term p(E < \u03b4|x, cj)\nremoves this capability, thus empirically showing that this term is necessary to perform accurate\ncompetence estimation under situations of model uncertainty.\nWhile Figure 1 and Table 1 show ALICE\u2019s performance under situations of high model uncertainty,\nwe show ALICE\u2019s performance under situations of distributional uncertainty in Table 2. First we\n\n(a) mAP of competence scores (E = cross-entropy)\n\n(b) mAP of competence scores (E = 0-1 error)\n\nFigure 1: Competence Scores on Class Imbalanced CIFAR10\n\nTable 1: mAP for Competence Prediction Under Model Uncertainty (E = cross-entropy). VGG16 is tested on\nCIFAR100 while the other models are on DIGITS. (U) is underfit, (W) is well trained, and (O) is overfit. Ablated\nALICE refers to ALICE without the p(E < \u03b4|x, cj) terms. 
Hyperparameters for these trials are in Appendix A.\n\nModel | Accuracy | Softmax | TrustScore | Ablated ALICE | ALICE\nMLP (U) | .121 \u00b1 .048 | .0486 \u00b1 .015 | .505 \u00b1 .27 | .0538 \u00b1 .031 | .999 \u00b1 .0015\nMLP (W) | .898 \u00b1 .022 | .989 \u00b1 .005 | .929 \u00b1 .044 | .958 \u00b1 .042 | .998 \u00b1 .001\nMLP (O) | .097 \u00b1 .015 | .532 \u00b1 .062 | .768 \u00b1 .064 | .576 \u00b1 .033 | .996 \u00b1 .003\nRF (U) | .563 \u00b1 .078 | .824 \u00b1 .16 | .504 \u00b1 .33 | .290 \u00b1 .322 | .999 \u00b1 .0011\nRF (W) | .930 \u00b1 .019 | .998 \u00b1 .002 | .898 \u00b1 .025 | .923 \u00b1 .016 | .999 \u00b1 .000\nSVM (U) | .630 \u00b1 .018 | .995 \u00b1 .003 | .626 \u00b1 .046 | .496 \u00b1 .069 | 1.00 \u00b1 .000\nSVM (W) | .984 \u00b1 .009 | 1.00 \u00b1 .000 | .931 \u00b1 .048 | .963 \u00b1 .038 | 1.00 \u00b1 .000\nSVM (O) | .258 \u00b1 .023 | .200 \u00b1 .16 | .215 \u00b1 .12 | .252 \u00b1 .16 | .981 \u00b1 .028\nVGG16 (U) | .0878 \u00b1 .0076 | .899 \u00b1 .014 | .292 \u00b1 .049 | .0369 \u00b1 .0041 | .913 \u00b1 .012\nVGG16 (W) | .498 \u00b1 .012 | .975 \u00b1 .013 | .604 \u00b1 .104 | .0863 \u00b1 .0071 | .978 \u00b1 .0082\nVGG16 (O) | .282 \u00b1 .15 | .659 \u00b1 .024 | .665 \u00b1 .0080 | .257 \u00b1 .018 | .738 \u00b1 .019\n\ndefine a distributional competence error function:\n\nED(f (x), \u02c6f (x)) = 0 if f (x) \u2208 \u02c6Y, and 1 if f (x) \u2209 \u02c6Y\n\nThis function is simply an indicator as to whether or not the true label of a point is in the predicted\nlabel space. We fully train ResNet32 on the unmodified CIFAR10 training set. We then compute\ncompetence scores with respect to ED on a test set with varying proportions of SVHN [20] (out-of-\ndistribution) and CIFAR10 (in-distribution) data. In this case Y = Y_CIFAR \u222a Y_SVHN but \u02c6Y = Y_CIFAR,\nthus ED is 1 on SVHN points and 0 on CIFAR points. Table 2 shows that both softmax and\nALICE without the p(D|x) term perform poorly on distributional competence. 
In contrast, both\nthe full ALICE score and Trust Score are able to estimate distributional competence in all levels of\ndistributional uncertainty \u2014 this is expected since ALICE contains methods derived from a state-\nof-the-art anomaly detector [16] and Trust Score considers distance to the training data. Note that\nthis construction of the distributional competence function is a clear example of how the general\nnotion of competence can vary tremendously depending on the task at hand, and ALICE is capable of\naccurate competence estimation for any of these notions of competence.\n\nTable 2: mAP for Competence Prediction Under Distributional Uncertainty (E = ED).\n\nCIFAR/SVHN Proportion | Softmax | TrustScore | Ablated ALICE | ALICE\n10/90 | .458 \u00b1 0.056 | .518 \u00b1 0.039 | .100 \u00b1 0.000 | .868 \u00b1 0.014\n30/70 | .693 \u00b1 0.034 | .721 \u00b1 0.026 | .300 \u00b1 0.000 | .946 \u00b1 0.007\n50/50 | .816 \u00b1 0.020 | .833 \u00b1 0.015 | .500 \u00b1 0.000 | .970 \u00b1 0.003\n70/30 | .901 \u00b1 0.010 | .910 \u00b1 0.008 | .700 \u00b1 0.000 | .985 \u00b1 0.002\n90/10 | .970 \u00b1 0.003 | .972 \u00b1 0.002 | .900 \u00b1 0.000 | .997 \u00b1 0.001\n\nFigure 2: Competence Visualization on CIFAR10 (\u03b4 = .001, E = cross-entropy). Points are projected to two\ndimensions with Neighborhood Component Analysis. From left to right, figures are colored by the class label,\nALICE Score, Ablated ALICE Score, and inverse error (so darker colors imply competence).\n\nWe examine how ALICE captures data uncertainty by observing competence predictions in areas\nof class overlap in Figure 2. Here we trained VGG16 on CIFAR10 [10] and visualized competence\nscores with respect to cross-entropy. Note that the competence scores are very low in areas of class\noverlap, and that these regions also match with areas of high error. 
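The distributional competence error ED used above is just a membership indicator on the classifier's predicted label space; a minimal sketch (the class names are illustrative stand-ins, not the datasets' actual label sets):

```python
def e_distributional(true_class, predicted_label_space):
    """E_D: 0 if the true label lies in the classifier's label space, else 1."""
    return 0.0 if true_class in predicted_label_space else 1.0

# Illustrative label space: a classifier that only knows (a subset of) CIFAR10
# classes assigns error 1 to any out-of-distribution label, e.g. an SVHN digit.
cifar_like_classes = {"airplane", "cat", "ship"}
```

With this E, the model is \u03b4-competent on a point (for any \u03b4 in (0, 1)) exactly when the point's true label is in-distribution.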
Additional experiments with\nvarying models, error functions, and levels of uncertainty are provided in Appendix B.\n\n6.3 Calibration Experiments\n\nWhile the previous experiments show the ability of ALICE to rank points according to competence,\nwe now show the interpretability of the ALICE score through calibration curves. Note that we are\nnot attempting to interpret or explain why the model has made the decision that it has; we simply aim\nto show that the ALICE score matches its semantic meaning: for all points with ALICE score of\np, we expect a proportion p of them to be truly competent. To show this, we train ResNet32 on CIFAR100 and\ncompute ALICE scores at various stages of training and for different error functions (we use \u03b4 =\n0.2 when computing competence for Exent). We bin the ALICE scores into tenths ([0.0 - 0.1), [0.1\n- 0.2), ..., [0.9, 1.0)) and plot the true proportion of competent points for each bin as a histogram.\nNote that perfect competence estimation with infinite data would result in these histograms roughly\nresembling a y = x curve. We visualize the difference between our competence estimator and perfect\ncompetence estimation by showing these residuals as well as the number of points in each bin in\nFigure 3. Note that ALICE is relatively well-calibrated at all stages of training and for all error\nfunctions tested \u2014 this result shows that one can interpret the ALICE score as an automatically\ncalibrated probability that the model is competent on a particular point. This shows that not only does\nthe ALICE Score rank points accurately according to their competence but it also rightfully assigns\nthe correct probability values for various error functions and at all stages of training.\n\n(a) E0-1 (1)\n\n(b) E0-1 (5)\n\n(c) E0-1 (50)\n\n(d) Exent (1)\n\n(e) Exent (5)\n\n(f) Exent (50)\n\nFigure 3: ALICE score calibration of ResNet32 trained on CIFAR10, with various error functions and stages of\ntraining. 
The captions show the error functions and the number of epochs trained.

7 Conclusions and Future Work

In this work we present a new, flexible definition of competence. Our definition naturally generalizes the notion of confidence by allowing a variety of error functions as well as risk and correctness thresholds in order to construct a definition that is tunable to an end user's needs. We also develop the ALICE Score, an accurate layerwise interpretable competence estimator for classifiers. The ALICE Score is not only applicable to any classifier but also outperforms the state of the art in competence prediction. Further, we show that the ALICE Score is robust to out-of-distribution data, class imbalance, and poorly trained models due to our consideration of all three facets of predictive uncertainty.

The implications of an accurate competence estimator are far-reaching. For instance, future work could include using the ALICE Score to inform an active learning acquisition function by labeling points that a model is least competent on. One could also examine a network more closely by performing feature visualization or finding prototypes in areas of low competence, as this would elucidate which features are correlated with incompetence. This is particularly useful since the ALICE Score can be computed layerwise in order to find both low- and high-level features that the model is not competent on.
Competence estimators could also be used as test and evaluation metrics when a model is deployed, to detect both distributional shift and classification failure.

Future work will focus on extending the ALICE Score to supervised tasks other than classification, such as object detection, segmentation, and regression. Additionally, because many of the components of the ALICE Score are state-of-the-art for detecting adversarial examples, we expect that the ALICE Score would also be able to detect adversarial samples and assign them low competence, though we have not tested this explicitly. Further research will also include better approximations of the terms in the ALICE Score to improve competence estimation. Finally, we plan to explore different methods to ensemble the layerwise ALICE Scores into an overall ALICE Score for the model and determine whether or not that improves performance compared to the layerwise ALICE Scores.

Acknowledgements

The authors would like to thank the JHU/APL Internal Research and Development (IRAD) program for funding this research.

References

[1] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.

[2] Tongfei Chen, Jiří Navrátil, Vijay Iyengar, and Karthikeyan Shanmugam. Confidence scoring using whitebox meta-models with linear classifier probes. arXiv preprint arXiv:1805.05396, 2018.

[3] Yukun Ding, Jinglan Liu, Jinjun Xiong, and Yiyu Shi. Evaluation of neural network uncertainty estimation with application to resource-constrained platforms. CoRR, abs/1903.02050, 2019.

[4] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

[5] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Insights and applications.
In Deep Learning Workshop, ICML, volume 1, page 2, 2015.

[6] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1321–1330, 2017. URL http://proceedings.mlr.press/v70/guo17a.html.

[7] Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT '93, pages 5–13, New York, NY, USA, 1993. ACM. ISBN 0-89791-611-5. doi: 10.1145/168304.168306. URL http://doi.acm.org/10.1145/168304.168306.

[8] Heinrich Jiang, Been Kim, Melody Guan, and Maya Gupta. To trust or not to trust a classifier. In Advances in Neural Information Processing Systems, pages 5541–5552, 2018.

[9] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? CoRR, abs/1703.04977, 2017.

[10] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.

[11] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-100 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.

[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[13] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles.
In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.

[14] John D. Lee and Katrina A. See. Trust in automation: Designing for appropriate reliance. Human Factors, 46(1):50–80, 2004. doi: 10.1518/hfes.46.1.50_30392. URL https://doi.org/10.1518/hfes.46.1.50_30392. PMID: 15151155.

[15] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.

[16] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177, 2018.

[17] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, May 1992. ISSN 0899-7667. doi: 10.1162/neco.1992.4.3.448. URL http://dx.doi.org/10.1162/neco.1992.4.3.448.

[18] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7047–7058. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7936-predictive-uncertainty-estimation-via-prior-networks.pdf.

[19] Amit Mandelbaum and Daphna Weinshall. Distance-based confidence score for neural network classifiers. arXiv preprint arXiv:1709.09844, 2017.

[20] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[21] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 625–632, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5.
doi: 10.1145/1102351.1102430. URL http://doi.acm.org/10.1145/1102351.1102430.

[22] Philipp Oberdiek, Matthias Rottmann, and Hanno Gottschalk. Classification uncertainty of deep neural networks based on gradient information. In ANNPR, 2018.

[23] Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009. ISBN 0262170051, 9780262170055.

[24] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.

[25] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. CoRR, abs/1602.04938, 2016.

[26] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206, 2019.

[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[28] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. CoRR, abs/1703.05175, 2017.

[29] Akshayvarun Subramanya, Suraj Srinivas, and R. Venkatesh Babu. Confidence estimation in deep neural networks via density modelling. CoRR, abs/1707.07013, 2017.

[30] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984. ISSN 0001-0782. doi: 10.1145/1968.1972. URL http://doi.acm.org/10.1145/1968.1972.

[31] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates, 2002.