{"title": "Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift", "book": "Advances in Neural Information Processing Systems", "page_first": 13991, "page_last": 14002, "abstract": "Modern machine learning methods including deep learning have achieved great success in predictive accuracy for supervised learning tasks, but may still fall short in giving useful estimates of their predictive uncertainty. Quantifying uncertainty is especially critical in real-world settings, which often involve input distributions that are shifted from the training distribution due to a variety of factors including sample bias and non-stationarity. In such settings, well calibrated uncertainty estimates convey information about when a model's output should (or should not) be trusted. Many probabilistic deep learning methods, including Bayesian and non-Bayesian methods, have been proposed in the literature for quantifying predictive uncertainty, but to our knowledge there has not previously been a rigorous large-scale empirical comparison of these methods under dataset shift. We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration. We find that traditional post-hoc calibration does indeed fall short, as do several other previous methods. However, some methods that marginalize over models give surprisingly strong results across a broad spectrum of tasks.", "full_text": "Can You Trust Your Model\u2019s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift\n\nYaniv Ovadia\u2217\nGoogle Research\nyovadia@google.com\n\nEmily Fertig\u2217\u2020\nGoogle Research\nemilyaf@google.com\n\nJie Ren\u2020\nGoogle Research\njjren@google.com\n\nZachary Nado\nGoogle Research\nznado@google.com\n\nD Sculley\nGoogle Research\ndsculley@google.com\n\nSebastian Nowozin\nGoogle Research\nnowozin@google.com\n\nJoshua V. 
Dillon\nGoogle Research\njvdillon@google.com\n\nBalaji Lakshminarayanan\u2021\nDeepMind\nbalajiln@google.com\n\nJasper Snoek\u2021\nGoogle Research\njsnoek@google.com\n\nAbstract\n\nModern machine learning methods including deep learning have achieved great\nsuccess in predictive accuracy for supervised learning tasks, but may still fall short\nin giving useful estimates of their predictive uncertainty. Quantifying uncertainty\nis especially critical in real-world settings, which often involve input distributions\nthat are shifted from the training distribution due to a variety of factors including\nsample bias and non-stationarity. In such settings, well calibrated uncertainty\nestimates convey information about when a model\u2019s output should (or should not)\nbe trusted. Many probabilistic deep learning methods, including Bayesian and\nnon-Bayesian methods, have been proposed in the literature for quantifying predictive\nuncertainty, but to our knowledge there has not previously been a rigorous\nlarge-scale empirical comparison of these methods under dataset shift. We present a\nlarge-scale benchmark of existing state-of-the-art methods on classification problems\nand investigate the effect of dataset shift on accuracy and calibration. We find that\ntraditional post-hoc calibration does indeed fall short, as do several other previous\nmethods. However, some methods that marginalize over models give surprisingly\nstrong results across a broad spectrum of tasks.\n\n1 Introduction\nRecent successes across a variety of domains have led to the widespread deployment of deep\nneural networks (DNNs) in practice. 
Consequently, the predictive distributions of these models are\nincreasingly being used to make decisions in important applications ranging from machine-learning-aided\nmedical diagnoses from imaging (Esteva et al., 2017) to self-driving cars (Bojarski et al., 2016).\nSuch high-stakes applications require not only point predictions but also accurate quantification\nof predictive uncertainty, i.e. meaningful confidence values in addition to class predictions. With\nsufficient independent labeled samples from a target data distribution, one can estimate how well\na model\u2019s confidence aligns with its accuracy and adjust the predictions accordingly. However, in\npractice, once a model is deployed the distribution over observed data may shift and eventually be\nvery different from the original training data distribution. Consider, e.g., online services for which the\ndata distribution may change with the time of day, seasonality or popular trends. Indeed, robustness\nunder conditions of distributional shift and out-of-distribution (OOD) inputs is necessary for the\nsafe deployment of machine learning (Amodei et al., 2016). For such settings, calibrated predictive\nuncertainty is important because it enables accurate assessment of risk, allows practitioners to know\nhow accuracy may degrade, and allows a system to abstain from decisions due to low confidence.\n\n\u2217Equal contribution\n\u2020AI Resident\n\u2021Corresponding authors\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nA variety of methods have been developed for quantifying predictive uncertainty in DNNs. 
Probabilistic\nneural networks such as mixture density networks (MacKay & Gibbs, 1999) capture the inherent\nambiguity in outputs for a given input, also referred to as aleatoric uncertainty (Kendall & Gal, 2017).\nBayesian neural networks learn a posterior distribution over parameters that quantifies parameter\nuncertainty, a type of epistemic uncertainty that can be reduced through the collection of additional\ndata. Popular approximate Bayesian approaches include Laplace approximation (MacKay, 1992),\nvariational inference (Graves, 2011; Blundell et al., 2015), dropout-based variational inference (Gal\n& Ghahramani, 2016; Kingma et al., 2015), expectation propagation (Hern\u00e1ndez-Lobato & Adams,\n2015) and stochastic gradient MCMC (Welling & Teh, 2011). Non-Bayesian methods include\ntraining multiple probabilistic neural networks with bootstrap or ensembling (Osband et al., 2016;\nLakshminarayanan et al., 2017). Another popular non-Bayesian approach involves re-calibration of\nprobabilities on a held-out validation set through temperature scaling (Platt, 1999), which was shown\nby Guo et al. (2017) to lead to well-calibrated predictions on the i.i.d. test set.\n\nUsing Distributional Shift to Evaluate Predictive Uncertainty. While previous work has evaluated\nthe quality of predictive uncertainty on OOD inputs (Lakshminarayanan et al., 2017), there has not\nto our knowledge been a comprehensive evaluation of uncertainty estimates from different methods\nunder dataset shift. Indeed, we suggest that effective evaluation of predictive uncertainty is most\nmeaningful under conditions of distributional shift. One reason for this is that post-hoc calibration\ngives good results in independent and identically distributed (i.i.d.) regimes, but can fail under even a\nmild shift in the input data. Moreover, in real-world applications, as described above, distributional shift is\nwidely prevalent. 
Understanding questions of risk, uncertainty, and trust in a model\u2019s output becomes\nincreasingly critical as shift from the original training data grows larger.\n\nContributions. In the spirit of calls for more rigorous understanding of existing methods (Lipton\n& Steinhardt, 2018; Sculley et al., 2018; Rahimi & Recht, 2017), this paper provides a benchmark\nfor evaluating uncertainty that focuses not only on the i.i.d. setting but also on uncertainty under\ndistributional shift. We present a large-scale evaluation of popular approaches in probabilistic deep\nlearning, focusing on methods that operate well in large-scale settings, and evaluate them on a diverse\nrange of classification benchmarks across image, text, and categorical modalities. We use these\nexperiments to evaluate the following questions:\n\u2022 How trustworthy are the uncertainty estimates of different methods under dataset shift?\n\u2022 Does calibration in the i.i.d. setting translate to calibration under dataset shift?\n\u2022 How do uncertainty and accuracy of different methods co-vary under dataset shift? Are there\nmethods that consistently do well in this regime?\n\nIn addition to answering the questions above, our code is made available open-source along with our\nmodel predictions such that researchers can easily evaluate their approaches on these benchmarks.\u2074\n\n\u2074https://github.com/google-research/google-research/tree/master/uq_benchmark_2019\n\n2 Background\nNotation and Problem Setup. Let x \u2208 \u211d^d represent a set of d-dimensional features and y \u2208\n{1, ..., k} denote corresponding labels (targets) for k-class classification. We assume that a training\ndataset D consists of N i.i.d. samples, D = {(x_n, y_n)}_{n=1}^N.\nLet p^*(x, y) denote the true distribution (unknown, observed only through the samples D), also\nreferred to as the data generating process. We focus on classification problems, in which the true\ndistribution is assumed to be a discrete distribution over k classes, and the observed y \u2208 {1, ..., k}\nis a sample from the conditional distribution p^*(y|x). We use a neural network to model p_\u03b8(y|x) and\nestimate the parameters \u03b8 using the training dataset. At test time, we evaluate the model predictions\nagainst a test set, sampled from the same distribution as the training dataset. However, here we also\nevaluate the model against OOD inputs sampled from q(x, y) \u2260 p^*(x, y). In particular, we consider\ntwo kinds of shifts:\n\u2022 shifted versions of the test inputs where the ground truth label belongs to one of the k classes. We\nuse shifts such as corruptions and perturbations proposed by Hendrycks & Dietterich (2019), and\nideally would like the model predictions to become more uncertain with increased shift, assuming\nshift degrades accuracy. This is also referred to as covariate shift (Sugiyama et al., 2017).\n\u2022 a completely different OOD dataset, where the ground truth label is not one of the k classes. Here\nwe check if the model exhibits higher predictive uncertainty for those new instances and to this\nend report diagnostics that rely only on predictions and not ground truth labels.\n\nHigh-level overview of existing methods. A large variety of methods have been developed to either\nprovide higher quality uncertainty estimates or perform OOD detection to inform model confidence.\nThese can roughly be divided into:\n1. Methods which deal with p(y|x) only; we discuss these in more detail in Section 3.\n2. Methods which model the joint distribution p(y, x), e.g. deep hybrid models (Kingma et al., 2014;\nAlemi et al., 2018; Nalisnick et al., 2019; Behrmann et al., 2018).\n3. Methods with an OOD-detection component in addition to p(y|x) (Bishop, 1994; Lee et al., 2018;\nLiang et al., 2018), and related work on selective classification (Geifman & El-Yaniv, 2017).\n\nWe refer to Shafaei et al. 
(2018) for a recent summary of these methods. Due to the differences in\nmodeling assumptions, a fair comparison between these different classes of methods is challenging;\nfor instance, some OOD detection methods rely on knowledge of a known OOD set, or train using a\nnone-of-the-above class, and it may not always be meaningful to compare predictions from these\nmethods with those obtained from a Bayesian DNN. We focus on methods described by (1) above, as\nthis allows us to focus on methods which make the same modeling assumptions about data and differ\nonly in how they quantify predictive uncertainty.\n\n3 Methods and Metrics\nWe select a subset of methods from the probabilistic deep learning literature for their prevalence,\nscalability and practical applicability\u2075. These include (see also references within):\n\u2022 (Vanilla) Maximum softmax probability (Hendrycks & Gimpel, 2017)\n\u2022 (Temp Scaling) Post-hoc calibration by temperature scaling using a validation set (Guo et al., 2017)\n\u2022 (Dropout) Monte-Carlo Dropout (Gal & Ghahramani, 2016; Srivastava et al., 2015) with rate p\n\u2022 (Ensembles) Ensembles of M networks trained independently on the entire dataset using random\ninitialization (Lakshminarayanan et al., 2017) (we set M = 10 in experiments below)\n\u2022 (SVI) Stochastic Variational Bayesian Inference for deep learning (Blundell et al., 2015; Graves,\n2011; Louizos & Welling, 2017, 2016; Wen et al., 2018). We refer to Appendix A.6 for details of\nour SVI implementation.\n\u2022 (LL) Approx. 
Bayesian inference for the parameters of the last layer only (Riquelme et al., 2018)\n\u2013 (LL SVI) Mean field stochastic variational inference on the last layer only\n\u2013 (LL Dropout) Dropout only on the activations before the last layer\n\n\u2075The methods used scale well for training and prediction (see Appendix A.9). We also explored methods\nsuch as scalable extensions of Gaussian Processes (Hensman et al., 2015), but they were challenging to train on\nthe 37M example Criteo dataset or the 1000 classes of ImageNet.\n\nIn addition to metrics (we use arrows to indicate which direction is better) that do not depend on\npredictive uncertainty, such as classification accuracy \u2191, the following metrics are commonly used:\n\nNegative Log-Likelihood (NLL) \u2193 Commonly used to evaluate the quality of model uncertainty on\nsome held out set. Drawbacks: Although a proper scoring rule (Gneiting & Raftery, 2007), it can\nover-emphasize tail probabilities (Qui\u00f1onero-Candela et al., 2006).\nBrier Score \u2193 (Brier, 1950) Proper scoring rule for measuring the accuracy of predicted probabilities.\nIt is computed as the squared error of a predicted probability vector, p(y|x_n, \u03b8), and the one-hot\nencoded true response, y_n. That is,\n\nBS = |Y|^{-1} \u2211_{y\u2208Y} (p(y|x_n, \u03b8) \u2212 \u03b4(y \u2212 y_n))^2 = |Y|^{-1} (1 \u2212 2 p(y_n|x_n, \u03b8) + \u2211_{y\u2208Y} p(y|x_n, \u03b8)^2).    (1)\n\nThe Brier score has a convenient interpretation as BS = uncertainty \u2212 resolution + reliability,\nwhere uncertainty is the marginal uncertainty over labels, resolution measures the deviation of\nindividual predictions against the marginal, and reliability measures calibration as the average\nviolation of long-term true label frequencies. We refer to DeGroot & Fienberg (1983) for the\ndecomposition of Brier score into calibration and refinement for classification and to Br\u00f6cker (2009)\nfor the general decomposition for any proper scoring rule. 
Drawbacks: Brier score is insensitive to\npredicted probabilities associated with in/frequent events.\nBoth the Brier score and the negative log-likelihood are proper scoring rules and therefore the\noptimum score corresponds to a perfect prediction. In addition to these two metrics, we also evaluate\ntwo further metrics: expected calibration error and entropy. Neither of these is a proper scoring rule, and\nthus there exist trivial solutions which yield optimal scores; for example, returning the marginal\nprobability p(y) for every instance will yield perfectly calibrated but uninformative predictions. Each\nproper scoring rule induces a calibration measure (Br\u00f6cker, 2009). However, ECE is not the result of\nsuch a decomposition and has no corresponding proper scoring rule; we instead include ECE because\nit is popularly used and intuitive. Each proper scoring rule is also associated with a corresponding\nentropy function and Shannon entropy is that for log probability (Gneiting & Raftery, 2007).\nExpected Calibration Error (ECE) \u2193 Measures the correspondence between predicted probabilities\nand empirical accuracy (Naeini et al., 2015). It is computed as the average gap between within-bucket\naccuracy and within-bucket predicted probability for S buckets B_s = {n \u2208 {1, ..., N} :\np(y_n|x_n, \u03b8) \u2208 (\u03c1_s, \u03c1_{s+1}]}. That is, ECE = \u2211_{s=1}^S (|B_s|/N) |acc(B_s) \u2212 conf(B_s)|, where\nacc(B_s) = |B_s|^{-1} \u2211_{n\u2208B_s} [y_n = \u0177_n], conf(B_s) = |B_s|^{-1} \u2211_{n\u2208B_s} p(\u0177_n|x_n, \u03b8),\nand \u0177_n = arg max_y p(y|x_n, \u03b8) is the n-th prediction. When bins {\u03c1_s : s \u2208 {1, ..., S}} are quantiles of the held-out predicted\nprobabilities, |B_s| \u2248 |B_k| and the estimation error is approximately constant. Drawbacks: Due to\nbinning, ECE does not monotonically increase as predictions approach ground truth. If |B_s| \u2260 |B_k|,\nthe estimation error varies across bins.\nThere is no ground truth label for fully OOD inputs. 
Thus we report histograms of confidence\nand predictive entropy on known and OOD inputs and accuracy versus confidence plots\n(Lakshminarayanan et al., 2017): Given the prediction p(y = k|x_n, \u03b8), we define the predicted label as\n\u0177_n = arg max_y p(y|x_n, \u03b8), and the confidence as p(y = \u0177_n|x_n, \u03b8) = max_k p(y = k|x_n, \u03b8). We keep\nonly the test examples whose confidence is at least a given threshold \u03c4 \u2208 [0, 1] and compute the\naccuracy on this set.\n\n4 Experiments and Results\nWe evaluate the behavior of the predictive uncertainty of deep learning models on a variety of datasets\nacross three different modalities: images, text and categorical (online ad) data. For each we follow\nstandard training, validation and testing protocols, but we additionally evaluate results on increasingly\nshifted data and an OOD dataset. We detail the models and implementations used in Appendix A.\nHyperparameters were tuned for all methods using Bayesian optimization (Golovin et al., 2017)\n(except on ImageNet) as detailed in Appendix A.8.\n\n4.1 An illustrative example - MNIST\nWe first illustrate the problem setup and experiments using the MNIST dataset. We used the\nLeNet (LeCun et al., 1998) architecture, and, as with all our experiments, we follow standard training,\nvalidation, testing and hyperparameter tuning protocols. However, we also compute predictions on\nincreasingly shifted data (in this case increasingly rotated or horizontally translated images) and study\n\n[Figure 1 panels: (a) Rotated MNIST; (b) Translated MNIST; (c) Confidence vs Acc Rotated 60\u00b0; (d) Count vs Confidence Rotated 60\u00b0; (e) Entropy on OOD; (f) Confidence on OOD]\n\nFigure 1: Results on MNIST: 1(a) and 1(b) show accuracy and Brier score as the data is increasingly\nshifted. Shaded regions represent standard error over 10 runs. 
To understand the discrepancy between\naccuracy and Brier score, we explore the predictive distributions of each method by looking at the\ncon\ufb01dence of the predictions in 1(c) and 1(d). We also explore the entropy and con\ufb01dence of each\nmethod on entirely OOD data in 1(e) and 1(f). SVI has lower accuracy on the validation and test\nsplits, but it is signi\ufb01cantly more robust to dataset shift as evidenced by a lower Brier score, lower\noverall con\ufb01dence 1(d) and higher predictive entropy under shift (1(c)) and OOD data (1(e),1(f)).\n\nthe behavior of the predictive distributions of the models. In addition, we predict on a completely\nOOD dataset, Not-MNIST (Bulatov, 2011), and observe the entropy of the model\u2019s predictions. We\nsummarize some of our \ufb01ndings in Figure 1 and discuss below.\nWhat we would like to see: Naturally, we expect the accuracy of a model to degrade as it predicts\non increasingly shifted data, and ideally this reduction in accuracy would coincide with increased\nforecaster entropy. A model that was well-calibrated on the training and validation distributions would\nideally remain so on shifted data. If calibration (ECE or Brier reliability) remained as consistent\nas possible, practitioners and downstream tasks could take into account that a model is becoming\nincreasingly uncertain. On the completely OOD data, one would expect the predictive distributions to\nbe of high entropy. Essentially, we would like the predictions to indicate that a model \u201cknows what it\ndoes not know\u201d due to the inputs straying away from the training data distribution.\nWhat we observe: We see in Figures 1(a) and 1(b) that accuracy certainly degrades as a function of\nshift for all methods tested, and they are dif\ufb01cult to disambiguate on that metric. However, the Brier\nscore paints a clearer picture and we see a signi\ufb01cant difference between methods, i.e. 
prediction\nquality degrades more signi\ufb01cantly for some methods than others. An important observation is that\nwhile calibrating on the validation set leads to well-calibrated predictions on the test set, it does\nnot guarantee calibration on shifted data. In fact, nearly all other methods (except vanilla) perform\nbetter than the state-of-the-art post-hoc calibration (Temperature scaling) in terms of Brier score\nunder shift. While SVI achieves the worst accuracy on the test set, it actually outperforms all other\nmethods by a much larger margin when exposed to signi\ufb01cant shift. In Figures 1(c) and 1(d) we look\nat the distribution of con\ufb01dences for each method to understand the discrepancy between metrics. We\nsee in Figure 1(d) that SVI has the lowest con\ufb01dence in general but in Figure 1(c) we observe that\nSVI gives the highest accuracy at high con\ufb01dence (or conversely is much less frequently con\ufb01dently\nwrong), which can be important for high-stakes applications. Most methods demonstrate very low\nentropy (Figure 1(e)) and give high con\ufb01dence predictions (Figure 1(f)) on data that is entirely OOD,\ni.e. 
they are confidently wrong about completely OOD data.\n\n[Figure 1 plots omitted. Axes: accuracy and Brier score vs. intensity of shift (rotation 15\u00b0\u2013180\u00b0, translation 2\u201314 px); accuracy and number of examples with p(y|x) \u2265 \u03c4 vs. \u03c4; number of examples vs. entropy (nats).]\n\nFigure 2: Calibration under distributional shift: a detailed comparison of accuracy and ECE under\nall types of corruptions on (a) CIFAR-10 and (b) ImageNet. For each method we show the mean on\nthe test set and summarize the results on each intensity of shift with a box plot. Each box shows the\nquartiles summarizing the results across all (16) types of shift while the error bars indicate the min\nand max across different shift types. Additional metrics are shown in Figures S4\n(CIFAR-10) and S5 (ImageNet). Tables for numerical comparisons are provided in Appendix G.\n\n4.2 Image Models: CIFAR-10 and ImageNet\nWe now study the predictive distributions of residual networks (He et al., 2016) trained on two\nbenchmark image datasets, CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009), under\ndistributional shift. 
We use 20-layer and 50-layer ResNets for CIFAR-10 and ImageNet respectively.\nFor shifted data we use 80 different distortions (16 different types with 5 levels of intensity each, see\nAppendix B for illustrations) introduced by Hendrycks & Dietterich (2019). To evaluate predictions\nof CIFAR-10 models on entirely OOD data, we use the SVHN dataset (Netzer et al., 2011).\nFigure 2 summarizes the accuracy and ECE for CIFAR-10 (top) and ImageNet (bottom) across all 80\ncombinations of corruptions and intensities from Hendrycks & Dietterich (2019). Figure 3 inspects\nthe predictive distributions of the models on CIFAR-10 (top) and ImageNet (bottom) for shifted\n(Gaussian blur) and OOD data. Classifiers on both datasets show poorer accuracy and calibration\nwith increasing shift. Comparing accuracy for different methods, we see that ensembles achieve\n\n[Figure 2 plots omitted. Box plots of accuracy and ECE for each method vs. shift intensity (Test, 1\u20135) on (a) CIFAR-10 and (b) ImageNet.]\n\n[Figure 3 panels: (a) CIFAR: Confidence vs Accuracy; (b) CIFAR: Count vs Confidence; (c) CIFAR: Entropy on OOD; (d) ImageNet: Confidence vs Acc; (e) ImageNet: Count vs Confidence; (f) CIFAR: Confidence on OOD]\n\nFigure 3: Results on CIFAR-10 and ImageNet. Left column: 3(a) and 3(d) show accuracy as a\nfunction of confidence. Middle column: 3(b) and 3(e) show the number of examples greater than\ngiven confidence values for Gaussian blur of intensity 3. 
Right column: 3(c) and 3(f) show histogram\nof entropy and con\ufb01dences from CIFAR-trained models on a completely different dataset (SVHN).\n\nhighest accuracy under distributional shift. Comparing the ECE for different methods, we observe\nthat while the methods achieve comparable low values of ECE for small values of shift, ensembles\noutperform the other methods for larger values of shift. To test whether this result is due simply to\nthe larger aggregate capacity of the ensemble, we trained models with double the number of \ufb01lters\nfor the Vanilla and Dropout methods. The higher-capacity models showed no better accuracy or\ncalibration for medium- to high-shift than the corresponding lower-capacity models (see Appendix C).\nIn Figures S8 and S9 we also explore the effect of the number of samples used in dropout, SVI and\nlast layer methods and size of the ensemble, on CIFAR-10. We found that while increasing ensemble\nsize up to 50 did help, most of the gains of ensembling could be achieved with only 5 models.\nInterestingly, while temperature scaling achieves low ECE for low values of shift, the ECE increases\nsigni\ufb01cantly as the shift increases, which indicates that calibration on the i.i.d. validation dataset\ndoes not guarantee calibration under distributional shift. (Note that for ImageNet, we found similar\ntrends considering just the top-5 predicted classes, See Figure S5.) Furthermore, the results show that\nwhile temperature scaling helps signi\ufb01cantly over the vanilla method, ensembles and dropout tend to\nbe better. In Figure 3, we see that ensembles and dropout are more accurate at higher con\ufb01dence.\nHowever, in 3(c) we see that temperature scaling gives the highest entropy on OOD data. Ensembles\nconsistently have high accuracy but also high entropy on OOD data. 
We refer to Appendix C for\nadditional results; Figures S4 and S5 report additional metrics on CIFAR-10 and ImageNet, such as\nBrier score (and its component terms), as well as top-5 error for increasing values of shift.\nOverall, ensembles consistently perform best across metrics and dropout consistently performed\nbetter than temperature scaling and last layer methods. While the relative ordering of methods is\nconsistent on both CIFAR-10 and ImageNet (ensembles perform best), the ordering is quite different\nfrom that on MNIST where SVI performs best. Interestingly, LL-SVI and LL-Dropout perform worse\nthan the vanilla method on shifted datasets as well as SVHN. We also evaluate a variational Gaussian\nprocess as a last layer method in Appendix E but it did not outperform LL-SVI and LL-Dropout.\n\n4.3 Text Models\nFollowing Hendrycks & Gimpel (2017), we train an LSTM (Hochreiter & Schmidhuber, 1997) on\nthe 20newsgroups dataset (Lang, 1995) and assess the model\u2019s robustness under distributional shift\n\n[Figure 3 plots omitted. Axes: accuracy and number of examples with p(y|x) \u2265 \u03c4 vs. \u03c4; number of examples vs. entropy (nats).]\n\n(a) Confidence vs Acc.\n\n(b) Confidence vs Count (c) Confidence vs Accuracy (d) 
Confidence vs Count\n\nFigure 4: Top row: Histograms of the entropy of the predictive distributions for in-distribution (solid\nlines), shifted (dotted lines), and completely different OOD (dashed lines) text examples. Bottom\nrow: Confidence score vs accuracy and count respectively when evaluated for in-distribution and\nshifted text examples (a,b), and in-distribution and OOD text examples (c,d).\n\nand OOD text. We use the even-numbered classes (10 classes out of 20) as in-distribution and the 10\nodd-numbered classes as shifted data. We provide additional details in Appendix A.4.\nWe look at confidence vs accuracy when the test data consists of a mix of in-distribution and either\nshifted or completely OOD data, in this case the One Billion Word Benchmark (LM1B) (Chelba\net al., 2013). Figure 4 (bottom row) shows the results. Ensembles significantly outperform all other\nmethods, and achieve a better trade-off between accuracy and confidence. Surprisingly, LL-Dropout\nand LL-SVI perform worse than the vanilla method, giving higher confidence incorrect predictions,\nespecially when tested on fully OOD data.\nFigure 4 reports histograms of predictive entropy on in-distribution data and compares them to those\nfor the shifted and OOD datasets. This reflects how amenable each method is to abstaining from\nprediction by applying a threshold on the entropy. As expected, most methods achieve the highest\npredictive entropy on the completely OOD dataset, followed by the shifted dataset and then the\nin-distribution test dataset. Only ensembles have consistently higher entropy on the shifted data,\nwhich explains why they perform best on the confidence vs accuracy curves in the second row of\nFigure 4. Compared with the vanilla model, Dropout and LL-SVI have a more distinct separation\nbetween in-distribution and shifted or OOD data. 
While Dropout and LL-Dropout perform similarly\non in-distribution data, LL-Dropout exhibits less uncertainty than Dropout on shifted and OOD data.\nTemperature scaling does not appear to increase uncertainty significantly on the shifted data.\n\n4.4 Ad-Click Model with Categorical Features\nFinally, we evaluate the performance of different methods on the Criteo Display Advertising\nChallenge\u2076 dataset, a binary classification task consisting of 37M examples with 13 numerical and 26\ncategorical features per example. We introduce shift by reassigning each categorical feature to a\nrandom new token with some fixed probability that controls the intensity of shift. This coarsely\nsimulates a type of shift observed in non-stationary categorical features as category tokens appear\nand disappear over time, for example due to hash collisions. The model consists of a 3-hidden-layer\nmulti-layer-perceptron (MLP) with hashed and embedded categorical features and achieves a negative\nlog-likelihood of approximately 0.5 (contest winners achieved 0.44). Due to class imbalance (\u223c25%\nof examples are positive), we report AUC instead of classification accuracy.\nResults from these experiments are depicted in Figure 5. (Figure S7 in Appendix C shows additional\nresults including ECE and Brier score decomposition.) We observe that ensembles are superior\nin terms of both AUC and Brier score for most of the values of shift, with the performance gap\nbetween ensembles and other methods generally increasing as the shift increases. 
Both Dropout\nmodel variants yielded improved AUC on shifted data, and Dropout surpassed ensembles in Brier\n\n6https://www.kaggle.com/c/criteo-display-ad-challenge\n\n8\n\n0.00.51.01.52.02.5EntroSy0.00.51.01.52.02.53.03.5DenVityVDniOODIn-diVt.Skewed22D\u22120.50.00.51.01.52.02.5EntroSy0.00.51.01.52.02.53.03.5LL-SVI0.00.51.01.52.02.5Entropy0.00.51.01.52.02.53.03.5Dropout0.00.51.01.52.02.5Entropy0.00.51.01.52.02.53.03.5LL-Dropout0.00.51.01.52.02.5Entropy0.00.51.01.52.02.53.03.5EnsemEle0.00.51.01.52.02.5EntroSy0.00.51.01.52.02.53.03.5TemS Scaling0.00.10.20.30.40.50.60.70.80.9\u03c4405060708090100AccurDcy on exDmSleV p(y|x)\u2265\u03c49DnLllDLL-69IDroSoutLL-DroSoutEnVemEle7emS 6cDlLng0.00.10.20.30.40.50.60.70.80.9\u03c40200040006000800010000120001umEer oI exDmSleV p(y|x)\u2265\u03c49DnLllDLL-69IDroSoutLL-DroSoutEnVemEle7emS 6cDlLng0.00.10.20.30.40.50.60.70.80.9\u03c4405060708090100AccurDcy on exDmSleV p(y|x)\u2265\u03c49DnLllDLL-69IDroSoutLL-DroSoutEnVemEle7emS 6cDlLng0.00.10.20.30.40.50.60.70.80.9\u03c40200040006000800010000120001umEer oI exDmSleV p(y|x)\u2265\u03c49DnLllDLL-69IDroSoutLL-DroSoutEnVemEle7emS 6cDlLng\fFigure 5: Results on Criteo: The \ufb01rst two plots show degrading AUCs and Brier scores with increasing\nshift while the latter two depict the distribution of prediction con\ufb01dences and their corresponding\naccuracies at 75% randomization of categorical features. SVI is excluded as it performed too poorly.\n\nscore at shift-randomization values above 60%. SVI proved challenging to train, and the resulting\nmodel uniformly performed poorly; LL-SVI fared better but generally did not improve upon the\nvanilla model. 
Strikingly, temperature scaling has a worse Brier score than Vanilla indicating that\npost-hoc calibration on the validation set actually harms calibration under dataset shift.\n\n5 Takeaways and Recommendations\nWe presented a large-scale evaluation of different methods for quantifying predictive uncertainty\nunder dataset shift, across different data modalities and architectures. Our take-home messages are\nthe following:\n\n\u2022 Along with accuracy, the quality of uncertainty consistently degrades with increasing dataset shift\n\nregardless of method.\n\n\u2022 Better calibration and accuracy on the i.i.d. test dataset does not usually translate to better\n\ncalibration under dataset shift (shifted versions as well as completely different OOD data).\n\n\u2022 Post-hoc calibration (on i.i.d validation) with temperature scaling leads to well-calibrated uncer-\ntainty on the i.i.d. test set and small values of shift, but is signi\ufb01cantly outperformed by methods\nthat take epistemic uncertainty into account as the shift increases.\n\n\u2022 Last layer Dropout exhibits less uncertainty on shifted and OOD datasets than Dropout.\n\u2022 SVI is very promising on MNIST/CIFAR but it is dif\ufb01cult to get to work on larger datasets such\n\nas ImageNet and other architectures such as LSTMs.\n\n\u2022 The relative ordering of methods is mostly consistent (except for MNIST) across our experiments.\nThe relative ordering of methods on MNIST is not re\ufb02ective of their ordering on other datasets.\n\u2022 Deep ensembles seem to perform the best across most metrics and be more robust to dataset shift.\nWe found that relatively small ensemble size (e.g. M = 5) may be suf\ufb01cient (Appendix D).\n\u2022 We also compared the set of methods on a real-world challenging genomics problem from Ren\net al. (2019). Our observations were consistent with the other experiments in the paper. 
Deep ensembles performed best, but there remains signi\ufb01cant room for improvement, as with the other experiments in the paper. See Section F for details.\n\nWe hope that this benchmark is useful to the community and inspires more research on uncertainty under dataset shift, which seems challenging for existing methods. While we focused only on the quality of predictive uncertainty, applications may also need to consider the computational and memory costs of the methods; Table S1 in Appendix A.9 discusses these costs, and the best-performing methods tend to be more expensive. Reducing the computational and memory costs while retaining the same performance under dataset shift is also a key research challenge.\n\nAcknowledgements\nWe thank Alexander D\u2019Amour, Jakub \u015awi\u0105tkowski and our reviewers for helpful feedback that improved the manuscript.\n\n[Figure 5 plots: AUC and Brier score on Train, Valid, Test, and 5% to 95% randomization; number of examples and accuracy on examples with p(y|x) \u2265 \u03c4, plotted against \u03c4; methods shown: Vanilla, Dropout, LL-Dropout, LL-SVI, Temp Scaling, Ensemble.]\n\nReferences\nAlemi, A. A., Fischer, I., and Dillon, J. V. Uncertainty in the variational information bottleneck. arXiv preprint arXiv:1807.00906, 2018.\n\nAmodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Man\u00e9, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.\n\nBehrmann, J., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.\n\nBishop, C. M. Novelty Detection and Neural Network Validation. 
IEE Proceedings-Vision, Image and Signal Processing, 141(4):217\u2013222, 1994.\n\nBlundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In ICML, 2015.\n\nBojarski, M., Testa, D. D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., and Zieba, K. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.\n\nBrier, G. W. Veri\ufb01cation of forecasts expressed in terms of probability. Monthly Weather Review, 1950.\n\nBr\u00f6cker, J. Reliability, suf\ufb01ciency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512\u20131519, 2009.\n\nBulatov, Y. NotMNIST dataset, 2011. URL http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html.\n\nChelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., and Robinson, T. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.\n\nDeGroot, M. H. and Fienberg, S. E. The comparison and evaluation of forecasters. The Statistician, 1983.\n\nDeng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Computer Vision and Pattern Recognition, 2009.\n\nEsteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., and Thrun, S. Dermatologist-level classi\ufb01cation of skin cancer with deep neural networks. Nature, 542, 1 2017.\n\nGal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.\n\nGeifman, Y. and El-Yaniv, R. Selective classi\ufb01cation for deep neural networks. In NeurIPS, 2017.\n\nGneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. 
Journal of the American Statistical Association, 102(477):359\u2013378, 2007.\n\nGolovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., and Sculley, D. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487\u20131495. ACM, 2017.\n\nGraves, A. Practical variational inference for neural networks. In NeurIPS, 2011.\n\nGuo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017.\n\nHe, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770\u2013778, 2016.\n\nHendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.\n\nHendrycks, D. and Gimpel, K. A Baseline for Detecting Misclassi\ufb01ed and Out-of-Distribution Examples in Neural Networks. In ICLR, 2017.\n\nHensman, J., Matthews, A., and Ghahramani, Z. Scalable variational Gaussian process classi\ufb01cation. In International Conference on Arti\ufb01cial Intelligence and Statistics. JMLR, 2015.\n\nHern\u00e1ndez-Lobato, J. M. and Adams, R. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In ICML, 2015.\n\nHochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Comput., 9(8):1735\u20131780, November 1997.\n\nKendall, A. and Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS, 2017.\n\nKingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. In ICLR, 2014.\n\nKingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-supervised learning with deep generative models. In NeurIPS, 2014.\n\nKingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. 
In NeurIPS, 2015.\n\nKlambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. Self-normalizing neural networks. In NeurIPS, 2017.\n\nKrizhevsky, A. Learning multiple layers of features from tiny images. 2009.\n\nLakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In NeurIPS, 2017.\n\nLang, K. Newsweeder: Learning to \ufb01lter netnews. In Machine Learning. 1995.\n\nLeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, November 1998.\n\nLee, K., Lee, K., Lee, H., and Shin, J. A simple uni\ufb01ed framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, 2018.\n\nLiang, S., Li, Y., and Srikant, R. Enhancing the Reliability of Out-of-Distribution Image Detection in Neural Networks. In ICLR, 2018.\n\nLipton, Z. C. and Steinhardt, J. Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341, 2018.\n\nLouizos, C. and Welling, M. Structured and ef\ufb01cient variational deep learning with matrix Gaussian posteriors. arXiv preprint arXiv:1603.04733, 2016.\n\nLouizos, C. and Welling, M. Multiplicative Normalizing Flows for Variational Bayesian Neural Networks. In ICML, 2017.\n\nMacKay, D. J. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.\n\nMacKay, D. J. and Gibbs, M. N. Density Networks. Statistics and Neural Networks: Advances at the Interface, 1999.\n\nNaeini, M. P., Cooper, G. F., and Hauskrecht, M. Obtaining Well Calibrated Probabilities Using Bayesian Binning. In AAAI, pp. 2901\u20132907, 2015.\n\nNalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Hybrid models with deep and invertible features. arXiv preprint arXiv:1902.02767, 2019.\n\nNetzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. 
Reading Digits in Natural Images with Unsupervised Feature Learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.\n\nOsband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In NeurIPS, 2016.\n\nPlatt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classi\ufb01ers, pp. 61\u201374. MIT Press, 1999.\n\nQuinonero-Candela, J., Rasmussen, C. E., Sinz, F., Bousquet, O., and Sch\u00f6lkopf, B. Evaluating predictive uncertainty challenge. In Machine Learning Challenges. Springer, 2006.\n\nRahimi, A. and Recht, B. An addendum to alchemy, 2017.\n\nRen, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., DePristo, M. A., Dillon, J. V., and Lakshminarayanan, B. Likelihood ratios for out-of-distribution detection. arXiv preprint arXiv:1906.02845, 2019.\n\nRiquelme, C., Tucker, G., and Snoek, J. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. In ICLR, 2018.\n\nSculley, D., Snoek, J., Wiltschko, A., and Rahimi, A. Winner\u2019s curse? On pace, progress, and empirical rigor. 2018.\n\nShafaei, A., Schmidt, M., and Little, J. J. Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of \u201cOutlier\u201d Detectors. arXiv preprint arXiv:1809.04729, 2018.\n\nSrivastava, R. K., Greff, K., and Schmidhuber, J. Training Very Deep Networks. In NeurIPS, 2015.\n\nSugiyama, M., Lawrence, N. D., Schwaighofer, A., et al. Dataset shift in machine learning. The MIT Press, 2017.\n\nWelling, M. and Teh, Y. W. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In ICML, 2011.\n\nWen, Y., Vicol, P., Ba, J., Tran, D., and Grosse, R. Flipout: Ef\ufb01cient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386, 2018.\n\nWu, A., Nowozin, S., Meeds, E., Turner, R. E., Hern\u00e1ndez-Lobato, J. 
M., and Gaunt, A. L. Deterministic Variational Inference for Robust Bayesian Neural Networks. In ICLR, 2019.\n", "award": [], "sourceid": 7797, "authors": [{"given_name": "Yaniv", "family_name": "Ovadia", "institution": "Princeton University"}, {"given_name": "Emily", "family_name": "Fertig", "institution": "Google Research"}, {"given_name": "Jie", "family_name": "Ren", "institution": "Google Inc."}, {"given_name": "Zachary", "family_name": "Nado", "institution": "Google Inc."}, {"given_name": "D.", "family_name": "Sculley", "institution": "Google Research"}, {"given_name": "Sebastian", "family_name": "Nowozin", "institution": "Google Research Berlin"}, {"given_name": "Joshua", "family_name": "Dillon", "institution": "Google"}, {"given_name": "Balaji", "family_name": "Lakshminarayanan", "institution": "Google DeepMind"}, {"given_name": "Jasper", "family_name": "Snoek", "institution": "Google Brain"}]}