{"title": "Evidential Deep Learning to Quantify Classification Uncertainty", "book": "Advances in Neural Information Processing Systems", "page_first": 3179, "page_last": 3189, "abstract": "Deterministic neural nets have been shown to learn effective predictors on a wide range of machine learning problems. However, as the standard approach is to train the network to minimize a prediction loss, the resultant model remains ignorant to its prediction confidence. Orthogonally to Bayesian neural nets that indirectly infer prediction uncertainty through weight uncertainties, we propose explicit modeling of the same using the theory of subjective logic. By placing a Dirichlet distribution on the class probabilities, we treat predictions of a neural net as subjective opinions and learn the function that collects the evidence leading to these opinions by a deterministic neural net from data. The resultant predictor for a multi-class classification problem is another Dirichlet distribution whose parameters are set by the continuous output of a neural net. We provide a preliminary analysis on how the peculiarities of our new loss function drive improved uncertainty estimation. We observe that our method achieves unprecedented success on detection of out-of-distribution queries and endurance against adversarial perturbations.", "full_text": "Evidential Deep Learning to Quantify Classi\ufb01cation\n\nUncertainty\n\nMurat Sensoy\n\nDepartment of Computer Science\n\nOzyegin University, Turkey\n\nmurat.sensoy@ozyegin.edu.tr\n\nLance Kaplan\n\nUS Army Research Lab\nAdelphi, MD 20783, USA\n\nlkaplan@ieee.org\n\nBosch Center for Arti\ufb01cial Intelligence\n\nRobert-Bosch-Campus 1, 71272 Renningen, Germany\n\nMelih Kandemir\n\nmelih.kandemir@bosch.com\n\nAbstract\n\nDeterministic neural nets have been shown to learn effective predictors on a wide\nrange of machine learning problems. 
However, as the standard approach is to train the network to minimize a prediction loss, the resultant model remains ignorant of its prediction confidence. Orthogonally to Bayesian neural nets, which indirectly infer prediction uncertainty through weight uncertainties, we propose explicit modeling of the same using the theory of subjective logic. By placing a Dirichlet distribution on the class probabilities, we treat the predictions of a neural net as subjective opinions and learn the function that collects the evidence leading to these opinions by a deterministic neural net from data. The resultant predictor for a multi-class classification problem is another Dirichlet distribution whose parameters are set by the continuous output of a neural net. We provide a preliminary analysis of how the peculiarities of our new loss function drive improved uncertainty estimation. We observe that our method achieves unprecedented success on the detection of out-of-distribution queries and endurance against adversarial perturbations.\n\n1 Introduction\n\nThe present decade has commenced with the deep learning approach shaking the machine learning world [20]. New-age deep neural net constructions have exhibited amazing success on nearly all applications of machine learning thanks to recent inventions such as dropout [30], batch normalization [13], and skip connections [11]. Further ramifications that adapt neural nets to particular applications have brought unprecedented prediction accuracies, which in certain cases exceed human-level performance [5, 4]. While one side of the coin is a boost of interest and investment in deep learning research, the other is an emergent need for its robustness, sample efficiency, security, and interpretability.\n\nIn setups where abundant labeled data are available, the capability to achieve sufficiently high accuracy by following a short list of rules of thumb has been taken for granted. 
The major challenges\nof the upcoming era, hence, are likely to lie elsewhere rather than test set accuracy improvement. For\ninstance, is the neural net able to identify data points belonging to an unrelated data distribution? Can\nit simply say \"I do not know\" if we feed in a cat picture after training the net on a set of handwritten\ndigits? Even more critically, can the net protect its users against adversarial attacks? These questions\nhave been addressed by a stream of research on Bayesian Neural Nets (BNNs) [8, 18, 26], which\nestimate prediction uncertainty by approximating the moments of the posterior predictive distribution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThis holistic approach seeks for a solution with a wide set of practical uses besides uncertainty\nestimation, such as automated model selection and enhanced immunity to over\ufb01tting.\nIn this paper, we put our full focus on the uncertainty estimation problem and approach it from a\nTheory of Evidence perspective [7, 14]. We interpret softmax, the standard output of a classi\ufb01cation\nnetwork, as the parameter set of a categorical distribution. By replacing this parameter set with the\nparameters of a Dirichlet density, we represent the predictions of the learner as a distribution over\npossible softmax outputs, rather than the point estimate of a softmax output. In other words, this\ndensity can intuitively be understood as a factory of these point estimates. 
The resultant model has a specific loss function, which is minimized with respect to the neural net weights using standard backprop.\n\nIn a set of experiments, we demonstrate that this technique outperforms state-of-the-art BNNs by a large margin on two applications where high-quality uncertainty modeling is of critical importance. Specifically, the predictive distribution of our model approaches the maximum entropy setting much more closely than BNNs when fed with an input coming from a distribution different from that of the training samples. Figure 1 illustrates how sensibly our method reacts to the rotation of the input digits. As it is not trained to handle rotational invariance, it sharply reduces the classification probabilities and increases the prediction uncertainty after roughly 50\u25e6 of input rotation. The standard softmax keeps reporting high confidence for incorrect classes at high rotation angles. Lastly, we observe that our model is clearly more robust to adversarial attacks on two different benchmark data sets.\n\nAll vectors in this paper are column vectors and are represented in bold face, such as x, where the k-th element is denoted as xk. We use \u2299 to refer to the Hadamard (element-wise) product.\n\n2 Deficiencies of Modeling Class Probabilities with Softmax\n\nThe gold standard for deep neural nets is to use the softmax operator to convert the continuous activations of the output layer to class probabilities. The eventual model can be interpreted as a multinomial distribution whose parameters, hence the discrete class probabilities, are determined by neural net outputs. 
For a K-class classification problem, the likelihood function for an observed tuple (x, y) is\n\nPr(y|x, \u03b8) = Mult(y | \u03c3(f1(x, \u03b8)), . . . , \u03c3(fK(x, \u03b8))),\n\nwhere Mult(\u00b7\u00b7\u00b7) is a multinomial mass function, fj(x, \u03b8) is the jth output channel of an arbitrary neural net f(\u00b7) parametrized by \u03b8, and \u03c3(uj) = e^{uj} / \u2211_{i=1}^{K} e^{ui} is the softmax function. While the continuous neural net is responsible for adjusting the ratio of class probabilities, softmax squashes these ratios into a simplex. The eventual softmax-squashed multinomial likelihood is then maximized with respect to the neural net parameters \u03b8. The equivalent problem of minimizing the negative log-likelihood is preferred for computational convenience,\n\n\u2212 log p(y|x, \u03b8) = \u2212 log \u03c3(fy(x, \u03b8)),\n\nwhich is widely known as the cross-entropy loss. It is noteworthy that the probabilistic interpretation of the cross-entropy loss is mere Maximum Likelihood Estimation (MLE). Being a frequentist technique, MLE is not capable of inferring the predictive distribution variance. Softmax is also notorious for inflating the probability of the predicted class, a consequence of the exponent applied to the neural net outputs. The result is unreliable uncertainty estimates: the output for a newly seen observation carries no information beyond its comparative value against the other classes.\n\nInspired by [9] and [24], on the left side of Figure 1 we demonstrate how LeNet [22] fails to classify an image of the digit 1 from the MNIST dataset when it is continuously rotated in the counter-clockwise direction. As is common to many standardized architectures, LeNet estimates classification probabilities with the softmax function. 
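As a concrete illustration of the inflation effect just described, the following minimal sketch (plain Python, with function names of our choosing) computes softmax probabilities and the cross-entropy loss; note how scaling all logits by a constant drives the winning class probability toward one without any new information:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(u - m) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Negative log-likelihood of the ground-truth class: -log sigma(f_y).
    return -math.log(softmax(logits)[label])

# A logit gap of 1 already yields roughly 0.73 for the winner; scaling all
# logits by 10 inflates it past 0.9999 although the logit *ratio* is similar.
p_small = softmax([5.0, 4.0])
p_big = softmax([50.0, 40.0])
```

The exponent thus turns a modest preference into near-certainty, which is exactly why the raw softmax output is a poor uncertainty measure.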
As the image is rotated, the network fails to classify it correctly; the image is classified as 2 or 5 depending on the degree of rotation. For small rotations, the image is correctly classified as 1 with high probability values. However, when the image is rotated between 60 and 100 degrees, it is classified as 2. The network starts to classify the image as 5 when it is rotated between 110 and 130 degrees. While the classification probability computed using the softmax function is quite high for the misclassified samples (see Figure 1, left panel), the approach proposed in this paper can accurately quantify the uncertainty of its predictions (see Figure 1, right panel).\n\nFigure 1: Classification of the rotated digit 1 (at bottom) at different angles between 0 and 180 degrees. Left: The classification probability is calculated using the softmax function. Right: The classification probability and uncertainty are calculated using the proposed method.\n\n3 Uncertainty and the Theory of Evidence\n\nThe Dempster\u2013Shafer Theory of Evidence (DST) is a generalization of the Bayesian theory to subjective probabilities [7]. It assigns belief masses to subsets of a frame of discernment, which denotes the set of exclusive possible states, e.g., possible class labels for a sample. A belief mass can be assigned to any subset of the frame, including the whole frame itself, which represents the belief that the truth can be any of the possible states, e.g., any class label is equally likely. In other words, by assigning all belief masses to the whole frame, one expresses \u2018I do not know\u2019 as an opinion for the truth over possible states [14]. Subjective Logic (SL) formalizes DST\u2019s notion of belief assignments over a frame of discernment as a Dirichlet Distribution [14]. 
Hence, it allows one to use the principles of evidential theory to quantify belief masses and uncertainty through a well-defined theoretical framework. More specifically, SL considers a frame of K mutually exclusive singletons (e.g., class labels) by providing a belief mass bk for each singleton k = 1, . . . , K and an overall uncertainty mass u. These K + 1 mass values are all non-negative and sum up to one, i.e.,\n\nu + \u2211_{k=1}^{K} bk = 1,\n\nwhere u \u2265 0 and bk \u2265 0 for k = 1, . . . , K. A belief mass bk for a singleton k is computed using the evidence for that singleton. Let ek \u2265 0 be the evidence derived for the kth singleton; then the belief bk and the uncertainty u are computed as\n\nbk = ek / S and u = K / S, (1)\n\nwhere S = \u2211_{i=1}^{K} (ei + 1). Note that the uncertainty is inversely proportional to the total evidence. When there is no evidence, the belief for each singleton is zero and the uncertainty is one. Differently from the Bayesian modeling nomenclature, we term evidence as a measure of the amount of support collected from data in favor of a sample to be classified into a certain class. A belief mass assignment, i.e., a subjective opinion, corresponds to a Dirichlet distribution with parameters \u03b1k = ek + 1. That is, a subjective opinion can be derived easily from the parameters of the corresponding Dirichlet distribution using bk = (\u03b1k \u2212 1)/S, where S = \u2211_{i=1}^{K} \u03b1i is referred to as the Dirichlet strength.\n\nThe output of a standard neural network classifier is a probability assignment over the possible classes for each sample. However, a Dirichlet distribution parametrized over evidence represents the density of each such probability assignment; hence it models second-order probabilities and uncertainty [14]. The Dirichlet distribution is a probability density function (pdf) for possible values of the probability mass function (pmf) p. It is characterized by K parameters \u03b1 = [\u03b11, . . . , \u03b1K] and is given by\n\nD(p|\u03b1) = (1/B(\u03b1)) \u220f_{i=1}^{K} pi^{\u03b1i \u2212 1} for p \u2208 SK, and 0 otherwise,\n\nwhere SK is the K-dimensional unit simplex,\n\nSK = { p | \u2211_{i=1}^{K} pi = 1 and 0 \u2264 p1, . . . , pK \u2264 1 },\n\nand B(\u03b1) is the K-dimensional multinomial beta function [19].\n\nLet us assume that we have b = \u27e80, . . . , 0\u27e9 as the belief mass assignment for a 10-class classification problem. Then the prior distribution for the classification of the image becomes a uniform distribution, i.e., D(p|\u27e81, . . . , 1\u27e9), that is, a Dirichlet distribution whose parameters are all ones. There is no observed evidence, since the belief masses are all zero. This means that the opinion corresponds to the uniform distribution, does not contain any information, and implies total uncertainty, i.e., u = 1. Let the belief masses become b = \u27e80.8, 0, . . . , 0\u27e9 after some training. This means that the total belief in the opinion is 0.8 and the remaining 0.2 is the uncertainty. The Dirichlet strength is calculated as S = 10/0.2 = 50, since K = 10. Hence, the amount of new evidence derived for the first class is computed as 50 \u00d7 0.8 = 40. In this case, the opinion would correspond to the Dirichlet distribution D(p|\u27e841, 1, . . . , 1\u27e9).\n\nGiven an opinion, the expected probability for the kth singleton is the mean of the corresponding Dirichlet distribution and is computed as\n\n\u02c6pk = \u03b1k / S. (2)\n\nWhen an observation about a sample relates it to one of the K attributes, the corresponding Dirichlet parameter is incremented to update the Dirichlet distribution with the new observation. 
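The mapping from evidence to belief masses, uncertainty, and expected probabilities (Equations 1 and 2) can be sketched as follows; the function name is ours, and the second call reproduces the 10-class example above (b = 0.8 for the first class, S = 50, Dirichlet parameters \u27e841, 1, . . . , 1\u27e9):

```python
def opinion_from_evidence(evidence):
    """Map non-negative evidence e_k to belief masses b_k, the uncertainty
    mass u, Dirichlet parameters alpha_k = e_k + 1, and the expected class
    probabilities alpha_k / S (Equations 1 and 2)."""
    K = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    S = sum(alpha)                        # Dirichlet strength
    belief = [(a - 1.0) / S for a in alpha]
    u = K / S                             # uncertainty mass
    p_hat = [a / S for a in alpha]        # expected probabilities
    return belief, u, alpha, p_hat

# No evidence: the uniform Dirichlet, total uncertainty u = 1.
b0, u0, a0, _ = opinion_from_evidence([0.0] * 10)

# The 10-class example from the text: 40 units of evidence for the first
# class gives b = 40/50 = 0.8, u = 10/50 = 0.2, parameters <41, 1, ..., 1>.
b1, u1, a1, _ = opinion_from_evidence([40.0] + [0.0] * 9)
```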
For instance, detection of a specific pattern on an image may contribute to its classification into a specific class. In this case, the Dirichlet parameter corresponding to this class should be incremented. This implies that the parameters of a Dirichlet distribution for the classification of a sample may account for the evidence for each class.\n\nIn this paper, we argue that a neural network is capable of forming opinions for classification tasks as Dirichlet distributions. Let \u03b1i = \u27e8\u03b1i1, . . . , \u03b1iK\u27e9 be the parameters of a Dirichlet distribution for the classification of a sample i; then (\u03b1ij \u2212 1) is the total evidence estimated by the network for the assignment of the sample i to the jth class. Furthermore, given these parameters, the epistemic uncertainty of the classification can easily be computed using Equation 1.\n\n4 Learning to Form Opinions\n\nThe softmax function provides a point estimate for the class probabilities of a sample and does not provide the associated uncertainty. On the other hand, multinomial opinions, or equivalently Dirichlet distributions, can be used to model a probability distribution over the class probabilities. Therefore, in this paper, we design and train neural networks to form their multinomial opinions for the classification of a given sample i as a Dirichlet distribution D(pi|\u03b1i), where pi is a simplex representing class assignment probabilities.\n\nOur neural networks for classification are very similar to classical neural networks. The only difference is that the softmax layer is replaced with an activation layer, e.g., ReLU, to ascertain non-negative output, which is taken as the evidence vector for the predicted Dirichlet distribution. Given a sample i, let f(xi|\u0398) represent the evidence vector predicted by the network for the classification, where \u0398 denotes the network parameters. 
Then, the corresponding Dirichlet distribution has parameters \u03b1i = f(xi|\u0398) + 1. Once the parameters of this distribution are calculated, its mean, i.e., \u03b1i/Si, can be taken as an estimate of the class probabilities.\n\nLet yi be a one-hot vector encoding the ground-truth class of observation xi with yij = 1 and yik = 0 for all k \u2260 j, and let \u03b1i be the parameters of the Dirichlet density on the predictors. First, we can treat D(pi|\u03b1i) as a prior on the likelihood Mult(yi|pi) and obtain the negated logarithm of the marginal likelihood by integrating out the class probabilities:\n\nLi(\u0398) = \u2212 log ( \u222b \u220f_{j=1}^{K} pij^{yij} (1/B(\u03b1i)) \u220f_{j=1}^{K} pij^{\u03b1ij \u2212 1} dpi ) = \u2211_{j=1}^{K} yij ( log(Si) \u2212 log(\u03b1ij) ), (3)\n\nwhich is minimized with respect to the \u03b1i parameters. This technique is well known as Type II Maximum Likelihood.\n\nAlternatively, we can define a loss function and compute its Bayes risk with respect to the class predictor. Note that while the loss in Equation 3 corresponds to the Bayes classifier in the PAC-learning nomenclature, the ones we present below are Gibbs classifiers. For the cross-entropy loss, the Bayes risk reads\n\nLi(\u0398) = \u222b [ \u2211_{j=1}^{K} \u2212yij log(pij) ] (1/B(\u03b1i)) \u220f_{j=1}^{K} pij^{\u03b1ij \u2212 1} dpi = \u2211_{j=1}^{K} yij ( \u03c8(Si) \u2212 \u03c8(\u03b1ij) ), (4)\n\nwhere \u03c8(\u00b7) is the digamma function. 
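For concreteness, the two losses above can be sketched for a single sample as follows; this is a NumPy/SciPy sketch with illustrative names, not the authors' reference implementation:

```python
import numpy as np
from scipy.special import digamma

def type2_ml_loss(alpha, y):
    """Negated log marginal likelihood (Eq. 3): sum_j y_j (log S - log alpha_j)."""
    S = alpha.sum()
    return float((y * (np.log(S) - np.log(alpha))).sum())

def bayes_risk_ce_loss(alpha, y):
    """Bayes risk of the cross-entropy loss (Eq. 4): sum_j y_j (psi(S) - psi(alpha_j))."""
    S = alpha.sum()
    return float((y * (digamma(S) - digamma(alpha))).sum())

alpha = np.array([41.0, 1.0, 1.0])  # strong evidence for the correct class 0
y = np.array([1.0, 0.0, 0.0])       # one-hot ground truth
```

Both losses shrink as evidence accumulates on the ground-truth class, since log(S)/psi(S) and log(alpha_j)/psi(alpha_j) converge as alpha_j approaches S.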
The same approach can also be applied to the sum-of-squares loss ||yi \u2212 pi||_2^2, resulting in\n\nLi(\u0398) = \u222b ||yi \u2212 pi||_2^2 (1/B(\u03b1i)) \u220f_{j=1}^{K} pij^{\u03b1ij \u2212 1} dpi = \u2211_{j=1}^{K} E[ yij^2 \u2212 2 yij pij + pij^2 ] = \u2211_{j=1}^{K} ( yij^2 \u2212 2 yij E[pij] + E[pij^2] ). (5)\n\nAmong the three options presented above, we choose the last based on our empirical findings. We have observed the losses in Equations 3 and 4 to generate excessively high belief masses for classes and to exhibit less stable performance than Equation 5. We leave theoretical investigation of the disadvantages of these alternative options to future work and instead highlight some advantageous theoretical properties of the preferred loss below.\n\nThe first advantage of the loss in Equation 5 is that, using the identity\n\nE[pij^2] = E[pij]^2 + Var(pij),\n\nwe get the following easily interpretable form:\n\nLi(\u0398) = \u2211_{j=1}^{K} ( (yij \u2212 E[pij])^2 + Var(pij) ) = \u2211_{j=1}^{K} ( (yij \u2212 \u03b1ij/Si)^2 + \u03b1ij(Si \u2212 \u03b1ij) / (Si^2 (Si + 1)) ) = \u2211_{j=1}^{K} ( (yij \u2212 \u02c6pij)^2 + \u02c6pij(1 \u2212 \u02c6pij) / (Si + 1) ),\n\nwhere the first term inside the sum is the prediction error L^err_ij and the second is the variance term L^var_ij. By decomposing the first and second moments, the loss aims to achieve the joint goals of minimizing the prediction error and the variance of the Dirichlet experiment generated by the neural net specifically for each sample in the training set. While doing so, it prioritizes data fit over variance estimation, as ensured by the proposition below.\n\nProposition 1. For any \u03b1ij \u2265 1, the inequality L^var_ij < L^err_ij is satisfied.\n\nThe next step towards capturing the behavior of Equation 5 is to investigate whether it has a tendency to fit the data. We assure this property thanks to our next proposition.\n\nProposition 2. For a given sample i with the correct label j, L^err_i decreases when new evidence is added to \u03b1ij and increases when evidence is removed from \u03b1ij.\n\nA good data fit can be achieved by generating arbitrarily many evidences for all classes as long as the ground-truth class is assigned the majority of them. However, in order to perform proper uncertainty modeling, the model also needs to learn variances that reflect the nature of the observations. Therefore, it should generate more evidence when it is more sure of the outcome. In return, it should avoid generating evidence at all for observations it cannot explain. Our next proposition provides a guarantee for this preferable behavior pattern, which is known in the uncertainty modeling literature as learned loss attenuation [16].\n\nProposition 3. For a given sample i with the correct class label j, L^err_i decreases when some evidence is removed from the biggest Dirichlet parameter \u03b1il such that l \u2260 j.\n\nPut together, the above propositions indicate that a neural net with the loss function in Equation 5 is optimized to generate more evidence for the correct class label of each sample and to avoid misclassification by removing excessive misleading evidence. The loss also tends to shrink the variance of its predictions on the training set by increasing evidence, but only when the generated evidence leads to a better data fit. The proofs of all propositions are presented in the appendix.\n\nThe loss over a batch of training samples can be computed by summing the loss for each sample in the batch. 
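A minimal per-sample sketch of the chosen loss (Equation 5) is given below, in both its error-plus-variance form and its raw-moment form; the function names are ours, and the two forms agree numerically by the identity E[p^2] = E[p]^2 + Var(p):

```python
import numpy as np

def edl_mse_loss(alpha, y):
    """Bayes risk of the sum-of-squares loss (Eq. 5), decomposed form:
    sum_j (y_j - p_hat_j)^2 + p_hat_j (1 - p_hat_j) / (S + 1)."""
    S = alpha.sum()
    p_hat = alpha / S
    err = (y - p_hat) ** 2                   # prediction-error term L^err
    var = p_hat * (1.0 - p_hat) / (S + 1.0)  # Dirichlet-variance term L^var
    return float((err + var).sum())

def edl_mse_loss_moments(alpha, y):
    """The same loss from raw moments: sum_j y_j^2 - 2 y_j E[p_j] + E[p_j^2],
    using E[p^2] = E[p]^2 + Var(p) for the Dirichlet mean and variance."""
    S = alpha.sum()
    Ep = alpha / S
    Ep2 = Ep ** 2 + Ep * (1.0 - Ep) / (S + 1.0)
    return float((y ** 2 - 2.0 * y * Ep + Ep2).sum())

alpha = np.array([41.0, 1.0, 1.0])  # evidence mostly for the correct class
y = np.array([1.0, 0.0, 0.0])
```

Consistent with Proposition 2, adding evidence to the correct class (e.g., raising the first parameter from 41 to 101) lowers this loss.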
During training, the model may discover patterns in the data and generate evidence for speci\ufb01c\nclass labels based on these patterns to minimize the overall loss. For instance, the model may discover\nthat the existence of a large circular pattern on MNIST images may lead to evidence for the digit zero.\nThis means that the output for the digit zero, i.e., the evidence for class label 0, should be increased\nwhen such a pattern is observed by the network on a sample. However, when counter examples are\nobserved during training (e.g., a digit six with the same circular pattern), the parameters of the neural\nnetwork should be tuned by back propagation to generate smaller amounts of evidence for this pattern\nand minimize the loss of these samples, as long as the overall loss also decreases. Unfortunately,\nwhen the number of counter-examples is limited, decreasing the magnitude of the generated evidence\nmay increase the overall loss, even though it decreases the loss for the counter-examples. As a result,\nthe neural network may generate some evidence for the incorrect labels. Such misleading evidence for\na sample may not be a problem as long as it is correctly classi\ufb01ed by the network, i.e., the evidence\nfor the correct class label is higher than the evidence for other class labels. However, we prefer the\ntotal evidence to shrink to zero for a sample if it cannot be correctly classi\ufb01ed. Let us note that a\nDirichlet distribution with zero total evidence, i.e., S = K, corresponds to the uniform distribution\nand indicates total uncertainty, i.e., u = 1. We achieve this by incorporating a Kullback-Leibler\n(KL) divergence term into our loss function that regularizes our predictive distribution by penalizing\nthose divergences from the \"I do not know\" state that do not contribute to data \ufb01t. 
The loss with this regularizing term reads\n\nL(\u0398) = \u2211_{i=1}^{N} Li(\u0398) + \u03bbt \u2211_{i=1}^{N} KL[ D(pi|\u02dc\u03b1i) || D(pi|\u27e81, . . . , 1\u27e9) ],\n\nwhere \u03bbt = min(1.0, t/10) \u2208 [0, 1] is the annealing coefficient, t is the index of the current training epoch, D(pi|\u27e81, . . . , 1\u27e9) is the uniform Dirichlet distribution, and lastly \u02dc\u03b1i = yi + (1 \u2212 yi) \u2299 \u03b1i is the Dirichlet parameters after removal of the non-misleading evidence from the predicted parameters \u03b1i for sample i. The KL divergence term in the loss can be calculated as\n\nKL[ D(pi|\u02dc\u03b1i) || D(pi|1) ] = log( \u0393(\u2211_{k=1}^{K} \u02dc\u03b1ik) / ( \u0393(K) \u220f_{k=1}^{K} \u0393(\u02dc\u03b1ik) ) ) + \u2211_{k=1}^{K} (\u02dc\u03b1ik \u2212 1) [ \u03c8(\u02dc\u03b1ik) \u2212 \u03c8(\u2211_{j=1}^{K} \u02dc\u03b1ij) ],\n\nwhere 1 represents the parameter vector of K ones, \u0393(\u00b7) is the gamma function, and \u03c8(\u00b7) is the digamma function. By gradually increasing the effect of the KL divergence in the loss through the annealing coefficient, we allow the neural network to explore the parameter space and avoid premature convergence to the uniform distribution for the misclassified samples, which may be correctly classified in future epochs.\n\n5 Experiments\n\nFor the sake of commensurability, we evaluate our method following the experimental setup studied by Louizos et al. [24]. We use the standard LeNet with ReLU non-linearities as the neural network architecture. 
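The regularizing KL term and annealing schedule described above can be sketched as follows; this is a NumPy/SciPy sketch with illustrative names, and `base_loss` stands in for the data-fit loss of Equation 5:

```python
import numpy as np
from scipy.special import digamma, gammaln

def kl_to_uniform_dirichlet(alpha_tilde):
    """KL[ D(p | alpha_tilde) || D(p | <1,...,1>) ] in closed form."""
    K = alpha_tilde.size
    S = alpha_tilde.sum()
    log_ratio = gammaln(S) - gammaln(float(K)) - gammaln(alpha_tilde).sum()
    correction = ((alpha_tilde - 1.0) * (digamma(alpha_tilde) - digamma(S))).sum()
    return float(log_ratio + correction)

def regularized_loss(alpha, y, epoch, base_loss):
    """Total loss: the data-fit loss plus the annealed KL penalty on the
    misleading evidence; alpha_tilde = y + (1 - y) * alpha removes the
    ground-truth class evidence before the penalty is applied."""
    alpha_tilde = y + (1.0 - y) * alpha
    lam = min(1.0, epoch / 10.0)  # annealing coefficient lambda_t
    return base_loss + lam * kl_to_uniform_dirichlet(alpha_tilde)
```

As a sanity check, the KL term vanishes when `alpha_tilde` is the all-ones vector, i.e., when all remaining (misleading) evidence is zero, and the annealing coefficient disables the penalty entirely at epoch 0.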
All experiments are implemented in TensorFlow [1] and the Adam [17] optimizer has been used with default settings for training.1\n\nIn this section, we compare the following approaches: (a) L2 corresponds to standard deterministic neural nets with softmax output and weight decay, (b) Dropout refers to the uncertainty estimation model used in [8], (c) Deep Ensemble refers to the model used in [21], (d) FFG refers to the Bayesian neural net used in [18] with the additive parametrization [26], (e) MNFG refers to the structured variational inference method used in [24], and (f) EDL is the method we propose.\n\n1 The implementation and a demo application of our model is available under https://muratsensoy.github.io/uncertainty.html\n\nTable 1: Test accuracies (%) for the MNIST and CIFAR5 datasets.\n\nMethod | MNIST | CIFAR5\nL2 | 99.4 | 76\nDropout | 99.5 | 84\nDeep Ensemble | 99.3 | 79\nFFGU | 99.1 | 78\nFFLU | 99.1 | 77\nMNFG | 99.3 | 84\nEDL | 99.3 | 83\n\nFigure 2: The change of accuracy with respect to the uncertainty threshold for EDL.\n\nWe tested these approaches in terms of prediction uncertainty on the MNIST and CIFAR10 datasets. We also compare their performance on adversarial examples generated using the Fast Gradient Sign method [10].\n\n5.1 Predictive Uncertainty Performance\n\nWe trained the LeNet architecture for MNIST using 20 and 50 filters of size 5 \u00d7 5 at the first and second convolutional layers, and 500 hidden units for the fully connected layer. The other methods are also trained using the same architecture with the priors and posteriors described in [24]. The classification performance of each method on the MNIST test set can be seen in Table 1. The table indicates that our approach performs comparably to the competitors. Hence, our extensions for uncertainty estimation do not reduce the model capacity. 
Let us note that the table may be misleading for our approach, since predictions that are totally uncertain (i.e., u = 1.0) are also counted as failures when calculating the overall accuracy; such a prediction with zero evidence implies that the model declines to make a prediction (i.e., it says \"I do not know\"). Figure 2 plots how the test accuracy changes if EDL rejects predictions above a varying uncertainty threshold. It is remarkable that the accuracy of the predictions whose associated uncertainty is below the threshold increases, reaching 1.0, as the uncertainty threshold decreases.\n\nOur approach directly quantifies uncertainty using Equation 1. However, other approaches use entropy to measure the uncertainty of predictions, as described in [24]; i.e., the uncertainty of a prediction is considered to increase as the entropy of the predicted probabilities increases. To be fair, we use the same metric for the evaluation of prediction uncertainty in the rest of the paper, with Equation 2 providing the class probabilities.\n\nIn our first set of evaluations, we train the models on the MNIST train split using the same LeNet architecture and test on the notMNIST dataset, which contains letters instead of digits. Hence, we expect predictions with maximum entropy (i.e., uncertainty). On the left panel of Figure 3, we show the empirical CDFs over the range of possible entropies [0, log(10)] for all models trained on the MNIST dataset. Curves closer to the bottom right corner of the plot, which indicate maximum entropy in all predictions, are desirable [24]. It is clear that the uncertainty estimates of our model are significantly better than those of the baseline methods.\n\nWe have also studied the setup suggested in [24], which uses a subset of the classes in CIFAR10 for training and the rest for out-of-distribution uncertainty testing. 
For fair comparison, we follow the authors and use the large LeNet version, which contains 192 filters at each convolutional layer and has 1000 hidden units in the fully connected layers. For training, we use the samples from the first five categories {dog, frog, horse, ship, truck} in the training set of CIFAR10. The accuracies of the trained models on the test samples from the same categories are shown in Table 1. Figure 2 shows that EDL provides much more accurate predictions as the prediction uncertainty decreases.\n\nFigure 3: Empirical CDF for the entropy of the predictive distributions on the notMNIST dataset (left) and samples from the last five categories of the CIFAR10 dataset (right).\n\nFigure 4: Accuracy and entropy as a function of the adversarial perturbation \u03f5 on the MNIST dataset.\n\nTo evaluate the prediction uncertainties of the models, we tested them on the samples from the last five categories of the CIFAR10 dataset, i.e., {airplane, automobile, bird, cat, deer}. Hence, none of the predictions for these samples is correct, and we expect high uncertainty for the predictions. Our results are shown on the right of Figure 3. The figure indicates that EDL associates much more uncertainty with its predictions than the other methods.\n\n5.2 Accuracy and Uncertainty on Adversarial Examples\n\nWe also evaluated our approach against adversarial examples [10]. For each model trained in the previous experiments, adversarial examples are generated using the Fast Gradient Sign method from the Cleverhans adversarial machine learning library [28], using various values of the adversarial perturbation coefficient \u03f5. These examples are generated using the weights of the models, and it gets harder for the models to make correct predictions as the value of \u03f5 increases. We use the adversarial examples to test the trained models. 
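The Fast Gradient Sign method [10] perturbs each input by one step of size \u03f5 in the sign of the loss gradient with respect to that input. A toy sketch of ours, using a two-feature logistic model rather than LeNet for self-containment:

```python
import numpy as np

def fgsm_perturb(x, grad_x, eps):
    """Fast Gradient Sign method: move eps in the sign of the input
    gradient of the loss, x_adv = x + eps * sign(dL/dx), clipped to [0, 1]."""
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

# Toy target: a logistic model p = sigmoid(w.x) with true label 1,
# so the loss is -log p and its input gradient is -(1 - p) * w.
w = np.array([2.0, -1.0])
x = np.array([0.5, 0.5])
p = 1.0 / (1.0 + np.exp(-w.dot(x)))
grad_x = -(1.0 - p) * w
x_adv = fgsm_perturb(x, grad_x, eps=0.1)
```

By construction, the perturbed input lowers the model's probability for the true label, which is why larger \u03f5 makes correct prediction progressively harder.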
However, the Deep Ensemble model is excluded from this set of experiments for fairness, since it is trained on adversarial examples.\n\nFigure 4 shows the results for the models trained on the MNIST dataset. It demonstrates accuracies on the left panel and uncertainty estimates on the right. Uncertainty is estimated in terms of the ratio of the prediction entropy to the maximum entropy, which is referred to as % max entropy in the figure. Let us note that the maximum entropy is log(10) and log(5) for the MNIST and CIFAR5 datasets, respectively. The figure indicates that Dropout has the highest accuracy for the adversarial examples, as shown on the left panel; however, it is overconfident in all of its predictions, as indicated on the right. That is, it places high confidence on its wrong predictions. In contrast, EDL strikes a good balance between prediction uncertainty and accuracy: it associates very high uncertainty with its wrong predictions. We perform the same experiment on the CIFAR5 dataset. Figure 5 demonstrates the results, which indicate that EDL associates higher uncertainty with the wrong predictions, whereas the other models are overconfident in theirs.\n\nFigure 5: Accuracy and entropy as a function of the adversarial perturbation \u03f5 on the CIFAR5 dataset.\n\n6 Related Work\n\nThe history of learning uncertainty-aware predictors is concurrent with the advent of modern Bayesian approaches to machine learning. A major branch along this line is Gaussian Processes (GPs) [29], which are powerful both in making accurate predictions and in providing reliable measures of the uncertainty of those predictions. Their power in prediction has been demonstrated in different contexts such as transfer learning [15] and deep learning [32]. The value of their uncertainty calculation has set the state of the art in active learning [12]. 
As GPs are non-parametric models, they do not have a notion of deterministic or stochastic model parameters. A significant advantage of GPs in uncertainty modeling is that the variance of their predictions can be calculated in closed form, even though they are capable of fitting a wide spectrum of non-linear prediction functions to data; hence, they are universal predictors [31].

Another line of research in prediction uncertainty modeling is to place prior distributions on model parameters (when the models are parametric), infer the posterior distribution, and account for uncertainty using higher-order moments of the resultant posterior predictive distribution. Bayesian neural networks (BNNs) also fall into this category [25]; they account for parameter uncertainty by placing a prior distribution on the synaptic connection weights. Due to the non-linear activations between consecutive layers, calculation of the resultant posterior on the weights is intractable. Improving approximation techniques tailored specifically to scalable inference in BNNs, such as Variational Bayes (VB) [2, 6, 27, 23, 9] and Stochastic Gradient Hamiltonian Monte Carlo (SG-HMC) [3], is an active research field. Despite their enormous predictive power, the posterior predictive distributions of BNNs cannot be calculated in closed form. The state of the art is to approximate the posterior predictive density by Monte Carlo integration, which introduces significant noise into the uncertainty estimates. Orthogonal to this approach, we bypass inferring the sources of uncertainty on the predictor and directly model a Dirichlet posterior by learning its hyperparameters from data via a deterministic neural net.

7 Conclusions

In this work, we design a predictive distribution for classification by placing a Dirichlet distribution on the class probabilities and assigning neural network outputs to its parameters. 
We \ufb01t this predictive\ndistribution to data by minimizing the Bayes risk with respect to the L2-Norm loss which is regularized\nby an information-theoretic complexity term. The resultant predictor is a Dirichlet distribution on\nclass probabilities, which provides a more detailed uncertainty model than the point estimate of the\nstandard softmax-output deep nets. We interpret the behavior of this predictor from an evidential\nreasoning perspective by building the link from its predictions to the belief mass and uncertainty\ndecomposition of the subjective logic. Our predictor improves the state of the art signi\ufb01cantly in\ntwo uncertainty modeling benchmarks: i) detection of out-of-distribution queries, and ii) endurance\nagainst adversarial perturbations.\n\n9\n\n\fAcknowledgments\n\nThis research was sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of\nDefence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this\ndocument are those of the authors and should not be interpreted as representing the of\ufb01cial policies,\neither expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K.\nMinistry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to\nreproduce and distribute reprints for Government purposes notwithstanding any copyright notation\nhereon. Also, Dr. Sensoy thanks to the U.S. Army Research Laboratory for its support under grant\nW911NF-16-2-0173.\n\nReferences\n\n[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,\nM. Isard, et al. Tensor\ufb02ow: A system for large-scale machine learning. In OSDI, volume 16,\npages 265\u2013283, 2016.\n\n[2] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wiestra. Weight uncertainty in neural\n\nnetworks. In ICML, 2015.\n\n[3] T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML,\n\n2017.\n\n[4] D. Ciresan, A. 
Giusti, L.M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, 2012.

[5] D.C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333–338, 2012.

[6] D.P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In NIPS, 2015.

[7] A.P. Dempster. A generalization of Bayesian inference. In Classic Works of the Dempster-Shafer Theory of Belief Functions, pages 73–104. Springer, 2008.

[8] Y. Gal and Z. Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. ICLR Workshops, 2016.

[9] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.

[10] I.J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[12] N. Houlsby, F. Huszar, Z. Ghahramani, and J.M. Hernández-Lobato. Collaborative Gaussian processes for preference learning. In NIPS, 2012.

[13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[14] A. Josang. Subjective Logic: A Formalism for Reasoning Under Uncertainty. Springer, 2016.

[15] M. Kandemir. Asymmetric transfer learning with deep Gaussian processes. In ICML, 2015.

[16] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, 2017.

[17] D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[18] D.P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In NIPS, 2015.

[19] S. Kotz, N. 
Balakrishnan, and N.L. Johnson. Continuous Multivariate Distributions, volume 1. Wiley, New York, 2000.

[20] A. Krizhevsky, I. Sutskever, and G.E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[21] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2017.

[22] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, 1999.

[23] Y. Li and Y. Gal. Dropout inference in Bayesian neural networks with alpha-divergences. In ICML, 2017.

[24] C. Louizos and M. Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In ICML, 2017.

[25] D.J. MacKay. Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.

[26] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. In ICML, 2017.

[27] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. In ICML, 2017.

[28] N. Papernot, N. Carlini, I. Goodfellow, R. Feinman, F. Faghri, A. Matyasko, K. Hambardzumyan, Y.-L. Juang, A. Kurakin, R. Sheatsley, et al. CleverHans v2.0.0: An adversarial machine learning library. arXiv preprint arXiv:1610.00768, 2016.

[29] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[31] D. Tran, R. Ranganath, and D. Blei. The variational Gaussian process. In ICLR, 2016.

[32] A.G. Wilson. Deep kernel learning. 
In AISTATS, 2016.