{"title": "Deep Gamblers: Learning to Abstain with Portfolio Theory", "book": "Advances in Neural Information Processing Systems", "page_first": 10623, "page_last": 10633, "abstract": "We deal with the selective classification problem (supervised-learning problem with a rejection option), where we want to achieve the best performance at a certain level of coverage of the data. We transform the original $m$-class classification problem to (m+1)-class where the (m+1)-th class represents the model abstaining from making a prediction due to disconfidence. Inspired by portfolio theory, we propose a loss function for the selective classification problem based on the doubling rate of gambling. Minimizing this loss function corresponds naturally to maximizing the return of a horse race, where a player aims to balance between betting on an outcome (making a prediction) when confident and reserving one's winnings (abstaining) when not confident. This loss function allows us to train neural networks and characterize the disconfidence of prediction in an end-to-end fashion. In comparison with previous methods, our method requires almost no modification to the model inference algorithm or model architecture. Experiments show that our method can identify uncertainty in data points, and achieves strong results on SVHN and CIFAR10 at various coverages of the data.", "full_text": "Deep Gamblers:\n\nLearning to Abstain with Portfolio Theory\n\nLiu Ziyin\u2020, Zhikang T. 
Wang†, Paul Pu Liang♭, Ruslan Salakhutdinov♭, Louis-Philippe Morency♢, Masahito Ueda†
♭Machine Learning Department, Carnegie Mellon University
♢Language Technologies Institute, Carnegie Mellon University
†Institute for Physics of Intelligence & Department of Physics, University of Tokyo
{zliu,wang}@cat.phys.s.u-tokyo.ac.jp ueda@phys.s.u-tokyo.ac.jp
{pliang,rsalakhu,morency}@cs.cmu.edu

Abstract

We deal with the selective classification problem (supervised-learning problem with a rejection option), where we want to achieve the best performance at a certain level of coverage of the data. We transform the original $m$-class classification problem to an $(m+1)$-class problem, where the $(m+1)$-th class represents the model abstaining from making a prediction due to disconfidence. Inspired by portfolio theory, we propose a loss function for the selective classification problem based on the doubling rate of gambling. Minimizing this loss function corresponds naturally to maximizing the return of a horse race, where a player aims to balance between betting on an outcome (making a prediction) when confident and reserving one's winnings (abstaining) when not confident. This loss function allows us to train neural networks and characterize the disconfidence of prediction in an end-to-end fashion. In comparison with previous methods, our method requires almost no modification to the model inference algorithm or model architecture. 
Experiments show that our method can identify uncertainty in data points, and achieves strong results on SVHN and CIFAR10 at various coverages of the data.

1 Introduction
With deep learning's unprecedented success in fields such as image classification [21, 18, 24], language understanding [9, 35, 42, 32], and multimodal learning [26, 33], researchers have now begun to apply deep learning to facilitate scientific discovery in fields such as physics [2], biology [38], chemistry [16], and healthcare [20]. However, one important challenge for applications of deep learning to these natural science problems comes from the requirement of assessing the confidence level of predictions. Characterizing the confidence and uncertainty of model predictions is now an active area of research [12], and being able to assess prediction confidence allows us to handpick difficult cases and treat them separately for better performance [13] (e.g., by passing them to a human expert). Moreover, knowing uncertainty is important for fundamental machine learning research [19]; for example, many reinforcement learning algorithms (such as Thompson sampling [40]) require estimating the uncertainty of a distribution [39].
However, there is no well-established, effective and efficient method to assess the prediction uncertainty of deep learning models. We believe that there are four desiderata for any framework that assesses deep learning model uncertainty. Firstly, it must be end-to-end trainable, because end-to-end trainability is important for the accessibility of the method. Secondly, it should require no heavy sampling procedure, because sampling a model or prediction (as in Bayesian methods) hundreds of times is computationally expensive. 
Thirdly, it should not require retraining when different levels of uncertainty are required, because many tasks such as ImageNet [8] and 1 Billion Word [4] require weeks of training, which is too expensive. Lastly, it should not require any modification to existing model architectures, so that we can achieve better flexibility and minimal engineering effort.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Top-10 rejected images in the MNIST testing set found by two methods. The number above each image is the predicted uncertainty score (ours) or the entropy of the prediction (baseline). For the top-2 images, our method chooses images that are hard to recognize, while those of the baseline can be identified unambiguously by a human.

However, most of the methods that currently exist do not meet some of the above criteria. For example, existing Bayesian approaches, in which priors over model parameters are defined and the posteriors are estimated using the Bayes theorem [29, 13, 34], usually rely heavily on sampling to estimate the posterior distribution [13, 34] or on modifying the network architecture via the reparametrization trick [22, 3]. These methods therefore incur computational costs which slow down the training process. This argument also applies to ensembling methods [25]. Selective classification methods [5] offer an alternative approach, but existing ones require modifying model objectives and retraining the model [14, 15]. See Table 1 for a summary of the existing methods; the problems with these methods are discussed in Section 3. In this paper, we follow the selective classification framework (see Section 2), and focus on a setting where we only have a single classifier augmented with a rejection option1. 
Inspired by portfolio theory in mathematical finance [30], we propose a loss function for the selective classification problem that is easy to optimize and requires almost no modification to existing architectures.
2 The Selective Prediction Problem
In this work, we consider a selective prediction problem setting [10]. Let $X$ be the feature space and $Y$ the label space. For example, $X$ could be the distribution of images and $Y$ the distribution of the class labels; our goal is to learn the conditional distribution $P(Y \mid X)$. A prediction model parametrized by weights $w$ is a function $f_w: X \to Y$. The risk of the task w.r.t. a loss function $\ell(\cdot)$ is $E_{P(X,Y)}[\ell(f(x), y)]$, given a dataset $\{(x_i, y_i)\}_{i=1}^N$ of size $N$ where all $(x_i, y_i)$ are independent draws from $X \times Y$. A prediction model augmented with a rejection option is a pair of functions $(f, g)$ such that $g: X \to R$ is a selection function which can be interpreted as a binary qualifier for $f$ as follows:
$(f, g)(x) := \begin{cases} f(x), & \text{if } g(x) \ge h \\ \text{ABSTAIN}, & \text{otherwise} \end{cases}$ (1)
i.e., the model abstains from making a prediction when the selection function $g(x)$ falls below a predetermined threshold $h$. We call $g(x)$ the uncertainty score of $x$; different methods tend to use different $g(x)$. The covered dataset is defined to be $\{x : g(x) \ge h\}$, and the coverage is the ratio of the size of the covered dataset to the size of the original dataset. Clearly, one may trade off coverage for lower risk, and this is the motivation behind rejection option methods.
3 Related Work
Abstention Mechanisms. Here we summarize the existing methods to perform abstention; these are the methods we will compare with in this paper. For a summary of the features of these methods, see Table 1. Entropy Selection (ES): This is the simplest way to output an uncertainty score for a prediction; we compare with it in the qualitative experiments. It simply takes the entropy of the predicted probability as the uncertainty score. Softmax-Response (SR, [14]): This is a simple yet theoretically-guaranteed strong baseline proposed in [14]. It regards the maximum predicted probability as the confidence score; it differs from our work in that it does not involve training an abstention mechanism. Bayes Dropout (BD, [13]): This is a SOTA Bayesian method that offers a way to reject uncertain images [13]. One problem with this method is that one often needs extensive sampling to obtain an accurate estimation of the uncertainty. SelectiveNet (SN, [15]): This is a very recent work that also trains a network to predict its uncertainty, and it is the current SOTA method for the selective prediction problem. The loss function of this method requires an interior point method to optimize, and it depends on the target coverage one wants to achieve.

1 i.e., we do not consider ensembling methods, but we note that such methods can be used together with ours and are likely to increase performance.

Feature | Ours | SR [14] | BD [13] | SN [15]
Simple end-to-end training | ✓ | ✓ | ✓ | ✗
No sampling process required | ✓ | ✓ | ✗ | ✓
No retraining needed for different coverage | ✓ | ✓ | ✓ | ✗
No modification to model architecture | ✓ | ✓ | ✓ | ✗

Table 1: Summary of features of different methods for selective prediction. Our method is end-to-end trainable and does not require sampling, retraining, or architecture modification.

Portfolio Theory and Gambling. Modern Portfolio Theory (MPT) is a method in investment for assembling a portfolio of assets that maximizes the expected return while minimizing the risk [30]. The generalized portfolio theory is a constrained optimization problem in which we seek the maximum expected return under a variance constraint. 
In this work, however, we explore a very limited form of portfolio theory that can be seen as a horse race, as a proof of concept for bridging uncertainty in deep learning and portfolio theory. We focus on the classification problem; we believe that regression problems can similarly be reformulated as general portfolio problems, and we leave this line of research to the future. The connection between portfolio theory, gambling and information theory is studied in [6, 7]. Some of the theoretical arguments presented in this work are based on arguments given in [7].
4 Learning to Abstain with Portfolio Theory
The intuition behind the method is that a deep learning model learning to abstain from prediction indeed mimics a gambler learning to reserve betting in a game. Indeed, we show that if we have an $m$-class classification problem, we can instead perform $(m+1)$-class classification, which predicts the probabilities of the $m$ classes and uses the $(m+1)$-th class as an additional rejection score. This method is similar to [14, 15]; the difference lies in how we learn such a model. We use ideas from portfolio theory, which say that if we have some budget, we should split it between how much we would like to bet and how much to save. In the following sections, we first provide a gentle introduction to portfolio theory, which will provide the mathematical foundations of our method. We then describe how to adapt portfolio theory for classification problems in machine learning and derive our adapted loss function that trains a model to predict a rejection score. 
We finally prove some theoretical properties of our method to show that a classification problem can indeed be seen as a gambling problem, and thus that avoiding a bet in gambling can indeed be interpreted as giving a rejection score.
4.1 A Short Introduction to General Portfolio Theory
To keep the terminology clear, we give a chart of the terms from portfolio theory and their corresponding concepts in deep learning in Table 2. The rows in the dictionary show the correspondences we are going to make in this section. In short, portfolio theory tells us the best way to invest in a stock market. A stock market with $m$ stocks is a vector of positive real numbers $X = (X_1, ..., X_m)$, and we define the price relative $X_i$ as the ratio of the price of stock $i$ at the end of the day to its price at the beginning of the day. For example, $X_i = 0.95$ means that the price of the stock at the end of the day is 0.95 times its price at the beginning of the day. We formulate the price vector as a vector of random variables drawn from a joint distribution $X \sim P(X)$. A portfolio refers to our investment in this stock market, and can be modeled as a discrete distribution $b = (b_1, ..., b_m)$ where $b_i \ge 0$ and $\sum_i b_i = 1$; $b$ is our distribution of wealth over $X$. In this formulation, the wealth relative at the end of the day is $S = b^\top X = \sum_i b_i X_i$; this tells us the ratio of our wealth at the end of the day to our wealth at the beginning of the day.

Portfolio Theory | Deep Learning
Portfolio | Prediction
Doubling Rate | negative NLL loss
Stock/Horse | input data point
Stock Market Outcome | Target Label
Horse Race Outcome | Target Label
Reservation in Gamble | Abstention

Table 2: Portfolio Theory - Deep Learning Dictionary.

Definition 1. The doubling rate of a stock market portfolio $b$ with respect to a stock distribution $P(X)$ is
$W(b, P) = \int \log_2 (b^\top x) \, dP(x)$.
This tells us the speed at which our wealth increases, and we want to maximize $W$. Now we consider a simplified version of portfolio theory called the "horse race".
4.2 Horse Race
Different from a stock market, a horse race has an exclusive outcome (only one horse wins, and it's either win or loss) $x^{(j)} = (0, ..., 0, 1, 0, ..., 0)$, which is a one-hot vector on the $j$-th entry. In a horse race, we want to bet on $m$ horses; the $i$-th horse wins with probability $p_i$, and the payoff is $o_i$ for betting 1 dollar on horse $i$ if it wins, and 0 otherwise. Now the gambler can choose to distribute his wealth over the $m$ horses according to $o$ and $p$; let $b$ denote such a distribution. This corresponds to choosing a portfolio. Again, we require that $b_i \ge 0$ and $\sum_i b_i = 1$. The wealth relative of the gambler at the end of the game will be $S(x^{(j)}) = b_j o_j$ when horse $j$ wins. After $n$ races, our wealth relative would be
$S_n = \prod_{i=1}^n S(x_i)$. (2)
Notice that our relative wealth after $n$ races does not depend on the order of occurrence of the race results (and this will justify our treatment of a batch of samples as races). We can define the doubling rate by changing the integral to a sum:
Definition 2. The doubling rate of a horse race is
$W(b, p) = E[\log_2 S] = \sum_{i=1}^m p_i \log_2 (b_i o_i)$.
As before, we want to maximize the doubling rate. Notice that if we take $o_i = 1$ and $b_i$ to be the post-softmax output of our model, then $W$ is equivalent to the commonly used cross-entropy loss in classification. However, a horse race can be more general, because the gambler can choose to bet with only part of his money and reserve the rest to minimize risk. This means that, in a horse race with reservation, we can bet on $m+1$ categories, where the $(m+1)$-th category denotes reservation with payoff 1. Now the wealth relative after a race becomes $S(x^{(j)}) = b_j o_j + b_{m+1}$, and our objective becomes $\max_b W(b, p)$, where
$W(b, p) = \sum_{i=1}^m p_i \log (b_i o_i + b_{m+1})$. (3)
This is the gambler's loss.
4.3 Classification as a Horse Race
An $m$-class classification task can be seen as finding a function $f: R^n \to R^m$, where $n$ is the input dimension and $m$ is the number of classes. For an output $f(x)$, we assume that it is normalized, and we treat the $j$-th component of $f(\cdot)$ as the probability of input $x$ being labeled as class $j$:
$\Pr(j \mid x) = f(x)_j$. (4)
Now, let us parametrize the function $f$ as a neural network with parameters $w$, whose output is a distribution over the class labels. We want to maximize the log probability of the true label $j$:
$\max E[\log p(j \mid x)] = \max_w E[\log f_w(x)_j]$. (5)
For an $m$-class classification task, we transform it to a horse race with reservation by adding an $(m+1)$-th class, which stands for reservation. The objective function for a mini-batch of size $B$, and for a constant $o$ over all categories, is then (cf. Equation 3)
$\max_f W(b(f), p) = \max_w \sum_{i=1}^B \log \big( f_w(x_i)_{j(i)} \, o + f_w(x_i)_{m+1} \big)$, (6)
where $i$ is the index over the batch, and $j(i)$ is the label of the $i$-th data point. As previously remarked, if $o_j = 1$ for all $j$ and $b_{m+1} = 0$, we recover the standard supervised classification task. Therefore $o$ becomes a hyperparameter: a higher $o$ encourages the network to be confident in inferring, and a lower $o$ makes it less confident. In the next section, we will show that the problem is only meaningful for $1 < o < m$. The selection function $g(\cdot)$ is then $f_w(\cdot)_{m+1}$ (cf. Equation 1), and prediction at different coverages can be achieved by simply calibrating the threshold $h$ on a validation set. 
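The objective of Eq. (6) is straightforward to implement. Below is a minimal NumPy sketch of the gambler's loss (our own illustration, not the authors' released code; a practical PyTorch version would operate on log-probabilities for numerical stability).

```python
import numpy as np

def gamblers_loss(probs, labels, o):
    """Negative of the objective in Eq. (6), averaged over a batch of size B.

    probs  : (B, m+1) normalized model outputs; the last column is the
             reservation bet b_{m+1} (the abstention score).
    labels : (B,) integer class labels j(i).
    o      : uniform payoff hyperparameter; Theorem 3 makes 1 < o < m
             the meaningful range.
    """
    bet_on_label = probs[np.arange(len(labels)), labels]  # b_{j(i)}
    reservation = probs[:, -1]                            # b_{m+1}
    # minimizing this maximizes sum_i log(b_{j(i)} * o + b_{m+1})
    return -np.mean(np.log(bet_on_label * o + reservation))
```

Setting $o = 1$ with a zero reservation column recovers the standard negative log-likelihood, matching the remark after Definition 2; at inference time the rejection score is simply the last output $f_w(x)_{m+1}$.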
Also notice that an advantage of our method over the current SOTA method [15] is that our loss function does not depend on the coverage.
5 Information Theoretic Analysis
In this section, we analyze our formulation theoretically to explain how our method works. In the first theorem, we show that for a horse race without reservation, an optimal solution exists. We then show that, in a setting (gambling with side information) that resembles an actual classification problem, the optimal solution also exists, and it is the same as the optimal solution we expect for a classification problem. The last theorem deals with the possible range of $o$ for a horse race with reservation, followed by a discussion of how we should choose the hyperparameter $o$.
In the problem setting, we considered a gambling problem that is probabilistic in nature. It corresponds to a horse race in which the distribution of winning horses is drawn from a predetermined distribution $P(Y)$ and no other information besides the indices of the horses is given. In this case, we show that the optimal solution should be proportional to $P(Y)$ when no reservation is allowed.
Theorem 1. The optimal doubling rate is given by
$W^*(p) = \sum_i p_i \log o_i - H(p)$, (7)
where $H(p) = -\sum_i p_i \log p_i$ is the entropy of the distribution $p$, and this rate is achieved by proportional gambling $b^* = p$.
This result shows the equivalence between a prediction problem and a gambling problem. In fact, trying to minimize the natural log loss for a classification task is the same as trying to maximize the doubling rate in a gambling problem. However, in practice, we are often in a horse race where some information about the horse is known. For example, in the "MNIST" horse race, one sees a picture and wants to guess its category, i.e., one has access to side information. 
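Theorem 1 is easy to check numerically. The sketch below (ours; the random market is an arbitrary choice, with logs taken base 2 as in Definition 2) verifies that proportional gambling $b^* = p$ attains $W^*(p)$ and that no random portfolio exceeds it.

```python
import numpy as np

def doubling_rate(b, p, o):
    """W(b, p) = sum_i p_i * log2(b_i * o_i), as in Definition 2."""
    return np.sum(p * np.log2(b * o))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))       # true win probabilities of 5 horses
o = rng.uniform(1.0, 5.0, size=5)   # arbitrary payoffs

# Theorem 1: b* = p attains W*(p) = sum_i p_i log2(o_i) - H(p) ...
entropy = -np.sum(p * np.log2(p))
w_star = np.sum(p * np.log2(o)) - entropy
assert np.isclose(doubling_rate(p, p, o), w_star)

# ... and no other portfolio does better (Gibbs' inequality)
for _ in range(1000):
    b = rng.dirichlet(np.ones(5))
    assert doubling_rate(b, p, o) <= w_star + 1e-12
```

The inequality is strict for any $b \neq p$, which is why, without a reservation option, the best a gambler (or classifier) can do is to report the true conditional distribution.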
In the next theorem, we show that in a gambling game with side information, the optimal gambling strategy is obtained by a prediction that maximizes the mutual information between the horse (image) and the outcome (label). This is a classical theorem that can be found in [7]. The proofs are given in the appendix.
Theorem 2. Let $W$ denote the doubling rate defined in Def. 2. For a horse race $Y$ to which some side information $X$ is given, the amount of increase $\Delta W$ is
$\Delta W = I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$. (8)
This shows that the increase in the doubling rate from knowing $X$ is bounded by the mutual information between the two. This means that the neural network, during training, will have to maximize the mutual information between the prediction and the true label of the sample. This shows that an image classification problem is exactly equal to a horse race with side information. However, the next theorem makes our formulation different from a standard classification task, and it can be seen as a generalization of it. We show that, when reservation is allowed, the optimal strategy changes with $o$, the return of winning. In particular, for some range of $o$, only trivial solutions to the gambling problem exist. Since the tasks in this work only deal with situations in which $o$ is uniform across categories, we assume $o$ to be uniform for clarity.
Theorem 3. Let $m$ be the number of horses, let $W$ be defined as in Eq. 3, and let $o_i = o$ for all $i$. Then if $o > m$, the optimal betting always has $b_{m+1} = 0$; if $o < 1$, the optimal betting always has $b_i = 0$ for $i \neq m+1$.
This theorem tells us that when the return from betting is too high ($o > m$), we should always bet, and so the optimal solution is given by Theorem 1; when the return from betting is too low ($o < 1$), we should always reserve. 
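Both regimes of Theorem 3 can be recovered by brute force. The sketch below (our illustration; $m = 2$ horses and the grid resolution are arbitrary choices) maximizes Eq. (3) over a grid on the probability simplex.

```python
import numpy as np
from itertools import product

def w_reserve(b, p, o):
    """Doubling rate with reservation, Eq. (3): sum_i p_i * log(b_i * o + b_{m+1})."""
    with np.errstate(divide='ignore'):  # log(0) -> -inf is a valid (worst) score
        return np.sum(p * np.log(b[:-1] * o + b[-1]))

def best_portfolio(p, o, steps=100):
    """Brute-force maximizer of Eq. (3) over a simplex grid (m = 2 horses)."""
    best_w, best_b = -np.inf, None
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        b = np.array([i, j, steps - i - j]) / steps  # (b_1, b_2, reservation)
        w = w_reserve(b, p, o)
        if w > best_w:
            best_w, best_b = w, b
    return best_b

p = np.array([0.7, 0.3])
assert best_portfolio(p, o=2.5)[-1] == 0.0  # o > m = 2: never reserve, bet b = p
assert best_portfolio(p, o=0.5)[-1] == 1.0  # o < 1: reserve everything
```

For $1 < o < m$ the same search returns an intermediate reservation, which is exactly the regime the gambler's loss exploits for abstention.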
A more realistic situation should have $1 < o < m$, which reflects the fact that, while one might expect to gain in a horse race, the organizer of the game takes a cut of the bets. We discuss the effect of varying $o$ in the appendix. In fact, the optimal rejection score corresponding to a given prediction probability under our method can easily be found using the Kuhn-Tucker conditions without training the network, but we argue that it is not the learned rejection score that is the core of our method; rather, it is that this loss function allows the trained model to learn a qualitatively different and better hidden representation than the baseline model. See Figure 5 (and the appendix).

Figure 2: Output of the network. $h$ is the threshold, and yellow points are rejected points at this level of $h$. Panels: (a) Training Set, (b) $h = 0.999$, (c) $h = 0.99$, (d) $h = 0.9$, (e) $h = 0.5$.

Figure 3: Identifying the outlier distribution. $h$ is chosen to be the largest value such that the outlier cluster is rejected. We see that a network trained with our method rejects the outlier cluster much earlier than the entropy-based method. Panels: (a) Testing Set, (b) entropy based, (c) deep gamblers.

6 Experiments
We begin with some toy experiments to demonstrate that our method can deal with various kinds of uncertain inputs. We then compare the existing methods in selective classification and show that the proposed method is very competitive against the SOTA method. Implementation details are in the appendix.
6.1 Synthetic Gaussian Dataset
In this section, we train a network with 2 hidden layers, each with 50 neurons and tanh activation. For the training set, we generate 2 overlapping diagonal 2d Gaussian distributions, and the task is a simple binary classification. Cluster 1 has mean $(1, 1)$ and unit variance, and cluster 2 has mean $(-1, -1)$ with unit variance. 
The fact that these two clusters are not linearly separable is the first source of uncertainty. A third, out-of-distribution cluster exists only in the testing set, to study how the model deals with out-of-distribution samples. This distribution has mean $(5, -5)$ and variance 0.5. This is the second source of uncertainty. Figure 2(a) shows the training set and Figure 3(a) shows the test set. We gradually decrease the threshold $h$ for the predicted disconfidence score, and label the points above the threshold as rejected. These results are visualized in Figure 2, and we observe that the model correctly identifies the border of the two Gaussian distributions as the uncertain region. We also see that, by lowering the threshold, the width of the uncertain region increases. This shows how we might calibrate the threshold $h$ to control coverage. Now we study how the model deals with out-of-distribution uncertainty. From Figure 3, we see that the entropy-based selection is only able to reject the third cluster when most of the data points are excluded, while our method rejects the outliers equally well as the boundary points.
6.2 Locating the outlier testing images of MNIST
In this section, we show the images that our method finds the most disconfident in MNIST, in comparison with the entropy selection method, in Figure 1. The model is a simple 4-layer CNN. We find that our method seems to outperform the baseline qualitatively. For example, the two least certain images found by the entropy-based method can be labeled by a human unambiguously as a 2 and a 7, while the top-2 images found by our method do not look like images of numbers at all. 
Most figures of this experiment, and plots of how the images change across different epochs, can be found in the appendix.
6.3 Rotating an MNIST image
For illustration, we choose an image of 9 and rotate it up to 180 degrees, because a number 9 looks like a distorted 5 when rotated by 90 degrees and looks like a 6 when rotated by 180, which allows us to analyze the behavior of the model clearly. See Figure 4. We see that the model assesses its disconfidence as we expected, labeling the image as a 9 at the beginning and a 6 at the end, and as a 5 with high uncertainty in an intermediate region. We also notice that the uncertainty score has two peaks, corresponding to the crossing of decision boundaries. This suggests that the model has really learned to assess uncertainty in a subtle and meaningful way (also see Figure 5).

Figure 4: Rotating an image of 9 by 180 degrees. The numbers above the images are the predicted labels of the rotated image.

Coverage | Ours (Best Single Model) | Ours (Best per Coverage) | SR | BD | SN
1.00 | (o=2.6) 3.24 ± 0.09 | − | 3.21 | 3.21 | 3.21
0.95 | (o=2.6) 1.36 ± 0.02 | (o=2.6) 1.36 ± 0.02 | 1.39 | 1.40 | 1.40
0.90 | (o=2.6) 0.76 ± 0.05 | (o=2.6) 0.76 ± 0.05 | 0.89 | 0.90 | 0.82 ± 0.01
0.85 | (o=2.6) 0.57 ± 0.07 | (o=3.6) 0.66 ± 0.01 | 0.70 | 0.71 | 0.60 ± 0.01
0.80 | (o=2.6) 0.51 ± 0.05 | (o=3.6) 0.53 ± 0.04 | 0.61 | 0.61 | 0.53 ± 0.01

Table 3: SVHN. The number is the error percentage on the covered dataset; the lower the better. We see that our method achieved competitive results across all coverages. It is the SOTA method at coverages in (0.85, 1.00).

6.4 Comparison with Existing Methods
In this section, we compare with the SOTA methods in selective classification. The experiment is performed on SVHN [31] (Table 3), CIFAR10 [23] (Table 4) and Cat vs. Dog (Table 5). We follow exactly the experimental setting in [15] to allow for a fair comparison. 
We use a version of VGG16 that is especially optimized for small datasets [27], with batchnorm and dropout. The baselines we compare against are given in Section 3 and summarized in Table 1. A grid search is done over the hyperparameter $o$ with a step size of 0.2. Our best models for a given coverage are chosen using a validation set, which is separated from the test set by a fixed random seed, and the best single model is chosen as the model that achieves the overall best validation accuracy. To report error bars, we estimate the standard deviation using the test errors of the 3 neighbouring hyperparameter values of $o$ in our grid search (e.g., for $o = 6.5$, the results from $o = 6.3, 6.5, 6.7$ are used to compute the variance).
The results for the baselines are cited from [15], and we show the error bar for the contender models when it overlaps or seems to overlap with our confidence interval. We see that our model achieves SOTA on SVHN at all coverages, in the sense that our model starts at full coverage with a slightly lower accuracy but begins to outperform the other contenders from 0.95 coverage onward, meaning that it has learned to identify the hard images better than its contenders. We also perform the experiment on the CIFAR-10 and Cat vs. Dog datasets, and we see that our method achieves very strong results. A small problem for the comparison remains, since our models have different full-coverage performance from the other methods, but a closer look suggests that our method indeed performs better when the coverage is in the range [0.8, 1.0) (by comparing the relative improvements). 
Below 0.8 coverage, the comparison becomes hard, since only few images remain, and methods on different datasets show misleading performance: on Cats vs. Dogs at 0.8 coverage, statistical fluctuation caused the validated best model to be one of the worst models on the test set.

Coverage | Ours (Single Best Model) | Ours (Best per Coverage) | SR | BD | SN
1.00 | (o=2.2) 6.12 ± 0.09 | − | 6.79 | 6.79 | 6.79
0.95 | (o=2.2) 3.49 ± 0.15 | (o=6.0) 3.76 ± 0.12 | 4.55 | 4.58 | 4.16
0.90 | (o=2.2) 2.19 ± 0.12 | (o=6.0) 2.29 ± 0.11 | 2.89 | 2.92 | 2.43
0.85 | (o=2.2) 1.09 ± 0.15 | (o=2.0) 1.24 ± 0.15 | 1.78 | 1.82 | 1.43
0.80 | (o=2.2) 0.66 ± 0.11 | (o=2.2) 0.66 ± 0.11 | 1.05 | 1.08 | 0.86
0.75 | (o=2.2) 0.52 ± 0.03 | (o=2.2) 0.52 ± 0.03 | 0.63 | 0.66 | 0.48 ± 0.02
0.70 | (o=2.2) 0.43 ± 0.07 | (o=2.2) 0.43 ± 0.07 | 0.42 | 0.43 | 0.32 ± 0.01

Table 4: CIFAR10. The number is the error percentage on the covered dataset; the lower the better. We see that the superior performance of our method is seen again for another dataset.

Coverage | Ours (Single Best Model) | Ours (Best per Coverage) | SR | BD | SN
1.00 | (o=2.0) 2.93 ± 0.17 | − | 3.58 | 3.58 | 3.58
0.95 | (o=2.0) 1.23 ± 0.12 | (o=1.4) 0.88 ± 0.38 | 1.91 | 1.92 | 1.62
0.90 | (o=2.0) 0.59 ± 0.13 | (o=2.0) 0.59 ± 0.13 | 1.10 | 1.10 | 0.93
0.85 | (o=2.0) 0.47 ± 0.10 | (o=1.2) 0.24 ± 0.10 | 0.82 | 0.78 | 0.56
0.80 | (o=2.0) 0.46 ± 0.08 | (o=2.0) 0.46 ± 0.08 | 0.68 | 0.55 | 0.35 ± 0.09

Table 5: Cats vs. Dogs. The number is the error percentage on the covered dataset; the lower the better. This dataset is a binary classification, and the input images have a larger resolution.

7 Discussion and Conclusion
In this work, we have proposed an end-to-end method to augment the standard supervised classification problem with a rejection option. 
The proposed method works competitively against the current SOTA [15] but is simpler and more flexible, while outperforming the runner-up SOTA model [14]. We hypothesize that this is because our model has learned a qualitatively better hidden representation of the data. In Figure 5, we plot the t-SNE plots of a regular model and a model trained with our loss function (more plots in the Appendix). We see that, for the baseline, 6 of the clusters of the hidden representation are not easily separable (circled clusters), while a deep gambler model learned a representation with a large margin, which is often associated with superior performance [11, 28, 17].
It seems that there are many possible future directions this work might lead to. One possibility is to use it in scientific fields. For example, neural networks have been used in classifying neutrinos, and if we do classification on a subset of the data but with a higher confidence level, then we can better bound the frequency of neutrino oscillation, which is an important frontier in physics that will help us understand the fundamental laws of the universe [1]. This method also seems to offer a way to interpret how a deep learning model learns. We can show the top rejected data points at different epochs to study which problems the model finds difficult at different stages of training. Two other areas where our method might turn out to be helpful are robustness against adversarial attacks [37] and learning in the presence of label noise [36, 41]. This work also gives a way to incorporate ideas from portfolio theory into deep learning. We hope this work will inspire further research in this direction.
Acknowledgements: Liu Ziyin thanks Mr. Zongping Gong for buying him drinks sometimes during the writing of this paper; he also thanks the GSSS scholarship at the University of Tokyo for supporting his graduate study. Z. T. 
Wang is supported by the Global Science Graduate Course (GSGC) program of the University of Tokyo. This material is based upon work partially supported by the National Science Foundation (Awards #1734868, #1722822) and the National Institutes of Health. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the National Institutes of Health, and no official endorsement should be inferred. This work was also supported by KAKENHI Grant No. JP18H01145 and a Grant-in-Aid for Scientific Research on Innovative Areas "Topological Materials Science" (KAKENHI Grant No. JP15H05855) from the Japan Society for the Promotion of Science.

Figure 5: t-SNE plot of the second-to-last layer output of (a) a normal (baseline) model and (b) a deep gambler model for MNIST. Best viewed in color and zoomed in. The deep gambler model learned a representation that is more separable.

References

[1] K. Abe, Y. Hayato, T. Iida, K. Iyogi, J. Kameda, Y. Koshio, Y. Kozuma, Ll. Marti, M. Miura, S. Moriyama, M. Nakahata, S. Nakayama, Y. Obayashi, H. Sekiya, M. Shiozawa, Y. Suzuki, A. Takeda, Y. Takenaga, K. Ueno, K. Ueshima, S. Yamada, T. Yokozawa, C. Ishihara, H. Kaji, T. Kajita, K. Kaneyuki, K. P. Lee, T. McLachlan, K. Okumura, Y. Shimizu, N. Tanimoto, L. Labarga, E. Kearns, M. Litos, J. L. Raaf, J. L. Stone, L. R. Sulak, M. Goldhaber, K. Bays, W. R. Kropp, S. Mine, C. Regis, A. Renshaw, M. B. Smy, H. W. Sobel, K. S. Ganezer, J. Hill, W. E. Keig, J. S. Jang, J. Y. Kim, I. T. Lim, J. B. Albert, K. Scholberg, C. W. Walter, R. Wendell, T. M. Wongjirad, T. Ishizuka, S. Tasaka, J. G. Learned, S. Matsuno, S. N. Smith, T. Hasegawa, T. Ishida, T. Ishii, T. Kobayashi, T. Nakadaira, K. Nakamura, K. Nishikawa, Y. Oyama, K. Sakashita, T. Sekiguchi, T. Tsukamoto, A. T. Suzuki, Y. Takeuchi, M. Ikeda, A. Minamino, T. Nakaya, Y. Fukuda, Y. Itow, G.
Mitsuka, T. Tanaka, C. K. Jung, G. D. Lopez, I. Taylor, C. Yanagisawa, H. Ishino, A. Kibayashi, S. Mino, T. Mori, M. Sakuda, H. Toyota, Y. Kuno, M. Yoshida, S. B. Kim, B. S. Yang, H. Okazawa, Y. Choi, K. Nishijima, M. Koshiba, M. Yokoyama, Y. Totsuka, K. Martens, J. Schuemann, M. R. Vagins, S. Chen, Y. Heng, Z. Yang, H. Zhang, D. Kielczewska, P. Mijakowski, K. Connolly, M. Dziomba, E. Thrane, and R. J. Wilkes. Evidence for the appearance of atmospheric tau neutrinos in Super-Kamiokande. Phys. Rev. Lett., 110:181802, May 2013.

[2] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.

[3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

[4] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

[5] Chi-Keung Chow. An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, (4):247–254, 1957.

[6] Thomas M. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991.

[7] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006.

[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

[10] Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification.
Journal of Machine Learning Research, 11(May):1605–1641, 2010.

[11] Gamaleldin Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. Large margin deep networks for classification. In Advances in Neural Information Processing Systems, pages 842–852, 2018.

[12] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

[13] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.

[14] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, pages 4878–4887, 2017.

[15] Yonatan Geifman and Ran El-Yaniv. SelectiveNet: A deep neural network with an integrated reject option. arXiv preprint arXiv:1901.09192, 2019.

[16] Garrett B. Goh, Nathan O. Hodas, and Abhinav Vishnu. Deep learning for computational chemistry. Journal of Computational Chemistry, 38(16):1291–1307, 2017.

[17] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.

[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[19] Yotam Hechtlinger, Barnabás Póczos, and Larry Wasserman. Cautious deep learning. arXiv preprint arXiv:1805.09460, 2018.

[20] Geoffrey Hinton. Deep learning—a technology with the potential to transform health care. JAMA, 320(11):1101–1102, 2018.

[21] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.

[22] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[23] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.
CIFAR-10 (Canadian Institute for Advanced Research).

[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[25] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.

[26] Paul Pu Liang, Ziyin Liu, Amir Zadeh, and Louis-Philippe Morency. Multimodal language analysis with recurrent multistage fusion. arXiv preprint arXiv:1808.03920, 2018.

[27] Shuying Liu and Weihong Deng. Very deep convolutional neural network based image classification using small training sample size. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 730–734. IEEE, 2015.

[28] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning, 2016.

[29] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

[30] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.

[31] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.

[32] Graham Neubig. Neural machine translation and sequence-to-sequence models: A tutorial. arXiv preprint arXiv:1703.01619, 2017.

[33] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning.
In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, 2011.

[34] Tim Pearce, Mohamed Zaki, Alexandra Brintrup, and Andy Neel. Uncertainty in neural networks: Bayesian ensembling. arXiv preprint arXiv:1810.05546, 2018.

[35] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

[36] David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.

[37] Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[38] João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8721–8732. Curran Associates, Inc., 2018.

[39] Csaba Szepesvári. Algorithms for reinforcement learning. 2009.

[40] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[41] Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In Advances in Neural Information Processing Systems, pages 5596–5605, 2017.

[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
CoRR, abs/1706.03762, 2017.