{"title": "Boosting with Abstention", "book": "Advances in Neural Information Processing Systems", "page_first": 1660, "page_last": 1668, "abstract": "We present a new boosting algorithm for the key scenario of binary classification with abstention where the algorithm can abstain from predicting the label of a point, at the price of a fixed cost. At each round, our algorithm selects a pair of functions, a base predictor and a base abstention function. We define convex upper bounds for the natural loss function associated to this problem, which we prove to be calibrated with respect to the Bayes solution. Our algorithm benefits from general margin-based learning guarantees which we derive for ensembles of pairs of base predictor and abstention functions, in terms of the Rademacher complexities of the corresponding function classes. We give convergence guarantees for our algorithm along with a linear-time weak-learning algorithm for abstention stumps. We also report the results of several experiments suggesting that our algorithm provides a significant improvement in practice over two confidence-based algorithms.", "full_text": "Boosting with Abstention\n\nCorinna Cortes\nGoogle Research\n\nNew York, NY 10011\ncorinna@google.com\n\nGiulia DeSalvo\nCourant Institute\n\nNew York, NY 10012\ndesalvo@cims.nyu.edu\n\nMehryar Mohri\n\nCourant Institute and Google\n\nNew York, NY 10012\nmohri@cims.nyu.edu\n\nAbstract\n\nWe present a new boosting algorithm for the key scenario of binary classi\ufb01cation\nwith abstention where the algorithm can abstain from predicting the label of a point,\nat the price of a \ufb01xed cost. At each round, our algorithm selects a pair of functions,\na base predictor and a base abstention function. We de\ufb01ne convex upper bounds\nfor the natural loss function associated to this problem, which we prove to be\ncalibrated with respect to the Bayes solution. 
Our algorithm bene\ufb01ts from general\nmargin-based learning guarantees which we derive for ensembles of pairs of base\npredictor and abstention functions, in terms of the Rademacher complexities of the\ncorresponding function classes. We give convergence guarantees for our algorithm\nalong with a linear-time weak-learning algorithm for abstention stumps. We also\nreport the results of several experiments suggesting that our algorithm provides a\nsigni\ufb01cant improvement in practice over two con\ufb01dence-based algorithms.\n\n1\n\nIntroduction\n\nClassi\ufb01cation with abstention is a key learning scenario where the algorithm can abstain from making\na prediction, at the price of incurring a \ufb01xed cost. This is the natural scenario in a variety of common\nand important applications. An example is spoken-dialog applications where the system can redirect\na call to an operator to avoid the cost of incorrectly assigning a category to a spoken utterance and\nmisguiding the dialog manager. This requires the availability of an operator, which incurs a \ufb01xed and\nprede\ufb01ned price. Other examples arise in the design of a search engine or an information extraction\nsystem, where, rather than taking the risk of displaying an irrelevant document, the system can resort\nto the help of a more sophisticated, but more time-consuming classi\ufb01er. More generally, this learning\nscenario arises in a wide range of applications including health, bioinformatics, astronomical event\ndetection, active learning, and many others, where abstention is an acceptable option with some cost.\nClassi\ufb01cation with abstention is thus a highly relevant problem.\nThe standard approach for tackling this problem is via con\ufb01dence-based abstention: a real-valued\nfunction h is learned for the classi\ufb01cation problem and the points x for which its magnitude |h(x)| is\nsmaller than some threshold are rejected. 
Bartlett and Wegkamp [1] gave a theoretical analysis of this approach based on consistency. They introduced a discontinuous loss function taking into account the cost for rejection, upper-bounded that loss by a convex and continuous Double Hinge Loss (DHL) surrogate, and derived an algorithm based on that convex surrogate loss. Their work inspired a series of follow-up papers that developed both the theory and practice behind confidence-based abstention [32, 15, 31]. Further related works can be found in Appendix A.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: The best predictor $h$ is defined by the threshold $\\theta$: $h(x) = x + \\theta$. For $c < \\frac{1}{2}$, the region defined by $X \\le \\eta$ should be rejected. But the corresponding abstention function $r$ defined by $r(x) = x - \\eta$ cannot be defined as $|h(x)| \\le \\gamma$ for any $\\gamma > 0$.\n\nIn this paper, we present a solution to the problem of classification with abstention that radically departs from the confidence-based approach. We introduce a general model where a pair (h, r) for a classifier h and rejection function r are learned simultaneously. Under this novel framework, we present a Boosting-style algorithm with Abstention, BA, that learns accurately the classifier and abstention functions. Note that the terminology of \u201cboosting with abstention\u201d was used by Schapire and Singer [26] to refer to a scenario where a base classifier is allowed to abstain, but where the boosting algorithm itself has to commit to a prediction. This is therefore distinct from the scenario of classification with abstention studied here. 
Nevertheless, we will introduce and examine a confidence-based Two-Step Boosting algorithm, the TSB algorithm, that consists of first training Adaboost and next searching for the best confidence-based abstention threshold.\nThe paper is organized as follows. Section 2 describes our general abstention model which consists of learning a pair (h, r) simultaneously and compares it with confidence-based models. Section 3 presents a series of theoretical results for the problem of learning convex ensembles for classification with abstention, including the introduction of calibrated convex surrogate losses and general data-dependent learning guarantees. In Section 4, we use these learning bounds to design a regularized boosting algorithm. We further prove the convergence of the algorithm and present a linear-time weak-learning algorithm for a natural family of abstention stumps. Finally, in Section 5, we report several experimental results comparing the BA algorithm with the DHL and the TSB algorithms.\n\n2 Preliminaries\n\nIn this section, we first introduce a general model for learning with abstention [7] and then compare it with confidence-based models.\n\n2.1 General abstention model\nWe assume as in standard supervised learning that the training and test points are drawn i.i.d. according to some fixed but unknown distribution $\\mathcal{D}$ over $\\mathcal{X} \\times \\{-1, +1\\}$. We consider the learning scenario of binary classification with abstention. Given an instance $x \\in \\mathcal{X}$, the learner has the option of abstaining from making a prediction for $x$ at the price of incurring a non-negative loss $c(x)$, or otherwise making a prediction $h(x)$ using a predictor $h$ and incurring the standard zero-one loss $\\mathbb{1}_{y h(x) \\le 0}$ where the true label is $y$. Since a random guess achieves an expected cost of at most $\\frac{1}{2}$, rejection only makes sense for $c(x) < \\frac{1}{2}$.\nWe will model the learner by a pair $(h, r)$ where the function $r \\colon \\mathcal{X} \\to \\mathbb{R}$ determines the points $x \\in \\mathcal{X}$ to be rejected according to $r(x) \\le 0$ and where the hypothesis $h \\colon \\mathcal{X} \\to \\mathbb{R}$ predicts labels for non-rejected points via its sign. Extending the loss function considered in Bartlett and Wegkamp [1], the abstention loss for a pair $(h, r)$ is defined as follows for any $(x, y) \\in \\mathcal{X} \\times \\{-1, +1\\}$:\n\n$$L(h, r, x, y) = \\mathbb{1}_{y h(x) \\le 0} \\mathbb{1}_{r(x) > 0} + c(x) \\mathbb{1}_{r(x) \\le 0}. \\tag{1}$$\n\nThe abstention cost $c(x)$ is assumed known to the learner. In the following, we assume that $c$ is a constant function, but part of our analysis is applicable to the more general case.\nWe denote by $\\mathcal{H}$ and $\\mathcal{R}$ two families of functions mapping $\\mathcal{X}$ to $\\mathbb{R}$ and we assume the labeled sample $S = ((x_1, y_1), \\ldots, (x_m, y_m))$ is drawn i.i.d. from $\\mathcal{D}^m$. The learning problem consists of determining a pair $(h, r) \\in \\mathcal{H} \\times \\mathcal{R}$ that admits a small expected abstention loss $R(h, r)$, defined as follows:\n\n$$R(h, r) = \\mathbb{E}_{(x,y) \\sim \\mathcal{D}}\\big[\\mathbb{1}_{y h(x) \\le 0} \\mathbb{1}_{r(x) > 0} + c\\, \\mathbb{1}_{r(x) \\le 0}\\big]. \\tag{2}$$\n\nSimilarly, we define the empirical loss of a pair $(h, r) \\in \\mathcal{H} \\times \\mathcal{R}$ over the sample $S$ by: $\\widehat{R}_S(h, r) = \\mathbb{E}_{(x,y) \\sim S}\\big[\\mathbb{1}_{y h(x) \\le 0} \\mathbb{1}_{r(x) > 0} + c\\, \\mathbb{1}_{r(x) \\le 0}\\big]$, where $(x, y) \\sim S$ indicates that $(x, y)$ is drawn according to the empirical distribution defined by $S$.\n\n2.2 Confidence-based abstention model\nConfidence-based models are a special case of the general model for learning with rejection presented in Section 2.1 corresponding to the pair $(h(x), r(x)) = (h(x), |h(x)| - \\gamma)$, where $\\gamma$ is a parameter that changes the threshold of rejection. This specific choice was based on consistency results shown in [1]. In particular, the Bayes solution $(h^*, r^*)$ of the learning problem, that is where the distribution $\\mathcal{D}$ is known, is given by $h^*(x) = \\eta(x) - \\frac{1}{2}$ and $r^*(x) = |h^*(x)| - (\\frac{1}{2} - c)$ where $\\eta(x) = \\mathbb{P}[Y = +1 | x]$ for any $x \\in \\mathcal{X}$, but note that this is not a unique solution. 
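The abstention loss of Equation (1) and the Bayes pair just stated can be made concrete with a small numerical sketch. The function names abstention_loss and bayes_pair, the toy values of eta and the cost c = 0.25 are illustrative assumptions, not part of the paper.

```python
import numpy as np

def abstention_loss(h_vals, r_vals, y, c):
    # Equation (1): pay the cost c when r(x) <= 0 (rejection),
    # otherwise pay the zero-one loss of the sign of h(x).
    h_vals, r_vals, y = (np.asarray(v, dtype=float) for v in (h_vals, r_vals, y))
    rejected = r_vals <= 0
    misclassified = (y * h_vals <= 0) & ~rejected
    return misclassified.astype(float) + c * rejected.astype(float)

def bayes_pair(eta, c):
    # Bayes solution as stated in the text: h*(x) = eta(x) - 1/2 and
    # r*(x) = |h*(x)| - (1/2 - c).
    eta = np.asarray(eta, dtype=float)
    h_star = eta - 0.5
    r_star = np.abs(h_star) - (0.5 - c)
    return h_star, r_star

# Toy conditional probabilities eta(x) = P[Y = +1 | x] (assumed values).
eta = np.array([0.05, 0.4, 0.6, 0.9])
h_star, r_star = bayes_pair(eta, c=0.25)
accepted = r_star > 0  # points on which the Bayes pair predicts

# Empirical abstention loss of an arbitrary pair on three labeled points.
losses = abstention_loss([1.0, -1.0, 2.0], [1.0, 1.0, -1.0], [1, 1, -1], c=0.3)
empirical_loss = losses.mean()
```

With these toy values, only the two points whose eta is close to 1/2 are rejected, which is the behavior the condition min(eta, 1 - eta) >= c describes.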
The form of $h^*(x)$ follows by a similar reasoning as for the standard binary classification problem. It is straightforward to see that the optimal rejection function $r^*$ is non-positive, meaning a point is rejected, if and only if $\\min\\{\\eta(x), 1 - \\eta(x)\\} \\ge c$. Equivalently, the following holds: $\\max\\{\\eta(x) - \\frac{1}{2}, \\frac{1}{2} - \\eta(x)\\} \\le \\frac{1}{2} - c$ if and only if $|\\eta(x) - \\frac{1}{2}| \\le \\frac{1}{2} - c$ and using the definition of $h^*$, we recover the optimal $r^*$. In light of the Bayes solution, the specific choice of the abstention function $r$ is natural; however, requiring the abstention function $r$ to be defined as $r(x) = |h(x)| - \\gamma$, for some $h \\in \\mathcal{H}$, is in general too restrictive when predictors are selected out of a limited subset $\\mathcal{H}$ of all measurable functions over $\\mathcal{X}$. Consider the example shown in Figure 1 where $\\mathcal{H}$ is a family of linear functions. For this simple case, the optimal abstention region cannot be attained as a function of the best predictor $h$ while it can be achieved by allowing to learn a pair $(h, r)$. Thus, the general model for learning with abstention analyzed in Section 2.1 is both more flexible and more general.\n\n3 Theoretical analysis\n\nThis section presents a theoretical analysis of the problem of learning convex ensembles for classification with abstention. We first introduce general convex surrogate functions for the abstention loss and prove a necessary and sufficient condition based on their parameters for them to be calibrated. 
Next we define the ensemble family we consider and prove general data-dependent learning guarantees for it based on the Rademacher complexities of the base predictor and base rejector sets.\n\n3.1 Convex surrogates\nWe introduce two types of convex surrogate functions for the abstention loss. Observe that the abstention loss $L(h, r, x, y)$ can be equivalently expressed as $L(h, r, x, y) = \\max\\big(\\mathbb{1}_{y h(x) \\le 0} \\mathbb{1}_{-r(x) < 0},\\ c\\, \\mathbb{1}_{r(x) \\le 0}\\big)$. In view of that, since for any $f, g \\in \\mathbb{R}$, $\\max(f, g) = \\frac{f + g + |g - f|}{2} \\ge \\frac{f + g}{2}$, the following inequalities hold for $a > 0$ and $b > 0$:\n\n$$\\begin{aligned} L(h, r, x, y) &= \\max\\big(\\mathbb{1}_{y h(x) \\le 0} \\mathbb{1}_{-r(x) < 0},\\ c\\, \\mathbb{1}_{r(x) \\le 0}\\big) \\\\ &\\le \\max\\big(\\mathbb{1}_{\\max(y h(x), -r(x)) \\le 0},\\ c\\, \\mathbb{1}_{r(x) \\le 0}\\big) \\\\ &\\le \\max\\big(\\mathbb{1}_{\\frac{y h(x) - r(x)}{2} \\le 0},\\ c\\, \\mathbb{1}_{r(x) \\le 0}\\big) \\\\ &= \\max\\big(\\mathbb{1}_{a [y h(x) - r(x)] \\le 0},\\ c\\, \\mathbb{1}_{b\\, r(x) \\le 0}\\big) \\\\ &\\le \\max\\big(\\Phi_1(a [r(x) - y h(x)]),\\ c\\, \\Phi_2(-b\\, r(x))\\big), \\end{aligned}$$\n\nwhere $u \\mapsto \\Phi_1(-u)$ and $u \\mapsto \\Phi_2(-u)$ are two non-increasing convex functions upper-bounding $u \\mapsto \\mathbb{1}_{u \\le 0}$ over $\\mathbb{R}$. Let $L_{\\mathrm{MB}}$ be the convex surrogate defined by the last inequality above:\n\n$$L_{\\mathrm{MB}}(h, r, x, y) = \\max\\big(\\Phi_1(a [r(x) - y h(x)]),\\ c\\, \\Phi_2(-b\\, r(x))\\big). \\tag{3}$$\n\nSince $L_{\\mathrm{MB}}$ is not differentiable everywhere, we upper-bound the convex surrogate $L_{\\mathrm{MB}}$ as follows: $\\max\\big(\\mathbb{1}_{a [y h(x) - r(x)] \\le 0},\\ c\\, \\mathbb{1}_{b\\, r(x) \\le 0}\\big) \\le \\Phi_1(a [r(x) - y h(x)]) + c\\, \\Phi_2(-b\\, r(x))$. Similarly, we let $L_{\\mathrm{SB}}$ denote this convex surrogate:\n\n$$L_{\\mathrm{SB}}(h, r, x, y) = \\Phi_1(a [r(x) - y h(x)]) + c\\, \\Phi_2(-b\\, r(x)). \\tag{4}$$\n\nFigure 2 shows the plots of the convex surrogates $L_{\\mathrm{MB}}$ and $L_{\\mathrm{SB}}$ as well as that of the abstention loss. Let $(h^*_L, r^*_L)$ denote the pair that attains the minimum of the expected loss $\\mathbb{E}_{x,y}(L_{\\mathrm{SB}}(h, r, x, y))$ over all measurable functions for $\\Phi_1(u) = \\Phi_2(u) = \\exp(u)$. In Appendix F, we show that with $\\eta(x) = \\mathbb{P}(Y = +1 | X = x)$, the pair $(h^*_L, r^*_L)$ where $h^*_L = \\frac{1}{2a} \\log\\big(\\frac{\\eta}{1 - \\eta}\\big)$ and $r^*_L = \\frac{1}{a + b} \\log\\Big(\\frac{c b}{2a} \\sqrt{\\frac{1}{\\eta (1 - \\eta)}}\\Big)$ makes $L_{\\mathrm{SB}}$ a calibrated loss, meaning that the sign of the $(h^*_L, r^*_L)$ that minimizes the expected surrogate loss matches the sign of the Bayes classifier $(h^*, r^*)$. 
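As a hedged numerical illustration of this calibration property, the sketch below evaluates the closed-form surrogate minimizers given above on a grid of values of eta and compares their signs with those of the Bayes pair from Section 2.2. The helper names, the grid and the cost c = 0.3 are assumptions made for the example. The ratio b/a = 2 sqrt((1 - c)/c) is the one required by Theorem 1, and the second call uses a = b = 1 only to show a ratio for which the signs disagree.

```python
import numpy as np

def surrogate_minimizers(eta, c, a, b):
    # Closed-form minimizers of the expected L_SB for the exponential loss,
    # as given in the text.
    h_l = np.log(eta / (1.0 - eta)) / (2.0 * a)
    r_l = np.log(c * b / (2.0 * a) * np.sqrt(1.0 / (eta * (1.0 - eta)))) / (a + b)
    return h_l, r_l

def bayes_pair(eta, c):
    # Bayes solution of Section 2.2.
    h_star = eta - 0.5
    return h_star, np.abs(h_star) - (0.5 - c)

def signs_agree(eta, c, a, b):
    # True when the predicted label and the reject decision (r <= 0)
    # coincide with those of the Bayes pair at every grid point.
    h_l, r_l = surrogate_minimizers(eta, c, a, b)
    h_s, r_s = bayes_pair(eta, c)
    same_label = np.sign(h_l) == np.sign(h_s)
    same_reject = (r_l <= 0) == (r_s <= 0)
    return bool(np.all(same_label & same_reject))

c = 0.3
# The grid avoids eta = 1/2, c and 1 - c, where the quantities are exactly zero.
eta = np.linspace(0.005, 0.995, 100)
calibrated = signs_agree(eta, c, a=1.0, b=2.0 * np.sqrt((1.0 - c) / c))
uncalibrated = signs_agree(eta, c, a=1.0, b=1.0)
```

With the calibrated ratio, the surrogate rejects exactly when eta (1 - eta) >= c (1 - c), which is the Bayes rejection region. With a = b = 1 the surrogate rejects on a much wider interval of eta.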
More precisely, the following holds.\nTheorem 1 (Calibration of convex surrogate). For $a > 0$ and $b > 0$, the $\\inf_{(h,r)} \\mathbb{E}_{(x,y)}[L(h, r, x, y)]$ is attained at $(h^*_L, r^*_L)$ such that $\\operatorname{sign}(h^*) = \\operatorname{sign}(h^*_L)$ and $\\operatorname{sign}(r^*) = \\operatorname{sign}(r^*_L)$ if and only if $b/a = 2\\sqrt{(1 - c)/c}$.\n\nFigure 2: The left figure is a plot of the abstention loss. The middle figure is a plot of the surrogate function $L_{\\mathrm{MB}}$ while the right figure is a plot of the surrogate loss $L_{\\mathrm{SB}}$ both for $c = 0.45$.\n\nThe theorem shows that the classification and rejection solution obtained by minimizing the surrogate loss for that choice of $(a, b)$ coincides with the one obtained using the original loss. In the following, we make the explicit choice of $a = 1$ and $b = 2\\sqrt{(1 - c)/c}$ for the loss $L_{\\mathrm{SB}}$ to be calibrated.\n\n3.2 Learning guarantees for ensembles in classification with abstention\nIn the standard scenario of classification, it is often easy to come up with simple base classifiers that may abstain. As an example, a simple rule could classify a message as spam based on the presence of some word, as ham in the presence of some other word, and just abstain in the absence of both, as in the boosting with abstention algorithm by Schapire and Singer [26]. Our objective is to learn ensembles of such base hypotheses to create accurate solutions for classification with abstention.\nOur ensemble functions are based on the framework described in Section 2.1. Let $\\mathcal{H}$ and $\\mathcal{R}$ be two families of functions mapping $\\mathcal{X}$ to $[-1, 1]$. The ensemble family $\\mathcal{F}$ that we consider is then the convex hull of $\\mathcal{H} \\times \\mathcal{R}$:\n\n$$\\mathcal{F} = \\Big\\{ \\Big( \\sum_{t=1}^{T} \\alpha_t h_t,\\ \\sum_{t=1}^{T} \\alpha_t r_t \\Big) : T \\ge 1,\\ \\alpha_t \\ge 0,\\ \\sum_{t=1}^{T} \\alpha_t = 1,\\ h_t \\in \\mathcal{H},\\ r_t \\in \\mathcal{R} \\Big\\}. \\tag{5}$$\n\nThus, $(h, r) \\in \\mathcal{F}$ abstains on input $x \\in \\mathcal{X}$ when $r(x) \\le 0$ and predicts the label $\\operatorname{sign}(h(x))$ otherwise.\nLet $u \\mapsto \\Phi_1(-u)$ and $u \\mapsto \\Phi_2(-u)$ be two strictly decreasing differentiable convex functions upper-bounding $u \\mapsto \\mathbb{1}_{u \\le 0}$ over $\\mathbb{R}$. For calibration constants $a, b > 0$, and cost $c > 0$, we assume that there exist $u$ and $v$ such that $\\Phi_1(a u) < 1$ and $c\\, \\Phi_2(v) < 1$, otherwise the surrogate would not be useful. Let $\\Phi_1^{-1}$ and $\\Phi_2^{-1}$ be the inverse functions, which always exist since $\\Phi_1$ and $\\Phi_2$ are strictly monotone. We will use the following definitions: $C_1 = 2a\\, \\Phi_1'\\big(\\Phi_1^{-1}(1)\\big)$ and $C_2 = 2cb\\, \\Phi_2'\\big(\\Phi_2^{-1}(1/c)\\big)$. Observe that for $\\Phi_1(u) = \\Phi_2(u) = \\exp(u)$, we simply have $C_1 = 2a$ and $C_2 = 2b$.\nTheorem 2. Let $\\mathcal{H}$ and $\\mathcal{R}$ be two families of functions mapping $\\mathcal{X}$ to $\\mathbb{R}$. Assume $N > 1$. Then, for any $\\delta > 0$, with probability at least $1 - \\delta$ over the draw of a sample $S$ of size $m$ from $\\mathcal{D}$, the following holds for all $(h, r) \\in \\mathcal{F}$:\n\n$$R(h, r) \\le \\mathbb{E}_{(x,y) \\sim S}[L_{\\mathrm{MB}}(h, r, x, y)] + C_1 \\mathfrak{R}_m(\\mathcal{H}) + (C_1 + C_2) \\mathfrak{R}_m(\\mathcal{R}) + \\sqrt{\\frac{\\log 1/\\delta}{2m}}.$$\n\nThe proof is given in Appendix C. The theorem gives effective learning guarantees for ensemble pairs $(h, r) \\in \\mathcal{F}$ when the base predictor and abstention functions admit favorable Rademacher complexities. In earlier work [7], we present a learning bound for a different type of surrogate losses which can also be extended to hold for ensembles.\nNext, we derive margin-based guarantees in the case where $\\Phi_1(u) = \\Phi_2(u) = \\exp(u)$. For any $\\rho > 0$, the margin-losses associated to $L_{\\mathrm{MB}}$ and $L_{\\mathrm{SB}}$ are denoted by $L^{\\rho}_{\\mathrm{MB}}$ and $L^{\\rho}_{\\mathrm{SB}}$ and defined for all $(h, r) \\in \\mathcal{F}$ and $(x, y) \\in \\mathcal{X} \\times \\{-1, +1\\}$ by\n\n$$L^{\\rho}_{\\mathrm{MB}}(h, r, x, y) = L_{\\mathrm{MB}}(h/\\rho, r/\\rho, x, y) \\quad \\text{and} \\quad L^{\\rho}_{\\mathrm{SB}}(h, r, x, y) = L_{\\mathrm{SB}}(h/\\rho, r/\\rho, x, y).$$\n\nTheorem 2 applied to this margin-based loss results in the following corollary.\nCorollary 3. Assume $N > 1$ and fix $\\rho > 0$. Then, for any $\\delta > 0$, with probability at least $1 - \\delta$ over the draw of an i.i.d. 
sample $S$ of size $m$ from $\\mathcal{D}$, the following holds for all $(h, r) \\in \\mathcal{F}$:\n\n$$R(h, r) \\le \\mathbb{E}_{(x,y) \\sim S}[L^{\\rho}_{\\mathrm{MB}}(h, r, x, y)] + \\frac{2a}{\\rho} \\mathfrak{R}_m(\\mathcal{H}) + \\frac{2(a + b)}{\\rho} \\mathfrak{R}_m(\\mathcal{R}) + \\sqrt{\\frac{\\log 1/\\delta}{2m}}.$$\n\nBA($S = ((x_1, y_1), \\ldots, (x_m, y_m))$)\n  for $i \\leftarrow 1$ to $m$ do\n    $D_1(i, 1) \\leftarrow \\frac{1}{2m}$; $D_1(i, 2) \\leftarrow \\frac{1}{2m}$\n  for $t \\leftarrow 1$ to $T$ do\n    $Z_{1,t} \\leftarrow \\sum_{i=1}^{m} D_t(i, 1)$; $Z_{2,t} \\leftarrow \\sum_{i=1}^{m} D_t(i, 2)$\n    $k \\leftarrow \\operatorname{argmin}_{j \\in [1, N]} 2 Z_{1,t} \\epsilon_{t,j} + Z_{1,t} r_{j,1} - 2\\sqrt{c(1 - c)}\\, Z_{2,t} r_{j,2}$   (Direction)\n    $Z \\leftarrow Z_{1,t}\\big(\\epsilon_{t,k} + \\frac{r_{k,1}}{2}\\big) - 2\\sqrt{c(1 - c)}\\, Z_{2,t} \\frac{r_{k,2}}{2}$\n    if $(Z_{1,t} - Z) e^{\\alpha_{t-1,k}} - Z e^{-\\alpha_{t-1,k}} < \\frac{m}{Z_t} \\beta$ then\n      $\\eta_t \\leftarrow -\\alpha_{t-1,k}$   (Step)\n    else $\\eta_t \\leftarrow \\log\\Big[ -\\frac{m \\beta}{2 Z_t Z} + \\sqrt{\\big[\\frac{m \\beta}{2 Z_t Z}\\big]^2 + \\frac{Z_{1,t}}{Z} - 1} \\Big]$   (Step)\n    $\\alpha_t \\leftarrow \\alpha_{t-1} + \\eta_t e_k$\n    $r_t \\leftarrow \\sum_{j=1}^{N} \\alpha_j r_j$\n    $h_t \\leftarrow \\sum_{j=1}^{N} \\alpha_j h_j$\n    $Z_{t+1} \\leftarrow \\sum_{i=1}^{m} \\Phi'\\big(r_t(x_i) - y_i h_t(x_i)\\big) + \\Phi'\\big(-2\\sqrt{\\tfrac{1 - c}{c}}\\, r_t(x_i)\\big)$\n    for $i \\leftarrow 1$ to $m$ do\n      $D_{t+1}(i, 1) \\leftarrow \\frac{\\Phi'(r_t(x_i) - y_i h_t(x_i))}{Z_{t+1}}$; $D_{t+1}(i, 2) \\leftarrow \\frac{\\Phi'(-2\\sqrt{(1 - c)/c}\\, r_t(x_i))}{Z_{t+1}}$\n  $(h, r) \\leftarrow \\sum_{j=1}^{N} \\alpha_{T,j} (h_j, r_j)$\n  return $(h, r)$\n\nFigure 3: Pseudocode of the BA algorithm for both the exponential loss with $\\Phi_1(u) = \\Phi_2(u) = \\exp(u)$ as well as for the logistic loss with $\\Phi_1(u) = \\Phi_2(u) = \\log_2(1 + e^u)$. The parameters include the cost of rejection $c$ and $\\beta$ determining the strength of the $\\alpha$-constraint for the L1 regularization. The definition of the weighted errors $\\epsilon_{t,k}$ as well as the expected rejections, $r_{k,1}$ and $r_{k,2}$, are given in Equation 7. For other surrogate losses, the step size $\\eta_t$ is found via a line search or other numerical methods by solving $\\operatorname{argmin}_{\\eta} F(\\alpha_{t-1} + \\eta e_k)$.\n\nThe bound of Corollary 3 applies similarly to $L^{\\rho}_{\\mathrm{SB}}$ since it is an upper bound on $L^{\\rho}_{\\mathrm{MB}}$. It can further be shown to hold uniformly for all $\\rho \\in (0, 1)$ at the price of a term in $O\\Big(\\sqrt{\\frac{\\log \\log 1/\\rho}{m}}\\Big)$ using standard techniques [16, 22] (see Appendix C).\n\n4 Boosting algorithm\n\nHere, we derive a boosting-style algorithm (BA algorithm) for learning an ensemble with the option of abstention for both losses $L_{\\mathrm{MB}}$ and $L_{\\mathrm{SB}}$. Below, we describe the algorithm for $L_{\\mathrm{SB}}$ and refer the reader to Appendix H for the version using the loss $L_{\\mathrm{MB}}$.\n\n4.1 Objective function\n\nThe BA algorithm solves a convex optimization problem that is based on Corollary 3 for loss $L_{\\mathrm{SB}}$. Since the last three terms of the right-hand side of the bound of the corollary do not depend on $\\alpha$, this suggests to select $\\alpha$ as the solution of $\\min_{\\alpha \\in \\Delta} \\frac{1}{m} \\sum_{i=1}^{m} L^{\\rho}_{\\mathrm{SB}}(h, r, x_i, y_i)$. Via a change of variable $\\alpha \\leftarrow \\alpha/\\rho$ that does not affect the optimization problem, we can equivalently search for $\\min_{\\alpha \\ge 0} \\frac{1}{m} \\sum_{i=1}^{m} L_{\\mathrm{SB}}(h, r, x_i, y_i)$ such that $\\sum_{t=1}^{T} \\alpha_t \\le 1/\\rho$. Introducing the Lagrange variable $\\beta$ associated to the constraint $\\sum_{t=1}^{T} \\alpha_t \\le 1/\\rho$, the problem can be rewritten as: $\\min_{\\alpha \\ge 0} \\frac{1}{m} \\sum_{i=1}^{m} L_{\\mathrm{SB}}(h, r, x_i, y_i) + \\beta \\sum_{t=1}^{T} \\alpha_t$. Letting $\\{(h_1, r_1), \\ldots, (h_N, r_N)\\}$ be the set of base function pairs for the classifier and rejection function, we can rewrite the optimization problem as the minimization over $\\alpha \\ge 0$ of\n\n$$\\frac{1}{m} \\sum_{i=1}^{m} \\Big[ \\Phi\\Big( \\sum_{j=1}^{N} \\alpha_j r_j(x_i) - y_i \\sum_{j=1}^{N} \\alpha_j h_j(x_i) \\Big) + c\\, \\Phi\\Big( -b \\sum_{j=1}^{N} \\alpha_j r_j(x_i) \\Big) \\Big] + \\beta \\sum_{j=1}^{N} \\alpha_j.$$\n\nThus, the following is the objective function of our optimization problem:\n\n$$F(\\alpha) = \\frac{1}{m} \\sum_{i=1}^{m} \\Big[ \\Phi\\big(r_t(x_i) - y_i h_t(x_i)\\big) + c\\, \\Phi\\big(-b\\, r_t(x_i)\\big) \\Big] + \\beta \\sum_{j=1}^{N} \\alpha_j. \\tag{6}$$\n\n4.2 Projected coordinate descent\nThe problem $\\min_{\\alpha \\ge 0} F(\\alpha)$ is a convex optimization problem, which we solve via projected coordinate descent. Let $e_k$ be the $k$th unit vector in $\\mathbb{R}^N$ and let $F'(\\alpha, e_j)$ be the directional derivative of $F$ along the direction $e_j$ at $\\alpha$. The algorithm consists of the following three steps. 
First, it determines the direction of maximal descent by $k = \\operatorname{argmax}_{j \\in [1, N]} |F'(\\alpha_{t-1}, e_j)|$. Second, it calculates the best step $\\eta$ along the direction that preserves non-negativity of $\\alpha$ by $\\eta = \\operatorname{argmin}_{\\alpha_{t-1} + \\eta e_k \\ge 0} F(\\alpha_{t-1} + \\eta e_k)$. Third, it updates $\\alpha_{t-1}$ to $\\alpha_t = \\alpha_{t-1} + \\eta e_k$.\nThe pseudocode of the BA algorithm is given in Figure 3. The step and direction are based on $F'(\\alpha_{t-1}, e_j)$. For any $t \\in [1, T]$, define a distribution $D_t$ over the pairs $(i, n)$, with $n$ in $\\{1, 2\\}$\n\n$$D_t(i, 1) = \\frac{\\Phi'\\big(r_{t-1}(x_i) - y_i h_{t-1}(x_i)\\big)}{Z_t} \\quad \\text{and} \\quad D_t(i, 2) = \\frac{\\Phi'\\big(-b\\, r_{t-1}(x_i)\\big)}{Z_t},$$\n\nwhere $Z_t$ is the normalization factor given by $Z_t = \\sum_{i=1}^{m} \\Phi'\\big(r_{t-1}(x_i) - y_i h_{t-1}(x_i)\\big) + \\Phi'\\big(-b\\, r_{t-1}(x_i)\\big)$. In order to derive an explicit formulation of the descent direction that is based on the weighted error of the classification function $h_j$ and the expected value of the rejection function $r_j$, we use the distributions $D_{1,t}$ and $D_{2,t}$ defined by $D_t(i, 1)/Z_{1,t}$ and $D_t(i, 2)/Z_{2,t}$ where $Z_{1,t} = \\sum_{i=1}^{m} D_t(i, 1)$ and $Z_{2,t} = \\sum_{i=1}^{m} D_t(i, 2)$ are the normalization factors. Now, for any $j \\in [1, N]$ and $t \\in [1, T]$, we can define the weighted error $\\epsilon_{t,j}$ and the expected value of the rejection function, $r_{j,1}$ and $r_{j,2}$, over distribution $D_{1,t}$ and $D_{2,t}$ as follows:\n\n$$\\epsilon_{t,j} = \\frac{1}{2}\\Big[1 - \\mathbb{E}_{i \\sim D_{1,t}}[y_i h_j(x_i)]\\Big], \\quad r_{j,1} = \\mathbb{E}_{i \\sim D_{1,t}}[r_j(x_i)], \\quad \\text{and} \\quad r_{j,2} = \\mathbb{E}_{i \\sim D_{2,t}}[r_j(x_i)]. \\tag{7}$$\n\nUsing these definitions, we show (see Appendix D) that the descent direction is given by\n\n$$k = \\operatorname{argmin}_{j \\in [1, N]} 2 Z_{1,t} \\epsilon_{t,j} + Z_{1,t} r_{j,1} - 2\\sqrt{c(1 - c)}\\, Z_{2,t} r_{j,2}.$$\n\nThis equation shows that $Z_{1,t}$ and $2\\sqrt{c(1 - c)}\\, Z_{2,t}$ re-scale the weighted error and expected rejection. Thus, finding the best descent direction by minimizing this equation is equivalent to finding the best scaled trade-off between the misclassification error and the average rejection cost. 
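The direction-selection rule above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the exponential loss (so that the derivative of the loss is again the exponential), the calibrated choice a = 1 and b = 2 sqrt((1 - c)/c), and it takes the second distribution to be the normalized second block of weights D_t(i, 2). The array names H, R, h_prev and r_prev are hypothetical: row j of H and R holds the values of the base pair (h_j, r_j) on the sample, while h_prev and r_prev hold the values of the current ensemble.

```python
import numpy as np

def descent_direction(H, R, y, h_prev, r_prev, c):
    # Calibrated constant b for a = 1 (Section 3.1).
    b = 2.0 * np.sqrt((1.0 - c) / c)
    # Unnormalized weights; for the exponential loss the derivative is exp.
    w1 = np.exp(r_prev - y * h_prev)
    w2 = np.exp(-b * r_prev)
    z_t = w1.sum() + w2.sum()
    d1, d2 = w1 / z_t, w2 / z_t        # D_t(i, 1) and D_t(i, 2)
    z1, z2 = d1.sum(), d2.sum()        # Z_{1,t} and Z_{2,t}
    p1, p2 = d1 / z1, d2 / z2          # distributions D_{1,t} and D_{2,t}
    # Equation (7): weighted errors and expected rejection values.
    eps = 0.5 * (1.0 - (H * y) @ p1)
    r1 = R @ p1
    r2 = R @ p2
    # Score minimized by the selected base pair k.
    scores = 2.0 * z1 * eps + z1 * r1 - 2.0 * np.sqrt(c * (1.0 - c)) * z2 * r2
    return int(np.argmin(scores)), scores

# Toy sample with four points and two base pairs that never reject:
# pair 0 predicts every label correctly, pair 1 predicts every label wrongly.
y = np.array([1.0, 1.0, -1.0, -1.0])
H = np.vstack([y, -y])
R = np.ones((2, 4))
k, scores = descent_direction(H, R, y, np.zeros(4), np.zeros(4), c=0.2)
```

On this toy input the rule selects the perfect predictor, since its weighted error is zero while both pairs have the same expected rejection values.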
The step size can in general be found via line search or other numerical methods, but we have derived a closed-form solution of the step size for both the exponential and logistic loss (see Appendix D.2). Further details of the derivation of projected coordinate descent on $F$ are also given in Appendix D.\nNote that for $r_t \\to 0^+$ in Equation 6, that is when the rejection terms are dropped in the objective, we retrieve the L1-regularized Adaboost. As for Adaboost, we can define a weak learning assumption which requires that the directional derivative along at least one base pair be non-zero. For $\\beta = 0$, it does not hold when for all $j$: $2\\epsilon_{t,j} - 1 = -r_{j,1} + \\frac{2\\sqrt{c(1 - c)}\\, Z_{2,t}}{Z_{1,t}} r_{j,2}$, which corresponds to a balance between the edge and rejection costs for all $j$. Observe that in the particular case when the rejection functions are zero, it coincides with the standard weak learning assumption for Adaboost ($\\epsilon_{t,j} = \\frac{1}{2}$ for all $j$).\nThe following theorem provides the convergence of the projected coordinate descent algorithm for our objective function, $F(\\alpha)$. The proof is given in Appendix E.\nTheorem 4. Assume that $\\Phi$ is twice differentiable and that $\\Phi''(u) > 0$ for all $u \\in \\mathbb{R}$. Then, the projected coordinate descent algorithm applied to $F$ converges to the solution $\\alpha^*$ of the optimization problem $\\min_{\\alpha \\ge 0} F(\\alpha)$. 
If additionally $\\Phi$ is strongly convex over the path of the iterates $\\alpha_t$ then there exists $\\tau > 0$ and $\\nu > 0$ such that for all $t > \\tau$, $F(\\alpha_{t+1}) - F(\\alpha^*) \\le \\big(1 - \\frac{1}{\\nu}\\big)\\big(F(\\alpha_t) - F(\\alpha^*)\\big)$.\n\nSpecifically, this theorem holds for the exponential loss $\\Phi(u) = \\exp(u)$ and the logistic loss $\\Phi(u) = \\log_2(1 + e^u)$ since they are strongly convex over the compact set containing the $\\alpha_t$s.\n\nFigure 4: Illustration of the abstention stumps on a variable X.\n\n4.3 Abstention stumps\nWe first define a family of base hypotheses, abstention stumps, that can be viewed as extensions of the standard boosting stumps to the setting of classification with abstention. An abstention stump $h_{\\theta_1,\\theta_2}$ over the feature $X$ is defined by two thresholds $\\theta_1, \\theta_2 \\in \\mathbb{R}$ with $\\theta_1 \\le \\theta_2$. There are 6 different such stumps, Figure 4 illustrates two of them. For the left figure, points with variables $X$ less than or equal to $\\theta_1$ are labeled negatively, those with $X \\ge \\theta_2$ are labeled positively, and those with $X$ between $\\theta_1$ and $\\theta_2$ are rejected. In general, an abstention stump is defined by the pair $\\big(h_{\\theta_1,\\theta_2}(X), r_{\\theta_1,\\theta_2}(X)\\big)$ where, for Figure 4-left, $h_{\\theta_1,\\theta_2}(X) = -\\mathbb{1}_{X \\le \\theta_1} + \\mathbb{1}_{X > \\theta_2}$ and r\u27131,\u27132(X) = 1\u27131 0, to define a family of base predictor and base rejector pairs of the form $(h(x), \\hat{r}(x))$. Since $\\alpha_t$ is non-negative, the value is needed to correct for over-rejection by previously selected abstention stumps. The can be automatically learned by adding to the set of base pairs the constant functions (h0, r0) = (0,1). 
An ensemble solution returned by the BA\nalgorithm is therefore of the formPt \u21b5tht(x),Pt \u21b5trt(x) where \u21b5ts are the weights assigned to\neach base pair.\nNow, consider a sample of m points sorted by the value of X, which we denote by X1 \uf8ff\u00b7\u00b7\u00b7\uf8ff Xm.\nFor abstention stumps, the derivative of the objective, F , can be further simpli\ufb01ed (see Appendix G)\nsuch that the problem can be reduced to \ufb01nding an abstention stump with the minimal expected\nabstention loss l(\u27131,\u2713 2), that is\n\nmXi=1\n\n\u27131,\u27132\n\nargmin\n\n2Dt(i, 1)[1yi=+11Xi\uf8ff\u27131 + 1yi=11Xi>\u27132] +2Dt(i, 1) cb Dt(i, 2)1\u27131\u27132 +2Dt(i, 1) cb Dt(i, 2)1Xi\uf8ff\u27132. (9)\n\n(8)\n\n+ argmin\n\n\u27132\n\nThe optimization Problems (8) and (9) can be solved in linear time, via a method similar to that\nof \ufb01nding the optimal threshold for a standard zero-one loss boosting stump. When the condition\n\u27131 <\u2713 2 does not hold, we can simply revert to \ufb01nding the minimum of l(\u27131,\u2713 2) in the naive way. In\npractice, we \ufb01nd most often that the optimal solution of Problem 8 and Problem 9 satis\ufb01es \u27131 <\u2713 2.\n\n5 Experiments\n\nIn this section, we present the results of experiments with our abstention stump BA algorithm based\non LSB for several datasets. 
We compare the BA algorithm with the DHL algorithm [1], as well as a confidence-based boosting algorithm TSB. Both of these algorithms are described in further detail in Appendix B.\n\nFigure 5: Average rejection loss on the test set as a function of the abstention cost $c$ for the TSB Algorithm (in orange), the DHL Algorithm (in red) and the BA Algorithm (in blue) based on $L_{\\mathrm{SB}}$. Panels: cod, banknote, pima, haberman, skin, australian; x-axis: Cost; y-axis: Rejection Loss.\n\nWe tested the algorithms on six data sets from UCI\u2019s data repository, specifically australian, cod, skin, banknote, haberman, and pima. For more information about the data sets, see Appendix I. For each data set, we implemented the standard 5-fold cross-validation where we randomly divided the data into training, validation and test set with the ratio 3:1:1. Using a different random partition, we repeated the experiments five times. For all three algorithms, the cost values ranged over $c \\in \\{0.05, 0.1, \\ldots, 0.5\\}$ while threshold $\\gamma$ ranged over $\\gamma \\in \\{0.08, 0.16, \\ldots, 0.96\\}$. For the BA algorithm, the $\\beta$ regularization parameter ranged over $\\beta \\in \\{0, 0.05, \\ldots, 0.95\\}$. 
All experi-\nments for BA were based on T = 200 boosting rounds. The DHL algorithm used polynomial kernels\nwith degree d 2{ 1, 2, 3} and it was implemented in CVX [8]. For each cost c, the hyperparameter\ncon\ufb01guration was chosen to be the set of parameters that attained the smallest average rejection loss\non the validation set. For that set of parameters we report the results on the test set.\nWe \ufb01rst compared the con\ufb01dence-based TSB algorithm with the BA and DHL algorithms (\ufb01rst row\nof Figure 5). The experiments show that, while TSB can sometimes perform better than DHL, in\na number of cases its performance is dramatically worse as a function of c and, in all cases it is\noutperformed by BA. In Appendix J, we give the full set of results for the TSB algorithm.\nIn view of that, our next series of results focus on the BA and DHL algorithms, directly designed to\noptimize the rejection loss, for 3 other datasets (second row of Figure 5). Overall, the \ufb01gures show\nthat BA outperforms the state-of-the-art DHL algorithm for most values of c, thereby indicating that\nBA yields a signi\ufb01cant improvement in practice. We have also successfully run BA on the CIFAR-10\ndata set (boat and horse images) which contains 10,000 instances and we believe that our algorithm\ncan scale to much larger datasets. In contrast, training DHL on such larger samples did not terminate\nas it is based on a costly QCQP. In Appendix J, we present tables that report the average and standard\ndeviation of the abstention loss as well as the fraction of rejected points and the classi\ufb01cation error\non non-rejected points.\n\n6 Conclusion\n\nWe introduced a general framework for classi\ufb01cation with abstention where the predictor and\nabstention functions are learned simultaneously. 
We gave a detailed study of ensemble learning within this framework including: new surrogate loss functions proven to be calibrated, Rademacher complexity margin bounds for ensemble learning of the pair of predictor and abstention functions, a new boosting-style algorithm, the analysis of a natural family of base predictor and abstention functions, and the results of several experiments showing that the BA algorithm yields a significant improvement over the confidence-based algorithms DHL and TSB. Our algorithm can be further extended by considering more complex base pairs such as more general ternary decision trees with rejection leaves. Moreover, our theory and algorithm can be generalized to the scenario of multi-class classification with abstention, which we have already initiated.\n\nAcknowledgments\nThis work was partly funded by NSF CCF-1535987 and IIS-1618662.\n\nReferences\n[1] P. Bartlett and M. Wegkamp. Classification with a reject option using a hinge loss. JMLR, 2008.\n[2] A. Bounsiar, E. Grall, and P. Beauseroy. Kernel based rejection method for supervised classification. In WASET, 2007.\n[3] H. L. Capitaine and C. Frelicot. An optimum class-rejective decision rule and its evaluation. In ICPR, 2010.\n[4] K. Chaudhuri and C. Zhang. Beyond disagreement-based agnostic active learning. In NIPS, 2014.\n[5] C. Chow. An optimum character recognition system using decision function. IEEE Trans. Comput., 1957.\n[6] C. Chow. On optimum recognition error and reject trade-off. IEEE Trans. Comput., 1970.\n[7] C. Cortes, G. DeSalvo, and M. Mohri. Learning with rejection. In ALT, 2016.\n[8] I. CVX Research. CVX: Matlab software for disciplined convex programming, version 2.0, Aug. 2012.\n[9] B. Dubuisson and M. Masson. Statistical decision rule with incomplete knowledge about classes. In PR, 1993.\n[10] R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. JMLR, 2010.\n[11] R. 
El-Yaniv and Y. Wiener. Agnostic selective classification. In NIPS, 2011.\n[12] C. Elkan. The foundations of cost-sensitive learning. In IJCAI, 2001.\n[13] G. Fumera and F. Roli. Support vector machines with embedded reject option. In ICPR, 2002.\n[14] G. Fumera, F. Roli, and G. Giacinto. Multiple reject thresholds for improving classification reliability. In ICAPR, 2000.\n[15] Y. Grandvalet, J. Keshet, A. Rakotomamonjy, and S. Canu. Support vector machines with a reject option. In NIPS, 2008.\n[16] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.\n[17] T. Landgrebe, D. Tax, P. Paclik, and R. Duin. The interaction between classification and reject performance for distance-based reject-option classifiers. PRL, 2005.\n[18] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991.\n[19] M. Littman, L. Li, and T. Walsh. Knows what it knows: A framework for self-aware learning. In ICML, 2008.\n[20] Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 1992.\n[21] I. Melvin, J. Weston, C. S. Leslie, and W. S. Noble. Combining classifiers for improved classification of proteins from sequence or structure. BMC Bioinformatics, 2008.\n[22] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.\n[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. In JMLR, 2011.\n[24] T. Pietraszek. Optimizing abstaining classifiers using ROC analysis. In ICML, 2005.\n[25] C. Santos-Pereira and A. Pires. 
On optimal reject rules and ROC curves. PRL, 2005.\n[26] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135\u2013168, 2000.\n[27] D. Tax and R. Duin. Growing a multi-class classifier with a reject option. In Pattern Recognition Letters, 2008.\n[28] F. Tortorella. An optimal reject rule for binary classifiers. In ICAPR, 2001.\n[29] K. Trapeznikov and V. Saligrama. Supervised sequential classification under budget constraints. In AISTATS, 2013.\n[30] J. Wang, K. Trapeznikov, and V. Saligrama. An LP for sequential learning under budgets. In JMLR, 2014.\n[31] M. Yuan and M. Wegkamp. Classification methods with reject option based on convex risk minimizations. In JMLR, 2010.\n[32] M. Yuan and M. Wegkamp. Support vector machines with a reject option. In Bernoulli, 2011.\n[33] C. Zhang and K. Chaudhuri. The extended Littlestone\u2019s dimension for learning with mistakes and abstentions. In COLT, 2016.\n", "award": [], "sourceid": 914, "authors": [{"given_name": "Corinna", "family_name": "Cortes", "institution": "Google Research"}, {"given_name": "Giulia", "family_name": "DeSalvo", "institution": "New York University"}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": "Courant Institute, NYU & Google"}]}