{"title": "Learning Confidence Sets using Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 4929, "page_last": 4938, "abstract": "The goal of confidence-set learning in the binary classification setting is to construct two sets, each with a specific probability guarantee to cover a class. An observation outside the overlap of the two sets is deemed to be from one of the two classes, while the overlap is an ambiguity region which could belong to either class. Instead of plug-in approaches, we propose a support vector classifier to construct confidence sets in a flexible manner. Theoretically, we show that the proposed learner can control the non-coverage rates and minimize the ambiguity with high probability. Efficient algorithms are developed and numerical studies illustrate the effectiveness of the proposed method.", "full_text": "Learning Con\ufb01dence Sets using Support Vector\n\nMachines\n\nDepartment of Mathematical Sciences\n\nDepartment of Mathematical Sciences\n\nWenbo Wang\n\nBinghamton University\nBinghamton, NY 13902\n\nXingye Qiao*\n\nBinghamton University\nBinghamton, NY 13902\n\nwang2@math.binghamton.edu\n\nqiao@math.binghamton.edu\n\nAbstract\n\nThe goal of con\ufb01dence-set learning in the binary classi\ufb01cation setting [14] is to\nconstruct two sets, each with a speci\ufb01c probability guarantee to cover a class. An\nobservation outside the overlap of the two sets is deemed to be from one of the two\nclasses, while the overlap is an ambiguity region which could belong to either class.\nInstead of plug-in approaches, we propose a support vector classi\ufb01er to construct\ncon\ufb01dence sets in a \ufb02exible manner. Theoretically, we show that the proposed\nlearner can control the non-coverage rates and minimize the ambiguity with high\nprobability. 
Ef\ufb01cient algorithms are developed and numerical studies illustrate the\neffectiveness of the proposed method.\n\n1\n\nIntroduction\n\nIn binary classi\ufb01cation problems, the training data consist of independent and identically distributed\npairs (Xi, Yi), i = 1, 2, ..., n drawn from an unknown joint distribution P , with Xi \u2208 X \u2282 Rp,\nand Yi \u2208 {\u22121, 1}. While the misclassi\ufb01cation rate is a good assessment of the overall classi\ufb01cation\nperformance, it does not directly provide con\ufb01dence for the classi\ufb01cation decision. Lei [14] proposed\na new framework for classi\ufb01ers, named classi\ufb01cation with con\ufb01dence, using notions of con\ufb01dence and\nef\ufb01ciency. In particular, a classi\ufb01er \u03c6(x) therein is set-valued, i.e., the decision may be {\u22121},{1}, or\n{\u22121, 1}. Such a classi\ufb01er corresponds to two overlapped regions in the sample space X , C\u22121 and\nC1, and they satisfy that C\u22121 \u222a C1 = X . With these regions, we have the set-valued classi\ufb01er\n\n\uf8f1\uf8f2\uf8f3{\u22121}, when x \u2208 C\u22121\\C1\n\n{1}, when x \u2208 C1\\C\u22121\n{\u22121, 1}, when x \u2208 C\u22121 \u2229 C1\n\n.\n\n\u03c6(x) =\n\nThose points in the \ufb01rst two sets are classi\ufb01ed to a single class as by traditional classi\ufb01ers. However,\nthose in the overlap receive a decision of {\u22121, 1}, hence may belong to either class. When the option\nof {\u22121, 1} is forbidden, the set-valued classi\ufb01er degenerates to a traditional classi\ufb01er.\nLei [14] de\ufb01ned the notion of con\ufb01dence as the probability 100(1\u2212\u03b1j)% that set Cj covers population\nclass j for j = \u00b11 (recalling the con\ufb01dence interval in statistics). The notion of ef\ufb01ciency is opposite\nto ambiguity, which refers to the size (or probability measure) of the overlapped region named\nthe ambiguity region. 
In this framework, one would like to encourage classi\ufb01ers to minimize the\nambiguity when controlling the non-coverage rates. Lei [14] showed that the best such classi\ufb01er, the\nBayes optimal rule, depends on the conditional class probability function \u03b7(x) = P (Y = 1|X = x).\nLei [14] then proposed to use the plug-in method, namely to \ufb01rst estimate \u03b7(x) using, for instance,\nlogistic regression, then plug the estimation into the Bayes solution. Needless to say, its empirical\nperformance highly depends on the estimation accuracy of \u03b7(x). However, it is well known that the\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\flatter can be more dif\ufb01cult than mere classi\ufb01cation [24, 9, 26], especially when the dimension p is\nlarge [27].\nSupport vector machine [SVM; 5] is a popular classi\ufb01cation method with excellent performance for\nmany real applications. Fern\u00e1ndez-Delgado et al. [7] compared 179 classi\ufb01ers on 121 real data sets\nand concluded that SVM was among the best and most powerful classi\ufb01ers. To avoid estimating the\nconditional class probability \u03b7(x), we propose a support vector classi\ufb01er to construct con\ufb01dence sets\nby empirical risk minimization. Our method is more \ufb02exible as it takes advantage of the powerful\nprediction power of support vector machine.\nWe show in theory that the population minimizer of our optimization is to some extent equivalent to\nthe Bayes optimal rule in [14]. Moreover, in the \ufb01nite-sample case, our classi\ufb01er can control both\nnon-coverage rates while minimizing the ambiguity.\nA closely related problem is the Neyman-Pearson (NP) classi\ufb01cation [4, 19] whose goal is to \ufb01nd a\nboundary for a speci\ufb01c null hypothesis class. 
It aims to minimize the probability that an observation\nfrom the alternative class falls into this region (the type II error) while controlling the type I error,\ni.e., the non-coverage rate for the null class. See Tong et al. [22] for a survey. Our problem can be\nunderstood as a two-sided NP classi\ufb01cation problem. Other related areas of work are conformal\nlearning, set-valued classi\ufb01cation, or classi\ufb01cation with reject and re\ufb01ne options. See [21], [6], [22],\n[23], [11], [2] and [28].\nThe rest of the article is organized as follows. Some background information is provided in Section\n2. Our main method is introduced in Section 3. A comprehensive theoretical study is conducted in\nSection 4, including the Fisher consistency and novel statistical learning theory. In Section 5, we\npresent ef\ufb01cient algorithms to implement our method. The usefulness of our method is demonstrated\nusing simulation and real data in Section 6. Detailed proofs are in the Supplementary Material.\n\n2 Background and notations\n\nWe \ufb01rst formally de\ufb01ne the problem and give some useful notations.\nIt is desirable to keep the ambiguity as small as possible. On the other hand, we would like as many\nclass j observations as possible to be covered by Cj. Consider predetermined non-coverage rates \u03b1\u22121\nand \u03b11 for the two classes. Let P\u22121 and P1 be the probability measure of X conditional on Y = \u22121\nand +1. Conceptually, we formulate classi\ufb01cation with con\ufb01dence as the optimization below.\n\nP (C\u22121 \u2229 C1)\n\nmin\nC\u22121,C1\n\n(1)\nHere the constraint that Pj(Cj) \u2265 1 \u2212 \u03b1j means that 100(1 \u2212 \u03b1j)% of the observations from class j\nshould be covered by region Cj.\n\nsubject to Pj(Cj) \u2265 1 \u2212 \u03b1j, j = \u00b11, C\u22121 \u222a C1 = X .\n\nFigure 1: The left panel shows the two de\ufb01nite regions and the ambiguity region in the case of\nsymmetric Gaussian distributions. 
The right panel illustrates the weight function (see Section 3).\nUnder certain conditions, the Bayes solution of this problem is C∗−1 = {x : η(x) ≤ t−1} and C∗1 = {x : η(x) ≥ t1}, with t−1 and t1 satisfying P−1(η(X) ≤ t−1) = 1 − α−1 and P1(η(X) ≥ t1) = 1 − α1. A simple illustrative toy example with two Gaussian distributions on R is shown in Figure 1. The two boundaries are shown as vertical lines, which lead to three decision regions, {−1}, {+1}, and {−1, +1}. The non-coverage rate α−1 for class −1 is shown on the right tail of the red curve (similarly, α1 for class 1 on the left tail of the blue curve). In reality, the underlying distribution will be more complicated than a simple multivariate Gaussian distribution and the true boundary may be nonlinear. In such cases, flexible approaches such as SVM will work better.\nConfidence sets may be seen as equivalent to classification with reject options [11, 2, 10] via different parameterizations. The Bayes rule in this article is different from the Bayes rule in the literature of classification with reject options. In that context, the Bayes rule depends on a comparison between η(·) and a predetermined cost of rejection d, but it does not lead to a guarantee of the coverage probabilities for the corresponding confidence sets. 
Here instead, the cutoff for the Bayes rule is calibrated to achieve the desired coverage probabilities.\n\n3 Learning confidence sets using SVM\n\nTo avoid estimating η, we propose to solve the empirical counterpart of (1) directly using SVM. Here we present two variants of our method. We start with an original version to illustrate the basic idea, and then introduce an improvement.\nUnlike the regular SVM, the proposed classifier has two (not one) separating boundaries. They are defined as {x : f(x) = −ε} and {x : f(x) = +ε}, where f is the discriminant function and ε ≥ 0. The positive region C1 is {x : f(x) ≥ −ε} and the negative region C−1 is {x : f(x) ≤ ε}. Hence when −ε ≤ f(x) ≤ ε, observation x falls into the ambiguity region, with prediction {−1, 1}.\nDefine R(f, ε) = P(|Y f(X)| ≤ ε), the probability measure of the ambiguity. We may rewrite problem (1) in terms of the function f and the threshold ε,\nmin_{ε∈R+, f} R(f, ε), subject to Pj(Y f(X) < −ε) ≤ αj, j = ±1. (2)\nReplacing the probability measures above by the empirical measures, we obtain\nmin_{ε∈R+, f} (1/n) Σi 1{−ε ≤ f(xi) ≤ ε}, subject to (1/nj) Σ_{i:yi=j} 1{yi f(xi) < −ε} ≤ αj, j = ±1.\nIt is easy to show that, as long as the equalities in the constraints are achieved at the optimum, we obtain the same minimizer if the objective function is changed to (1/n) Σi 1{yi f(xi) − ε ≤ 0}.\nFor efficient and realistic optimization, we replace the indicator function 1{u ≤ 0} in the objective function and constraints by the Hinge loss function (1 − u)+. 
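As a minimal sketch of this construction (the discriminant values and threshold below are illustrative placeholders, not the output of the actual optimization), the set-valued rule and the empirical non-coverage and ambiguity measures can be written as:

```python
import numpy as np

def confidence_sets(f_vals, eps):
    """Map discriminant values f(x) to set-valued predictions.
    C_{+1} = {f >= -eps}, C_{-1} = {f <= eps}; their overlap is the ambiguity."""
    preds = []
    for f in f_vals:
        labels = set()
        if f >= -eps:   # x lies in the positive region C_{+1}
            labels.add(1)
        if f <= eps:    # x lies in the negative region C_{-1}
            labels.add(-1)
        preds.append(labels)
    return preds

def empirical_rates(f_vals, y, eps):
    """Empirical non-coverage rate per class and ambiguity proportion."""
    f_vals, y = np.asarray(f_vals, float), np.asarray(y)
    ambiguity = np.mean(np.abs(f_vals) <= eps)  # estimates P(|f(X)| <= eps)
    noncov = {j: np.mean(y[y == j] * f_vals[y == j] < -eps) for j in (-1, 1)}
    return noncov, ambiguity
```

A point with |f(x)| ≤ ε receives both labels, matching the ambiguity region defined above.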
The practice of using a surrogate loss to bound the non-coverage rates has been widely used in the literature of NP classification; see [19]. To simplify the presentation, we denote by Ha(u) = (1 + a − u)+ the a-Hinge loss; Ha coincides with the original Hinge loss when a = 0. Our initial classifier is given by the following optimization:\nmin_{ε∈R+, f} (1/n) Σi Hε(yi f(xi)) + λJ(f), subject to (1/nj) Σ_{i:yi=j} H−ε(yi f(xi)) ≤ αj, j = ±1. (3)\nHere J is a regularization term to control the complexity of the discriminant function f. When f takes the linear form f(x) = x′β + b, J(f) can be the L2-norm ∥β∥2 or the L1-norm |β|.\nIn SVM, yf(x) is called the functional margin, which measures the signed distance from x to the boundary {x : f(x) = 0}. A large positive value of yf(x) means the observation is correctly classified and far away from the boundary. In our situation, we compare yf(x) with +ε and −ε respectively. If yf(x) < −ε, then x is not covered by Cy (hence is misclassified, in the classification language). On the other hand, if yf(x) ≤ ε, then x either satisfies yf(x) < −ε as above, or falls into the ambiguity region, which is why we minimize the sum of Hε(yi f(xi)).\nBy constraining Σ_{i:yi=j} H−ε(yi f(xi)) for both classes, we aim to control the non-coverage rates. Since H−ε(u) ≥ 1{u < −ε} (the latter indicates the occurrence of non-coverage), with a large gap for negatively large u, using the Hinge loss in place of the indicator function 1{yi f(xi) < −ε} in the constraint may be conservative. We alleviate this problem by imposing a weight wi on each observation in the constraint. 
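The a-Hinge loss Ha(u) = (1 + a − u)+ and the adaptive weight wi = max{1, H−ε(yi ˆf(xi))}−1 used in this constraint admit a direct implementation; in this sketch the pilot margins yi ˆf(xi) are assumed given (e.g., from a previous fit):

```python
import numpy as np

def a_hinge(u, a):
    """H_a(u) = (1 + a - u)_+; reduces to the usual hinge loss when a = 0."""
    return np.maximum(0.0, 1.0 + a - u)

def adaptive_weights(pilot_margins, eps):
    """w_i = max{1, H_{-eps}(y_i f_hat(x_i))}^{-1} from pilot margins y_i f_hat(x_i).
    The weighted loss w_i * H_{-eps}(u) is capped at 1 for very negative margins,
    so it tracks the indicator 1{u < -eps} more closely than the raw hinge."""
    return 1.0 / np.maximum(1.0, a_hinge(pilot_margins, -eps))
```

For a margin of −5 and ε = 0.2, for instance, the unweighted loss H−ε is 5.8 while the weighted loss is capped at 1.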
In particular, this weight is chosen to be wi = max{1, H−ε(y ˆf(x))}−1, where ˆf is a reasonable guess of the final minimizer f. Our goal is to weight the Hinge loss in the constraint, wi H−ε(yi f(xi)), so that it approximates the indicator function 1{yi f(xi) < −ε}. This is illustrated in Figure 1, in which the blue bold line, the result of multiplying the weight (red dashed) by the Hinge loss (purple dotted), is close to the indicator function (black dot-dashed). Note that by weighting the Hinge loss, the impact of an observation with a very negatively large value of u = yf(x) is reduced to 1. The adaptively weighted version of our method changes the constraint in (3) to (1/nj) Σ_{i:yi=j} wi H−ε(yi f(xi)) ≤ αj, j = ±1.\nIn practice, we adopt an iterative approach and use the estimated f from the previous iteration to calculate the weight of each observation at the current iteration. We start with equal weights, solve the optimization problem with the weights obtained in the last iteration, and then calculate the new weights for the next iteration. [25] first used this idea in their work on adaptively weighted large margin classifiers for robust classification.\n\n4 Theoretical Properties\n\nIn this section we study the theoretical properties of the proposed method. We start with population-level properties in Section 4.1. In Section 4.2, we discuss finite-sample properties using novel statistical learning theory.\n\n4.1 Fisher consistency and excess risk\n\nAssume that P−1 and P1 are continuous with density functions p−1 and p1, and that πj = P(Y = j) is positive for j = ±1. Moreover, η(X) is continuous and has a positive density almost everywhere, and t−1 and t1 are quantiles of η(X). 
They satisfy P−1(η(X) ≤ t−1) = 1 − α−1 and P1(η(X) ≥ t1) = 1 − α1. We need assumptions on the difficulty level of the classification task. In particular, the classification should be difficult enough that the overlap region is meaningful (otherwise, there will be almost no ambiguity even at small non-coverage rates).\nAssumption 1. t−1 ≥ 1/2 ≥ t1.\nAssumption 2. There exists c > 0 such that t−1 − c ≥ 1/2 ≥ t1 + c.\nEach assumption implies that the union of C∗−1 = {x : η(x) ≤ t−1} and C∗1 = {x : η(x) ≥ t1} is X . Otherwise, there would be a gap around the boundary {x : η(x) = 1/2}. It is easy to see that Assumption 2 is stronger than Assumption 1.\nFisher consistency concerns the Bayes optimal rule, which is the minimizer of problem (2). In (4) below, we replace the loss function in the objective of (2) with the risk under the Hinge loss,\nmin RH(f, ε), subject to Pj(Y f(X) < −ε) ≤ αj, j = ±1, (4)\nwhere RH(f, ε) = E[Hε(Y f(X))].\nTheorem 1 shows that for any fixed ε, the minimizer of (4) is the same as the Bayes rule [14].\nTheorem 1. Under Assumption 1, for any fixed ε ≥ 0, the function\nf∗(x) = 1 + ε when η(x) > t−1; ε · sign(η(x) − 1/2) when t1 ≤ η(x) ≤ t−1; −(1 + ε) when η(x) < t1,\nis the minimizer of (4) and a minimizer of (2).\nA key result in the machine learning literature (such as [3], [30] or [2]) is that the excess risk under the 0-1 classification loss is bounded by the excess risk under a surrogate loss. Here we show a similar result for the confidence set problem. That is, the excess ambiguity R(f, ε) − R(f∗, ε) vanishes as RH(f, ε) − RH(f∗, ε) goes to 0.\nTheorem 2. 
Under Assumption 2, for any ε ≥ 0 and any f satisfying the constraints in (2), there exists C′ = 1/(4c²) + 1/(2c) > 0 such that the following inequality holds:\nC′ (RH(f, ε) − RH(f∗, ε)) ≥ R(f, ε) − R(f∗, ε).\nNote that C′ does not depend on ε.\n\n4.2 Finite-sample properties\n\nDenote the Reproducing Kernel Hilbert Space (RKHS) with bounded norm as HK(s) = {f : X → R | f(x) = h(x) + b, h ∈ HK, ∥h∥HK ≤ s, b ∈ R}, and let r = sup_{x∈X} K(x, x). For a fixed ε, define the space of constrained discriminant functions F_ε((α−1, α1)) = {f : X → R | E(H−ε(Y f(X)) | Y = j) ≤ αj, j = ±1}, and its empirical counterpart ˆF_ε((α−1, α1)) = {f : X → R | (1/nj) Σ_{i:yi=j} H−ε(yi f(xi)) ≤ αj, j = ±1}. Moreover, we define the feasible function space F_ε(κ, s) = HK(s) ∩ F_ε((α−1 − κ/√(n−1), α1 − κ/√(n1))) and its empirical counterpart ˆF_ε(κ, s) = HK(s) ∩ ˆF_ε((α−1 − κ/√(n−1), α1 − κ/√(n1))). Lastly, consider a subset of the Cartesian product of the feasible function space and the space for ε, F(κ, s) = {(f, ε) : f ∈ F_ε(κ, s), ε ≥ 0}, and its empirical counterpart ˆF(κ, s) = {(f, ε) : f ∈ ˆF_ε(κ, s), ε ≥ 0}. Then optimization problem (3) of our proposed method can be written as\nmin_{(f,ε)∈ˆF(0,s)} (1/n) Σi Hε(yi f(xi)). (5)\nIn Theorem 3, we give the finite-sample upper bound for the non-coverage rate.\nTheorem 3. 
Let (f, ε) be a solution to optimization problem (5). Let Z(n) = √(sr)/√n, Tn(ζ) = {2sr log(1/ζ)/n}^{1/2}, and r = sup_{x∈X} K(x, x). Then with probability at least 1 − 2ζ,\nPj(Y f(X) < −ε) ≤ E[H−ε(Y f(X)) | Y = j] ≤ (1/nj) Σ_{i:yi=j} H−ε(yi f(xi)) + 3T_{nj}(ζ) + Z(nj).\nTheorem 3 suggests that if we want to control the non-coverage rates at the nominal α−1 or α1 levels with high probability, we should choose the α−1 or α1 values in optimization (3) to be slightly smaller than the desired ones in practice. In particular, we need to make (1/nj) Σ_{i:yi=j} H−ε(yi f(xi)) + 3T_{nj}(ζ) + Z(nj) ≤ αj. Note that the remainder terms 3T_{nj}(ζ) + Z(nj) vanish as n−1, n1 → ∞.\nThe next theorem ensures that the empirical ambiguity from solving (5) on a finite sample converges to the ambiguity given by the solution on an infinite sample (under the constraints E(H−ε(Y f(X)) | Y = j) ≤ αj, j = ±1).\nTheorem 4. Let (ˆf, ˆε) be the solution of the optimization problem\nmin_{(f,ε)∈ˆF(κ,s)} (1/n) Σi Hε(yi f(xi)), (6)\nwith κ = (6 log(1/ζ) + 1)√(sr). Then with probability 1 − 6ζ, for large enough n−1 and n1, we have\n(i) ˆf ∈ F_{ˆε}(0, s), and\n(ii) RH(ˆf, ˆε) − min_{(f,ε)∈F(0,s)} RH(f, ˆε) ≤ κ(2n^{−1/2} + 4 min{α−1, α1}^{−1} min{√(n−1), √(n1)}^{−1}).\nIn our study we analyze formulation (5), where J(f) appears in the constraint, instead of the regularized formulation (3), for technical convenience. This comes at the price of a fixed upper bound s on J(f). 
We can revise the statements of Theorems 3 and 4 so that s increases with n to infinity (at the price of a slower convergence rate). It is also possible to derive the results for the regularized version based on (3): since at optimality it is easy to show that J(f) ≤ 2/λ (the objective is at most 2 when f ≡ 0 and ε = 1), we may rewrite s in Theorem 3 in terms of λ.\n\n5 Algorithms\n\nIn this section, we give details of the algorithm. Similar to the SVM implementation, we propose to solve the dual problem. We start with the linear SVM with the L2 norm for illustrative purposes. After introducing two sets of slack variables, ηi = (1 − ε − yi(xi′β + b))+ and ξi = (1 + ε − yi(xi′β + b))+, we can show that (3) is equivalent to (7),\nmin_Θ (1/2)∥β∥²₂ + λ′ Σi ξi, (7)\nsubject to yi(xi′β + b) ≥ 1 + ε − ξi, yi(xi′β + b) ≥ 1 − ε − ηi, ξi ≥ 0, ηi ≥ 0, for all i = 1, 2, ..., n,\nΣ_{yi=−1} wi ηi ≤ n−1 α−1, Σ_{yi=1} wi ηi ≤ n1 α1, ε ≥ 0.\nHere Θ is the collection of all variables of interest, namely Θ = {ε, β, b, {ξi}, {ηi}}. 
We can then solve it via the quadratic programming problem below,\nmin_{Θ′} (1/2) Σi Σj (ζi + τi)(ζj + τj) yi yj xi′xj − Σi ζi − Σi τi + n−1 α−1 θ−1 + n1 α1 θ1, (8)\nsubject to 0 ≤ ζi ≤ λ′, 0 ≤ τi ≤ θ_{yi} wi, Σi ζi yi + Σi τi yi = 0, Σi ζi − Σi τi ≥ 0.\nHere Θ′ = {{ζi}, {τi}, θ−1, θ1} consists of all the variables in the dual problem. The above optimization may be solved by any efficient quadratic programming routine. After solving the dual problem, we can find β via β = Σi ζi yi xi + Σi τi yi xi. We can then plug β into the primal problem and find b and ε by linear programming.\nFor nonlinear f, we can adopt the widely used 'kernel trick'. Assume f belongs to a Reproducing Kernel Hilbert Space (RKHS) with a positive definite kernel K, so that f(x) = Σi ci K(xi, x) + b. In this case the dual problem is the same as above, except that xi′xj is replaced by K(xi, xj). After the solution has been found, we have ci = ζi + τi. Common choices for the kernel function include the Gaussian kernel and the polynomial kernel.\n\n6 Numerical Studies\n\nIn this section, we compare our confidence-support vector machine (CSVM) method with methods based on the plug-in principle, including L2-penalized logistic regression [12], kernel logistic regression [31], kNN [1], random forest [15] and SVM [5], using both simulated and real data. In the study, we use the solver Cplex to solve the quadratic programming problem arising in CSVM. 
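For the kernelized discriminant f(x) = Σi ci K(xi, x) + b, evaluating f on new points only requires a kernel matrix. A minimal sketch with the Gaussian kernel follows; the coefficients c and intercept b here are illustrative placeholders rather than solutions of the dual problem:

```python
import numpy as np

def gaussian_kernel(X, Z, rho):
    """K(x, z) = exp(-||x - z||^2 / rho^2) for all pairs of rows of X and Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / rho ** 2)

def decision_values(X_train, c, b, X_new, rho):
    """f(x) = sum_i c_i K(x_i, x) + b, evaluated at each row of X_new."""
    K = gaussian_kernel(X_train, X_new, rho)  # shape (n_train, n_new)
    return c @ K + b
```

Replacing `gaussian_kernel` by a polynomial kernel leaves `decision_values` unchanged, mirroring how the dual only changes through K(xi, xj).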
For other methods, we use the existing R packages glmnet, gelnet, class, randomForest and e1071.\n\n6.1 Simulation\n\nWe study the numerical performance over a large variety of sample sizes. In each case, an independent tuning set with the same sample size as the training set is generated for parameter tuning. The testing set has 20000 observations (10000 or nearly 10000 for each class). We run the simulation multiple times (1,000 times for Example 1 and 100 times for Examples 2 and 3) and report the average and standard error. Both non-coverage rates are set to 0.05.\nWe select the best parameter λ and the hyper-parameter for kernel methods as follows. We search for the optimal ρ in the Gaussian kernel exp(−∥x − y∥²/ρ²) over the grid 10^{−0.5, −0.25, 0, 0.25, 0.5, 0.75, 1} and the optimal degree for the polynomial kernel from {2, 3, 4}. For each fixed candidate hyper-parameter, we choose λ from a grid of candidate values ranging from 10^{−4} to 10^{2} by the following two-step search. We first do a rough search with a larger stride, {10^{−4}, 10^{−3.5}, . . . , 10^{2}}, and get the best parameter λ1. Then we do a fine search over λ1 × {10^{−0.5}, 10^{−0.4}, . . . , 10^{0.5}}. After that, we choose the optimal pair which gives the smallest tuning ambiguity and has the two non-coverage rates for the tuning set controlled.\nTo adapt traditional classification methods to the confidence set learning problem, we use the plug-in principle [14]. To improve the performance, we make use of the suggested robust implementation in [14] for all the methods. 
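This robust implementation can be sketched in a few lines: the cutoffs are sample quantiles of the estimated scores on the tuning set, chosen so that the tuning-set non-coverage rates match the nominal ones. The eta values below stand in for the output of any of the estimators above:

```python
import numpy as np

def plugin_thresholds(eta_tune, y_tune, alpha_m1, alpha_p1):
    """Choose cutoffs so the tuning-set non-coverage matches the nominal rates.
    C_{-1} = {eta <= t_m1} covers a 1 - alpha_m1 fraction of class -1;
    C_{+1} = {eta >= t_p1} covers a 1 - alpha_p1 fraction of class +1."""
    eta_tune, y_tune = np.asarray(eta_tune, float), np.asarray(y_tune)
    t_m1 = np.quantile(eta_tune[y_tune == -1], 1.0 - alpha_m1)
    t_p1 = np.quantile(eta_tune[y_tune == 1], alpha_p1)
    return t_m1, t_p1

def predict_sets(eta_new, t_m1, t_p1):
    """Set-valued predictions induced by thresholding eta at the two cutoffs."""
    return [set(lbl for lbl, ok in ((-1, e <= t_m1), (1, e >= t_p1)) if ok)
            for e in np.asarray(eta_new, float)]
```

The same recipe applies to a monotone proxy, such as an SVM discriminant score, in place of eta.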
Specifically, we first obtain an estimate of η (such as by logistic regression, kernel logistic regression, kNN or random forest) or a monotone proxy of it (such as the discriminant function f in CSVM and SVM), then choose thresholds ˆt−1 and ˆt1, two sample quantiles of ˆη(x) (or f(x)) over the tuning set, so that the non-coverage rates on the tuning set match the nominal rates. The final predicted sets are induced by thresholding ˆη(x) (or f(x)) at ˆt−1 and ˆt1.\nBecause there are two non-coverage rates and one ambiguity size to compare, making a fair comparison is tricky, since one classifier can sacrifice the non-coverage rate to gain in ambiguity. One by-product of the robust implementation above is that the non-coverage rates of most of the methods become very similar, so we only need to compare the size of the ambiguity. We also include a naive SVM approach ('SVM_r' in the plots below) whose discriminant function is obtained in the traditional way, but which induces confidence sets by thresholding in the same way described above.\nWe consider three different simulation scenarios. In the first scenario we compare the linear approaches (SVM and penalized logistic regression), while in the next two cases we consider nonlinear methods. In all cases, we add additional noise dimensions to the data. These noise covariates are normally distributed with mean 0 and Σ = diag(1/p), where p is the total dimension of the data.\nFigure 2: Scatter plots of the first two dimensions for the simulated data with Bayes rules showing the two definite regions and the ambiguity region.\nExample 1 (Linear model with nonlinear Bayes rule): In this scenario, we have two normally distributed classes with different covariance matrices. 
In particular, denote X|Y = j ∼ N(µj, Σj) for j = ±1; then µ−1 = (−2, 1)′, µ1 = (1, 0)′, Σ−1 = diag(2, 1/2), and Σ1 = diag(1/2, 2). The prior probabilities of both classes are the same. Lastly, we add eight dimensions of noise covariates to the data. The data are illustrated in the left panel of Figure 2. We compare linear CSVM with the plug-in methods, L2-penalized logistic regression [8] and naive linear SVM, to estimate η.\nExample 2 (Moderate-dimensional polynomial boundary): This case is similar to the one in [29]. First we generate x1 ∼ Unif[−1, 1] and x2 ∼ Unif[−1, 1]. Define functions fj(x) = j(−3.6x1² + 7.2x2² − 0.8), j = ±1. Then we set η(x) = f1(x)/(f−1(x) + f1(x)), where x = (x1, x2). We then add 98 noise covariates on top of the 2-dimensional signal. The data are illustrated in the middle panel of Figure 2. In this scenario, we use the polynomial kernel for all the kernel-based methods.\nExample 3 (High-dimensional donut): We first generate two-dimensional data (ri, θi), where θi ∼ Unif[0, 2π], ri|(Y = −1) ∼ Unif[0, 1.2], and ri|(Y = +1) ∼ Unif[0.8, 2]. Then we define the two-dimensional Xi = (ri cos(θi), ri sin(θi)). The data are illustrated in the right panel of Figure 2. We then add 498 noise covariates on top of the 2-dimensional signal. We use the Gaussian kernel, K(x, y; ρ) = exp(−∥x − y∥²/ρ²), for all the kernel-based methods.\nOur methods are improved using the robust implementation. The results are reported in Figure 3. We also show the performance of CSVM with weighting but without the robust implementation. For Example 1, our CSVM method gives a significantly smaller ambiguity than either logistic regression or naive SVM. 
In Example 2 and Example 3, our method gives a smaller or at least comparable\nambiguity to the best plug-in method, which is kernel logistic regression. Our weighted CSVM\nperforms the best when sample size is small in the linear case and it outperforms kNN, Random\nForest and naive SVM in nonlinear cases. It is not surprising that the naive SVM method performs\nsigni\ufb01cantly worse than all other methods in the nonlinear settings, as the hinge loss is well known\nto not lead to consistent estimates for class probabilities (see [18]). The non-coverage rates (not\nshown here) of CSVM, random forest, kernel logistic regression and naive SVM methods are close to\neach other while CSVM without robust implementation and kNN have similar non-coverage rates. A\ndetailed comparison can be found in the Supplementary Material.\n\n6.2 Real Data Analysis\n\nWe conduct the comparison on the hand-written zip code data [13]. The data set consists of many\n16 \u00d7 16 pixel images of handwritten digits. It is widely used in the classi\ufb01cation literature. There are\n\n7\n\n\u22124\u2212202\u22122\u2212101234x1x2\u22121.0\u22120.50.00.51.0\u22121.0\u22120.50.00.51.0x1x2\u22122\u22121012\u22122\u22121012x1x2Class \u22121Class +1\fboth training and testing sets de\ufb01ned in it. Lei [14] used the same dataset for illustrating the plug-in\nmethods. We choose this dataset to directly compare with the plug-in methods.\nFollowing Lei [14], to form a binary classi\ufb01cation problem, we use the subset of the data containing\ndigits {0, 6, 8, 9}. Images with digits 0, 6, 9 are labeled as class \u22121 (they are digits with one circle)\nand those with digit 8 (two circles) are labeled as class +1. Previous studies [21] pointed out that\nthere was discrepancies between the training and testing set of this data set. 
So in this study we \ufb01rst\nmixed the training and testing data and then randomly split into new training, tuning and testing data.\nThe training and tuning data both have sample size 800, with 600 from class \u22121 and 200 from class 1\nto preserve the unbalance nature of the data set. During training, we oversample class 1 by counting\neach observation three times to alleviate the unbalanced classes issue.\nAlthough Lei [14] set both nominal non-coverage rates to be 0.05 in their study which focused on\nlinear methods, it needs to be pointed out that many nonlinear classi\ufb01ers, such as SVM with Gaussian\nkernel, can achieve this non-coverage rate without introducing any ambiguity. Therefore we reduce\nthe non-coverage rate to 0.01 for both classes to make the task more challenging.\nWe apply Gaussian kernel for CSVM, and compare with kernel logistic regression with Gaussian\nkernel, random forest, kNN and naive SVM with Gaussian kernel on this data set.\n\nFigure 4: An illustration of CSVM method using t-SNE. The left penal shows the true labels, and the\nright panel the predicted label for weighted CSVM.\n\nThe results are summarized in Table 1 with numbers in percentage. CSVM gives better results than\nall the plug-in methods. We plot the zip code data using t-distributed stochastic neighbor embedding\n(t-SNE) [17] to give a visualization of our method and the data.\n\nFigure 3: Outcome of ambiguities in three simulation settings. Non-coverage rates are similar among\ndifferent methods and are not shown here. 
CSVM has the smallest ambiguity.

Classifier        CSVM          CSVM(r)      KNN(r)        KLR(r)       RF(r)        naive SVM(r)
Non-coverage(-1)  0.05(0.005)   1.02(0.05)   0.81(0.04)    0.98(0.05)   0.95(0.04)   1.00(0.05)
Non-coverage(+1)  0.56(0.06)    1.19(0.11)   1.04(0.09)    1.25(0.10)   1.10(0.11)   1.27(0.11)
Ambiguity         8.29(0.18)    2.52(0.13)   10.21(2.12)   3.46(0.17)   7.55(0.37)   2.66(0.13)

Table 1: CSVM gives better or comparable outcomes to the best plug-in method.

It can be seen that the ambiguity region mainly lies on the boundary between the two classes. In particular, it covers those points which appear to be closer to the class other than the one they really belong to. Moreover, the union of the ambiguity region and the predicted region for either class covers almost all the ground of that class (defined by the true labels). This is not surprising, since the non-coverage rate of CSVM is set to the small value of 1% in this case.

7 Conclusion and future works

In this work, we propose to learn confidence sets using support vector machines. Instead of a plug-in approach, we use empirical risk minimization to train the classifier. Theoretical studies have shown the effectiveness of our approach in controlling the non-coverage rates and minimizing the ambiguity. We make use of many well-understood advantages of the SVM to solve the problem. For instance, the 'kernel trick' allows more flexibility and empowers us to conduct classification in nonlinear cases.
The hinge loss is not the only surrogate loss that can be used.
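To make the comparison of surrogate losses concrete, the hinge loss and the logistic (deviance) loss, another common large-margin surrogate, can be evaluated on the functional margin u = y f(x). This is a hedged sketch for illustration only, not the authors' implementation:

```python
import math

def hinge_loss(u):
    """Hinge loss on the functional margin u = y * f(x)."""
    return max(0.0, 1.0 - u)

def logistic_loss(u):
    """Logistic (deviance) surrogate loss on the same margin."""
    return math.log(1.0 + math.exp(-u))

# Both losses penalize negative margins (misclassifications) heavily,
# but the hinge loss is exactly zero once the margin reaches 1,
# while the logistic loss remains positive everywhere.
for u in (-1.0, 0.0, 1.0, 2.0):
    print(u, hinge_loss(u), round(logistic_loss(u), 4))
```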
There are many other useful loss functions with good properties in different scenarios [16].
Confidence set learning for the multi-class case is also an interesting direction for future work. It has a natural connection to the literature on multi-class classification with confidence [20], classification with reject and refine options [28], and conformal learning [21].

References

[1] Naomi S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
[2] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(Aug):1823–1840, 2008.
[3] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[4] Adam Cannon, James Howse, Don Hush, and Clint Scovel. Learning with the Neyman-Pearson and min-max criteria. Los Alamos National Laboratory, Tech. Rep. LA-UR-02-2951, 2002.
[5] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[6] Christophe Denis and Mohamed Hebiri. Confidence sets with expected sizes for multiclass classification. arXiv preprint arXiv:1608.08783, 2016.
[7] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181, 2014.
[8] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
[9] Johannes Fürnkranz and Eyke Hüllermeier. Preference learning: An introduction. In Preference Learning, pages 1–17.
Springer, 2010.
[10] Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, and Stéphane Canu. Support vector machines with a reject option. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 537–544. Curran Associates, Inc., 2009. URL http://papers.nips.cc/paper/3594-support-vector-machines-with-a-reject-option.pdf.
[11] Radu Herbei and Marten H. Wegkamp. Classification with reject option. Canadian Journal of Statistics, 34(4):709–721, 2006.
[12] Saskia Le Cessie and Johannes C. Van Houwelingen. Ridge estimators in logistic regression. Applied Statistics, pages 191–201, 1992.
[13] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[14] Jing Lei. Classification with confidence. Biometrika, page asu038, 2014.
[15] Andy Liaw, Matthew Wiener, et al. Classification and regression by randomForest. R News, 2(3):18–22, 2002.
[16] Yufeng Liu, Hao Helen Zhang, and Yichao Wu. Hard or soft classification? Large-margin unified machines. Journal of the American Statistical Association, 106(493):166–177, 2011.
[17] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[18] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
[19] Philippe Rigollet and Xin Tong. Neyman-Pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 12(Oct):2831–2855, 2011.
[20] Mauricio Sadinle, Jing Lei, and Larry Wasserman.
Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, (just-accepted), 2017.
[21] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371–421, 2008.
[22] Xin Tong, Yang Feng, and Anqi Zhao. A survey on Neyman-Pearson classification and suggestions for future research. Wiley Interdisciplinary Reviews: Computational Statistics, 8(2):64–81, 2016.
[23] Vladimir Vovk, Ilia Nouretdinov, Valentina Fedorova, Ivan Petej, and Alex Gammerman. Criteria of efficiency for set-valued classification. Annals of Mathematics and Artificial Intelligence, pages 1–26, 2017.
[24] Junhui Wang, Xiaotong Shen, and Yufeng Liu. Probability estimation for large-margin classifiers. Biometrika, 95(1):149–167, 2007.
[25] Yichao Wu and Yufeng Liu. Adaptively weighted large margin classifiers. Journal of Computational and Graphical Statistics, 22(2):416–432, 2013.
[26] Yichao Wu, Hao Helen Zhang, and Yufeng Liu. Robust model-free multiclass probability estimation. Journal of the American Statistical Association, 105(489):424–436, 2010.
[27] Chong Zhang and Yufeng Liu. Multicategory large-margin unified machines. The Journal of Machine Learning Research, 14(1):1349–1386, 2013.
[28] Chong Zhang, Wenbo Wang, and Xingye Qiao. On reject and refine options in multicategory classification. Journal of the American Statistical Association, 2017. Accepted.
[29] Hao Helen Zhang, Yufeng Liu, Yichao Wu, Ji Zhu, et al. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics, 2:149–167, 2008.
[30] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56–85, 2004.
[31] Ji Zhu and Trevor Hastie.
Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14(1):185–205, 2005.