{"title": "Large-scale probabilistic predictors with and without guarantees of validity", "book": "Advances in Neural Information Processing Systems", "page_first": 892, "page_last": 900, "abstract": "This paper studies theoretically and empirically a method of turning machine-learning algorithms into probabilistic predictors that automatically enjoys a property of validity (perfect calibration) and is computationally efficient. The price to pay for perfect calibration is that these probabilistic predictors produce imprecise (in practice, almost precise for large data sets) probabilities. When these imprecise probabilities are merged into precise probabilities, the resulting predictors, while losing the theoretical property of perfect calibration, are consistently more accurate than the existing methods in empirical studies.", "full_text": "Large-scale probabilistic predictors with and without guarantees of validity\n\nVladimir Vovk\u2217, Ivan Petej\u2217, and Valentina Fedorova\u2020\n\n\u2217Department of Computer Science, Royal Holloway, University of London, UK\n\n{volodya.vovk,ivan.petej,alushaf}@gmail.com\n\n\u2020Yandex, Moscow, Russia\n\nAbstract\n\nThis paper studies theoretically and empirically a method of turning machine-learning algorithms into probabilistic predictors that automatically enjoys a property of validity (perfect calibration) and is computationally efficient. The price to pay for perfect calibration is that these probabilistic predictors produce imprecise (in practice, almost precise for large data sets) probabilities. When these imprecise probabilities are merged into precise probabilities, the resulting predictors, while losing the theoretical property of perfect calibration, are consistently more accurate than the existing methods in empirical studies.\n\n1 Introduction\n\nPrediction algorithms studied in this paper belong to the class of Venn\u2013Abers predictors, introduced in [1]. 
They are based on the method of isotonic regression [2] and prompted by the observation that when applied in machine learning the method of isotonic regression often produces miscalibrated probability predictions (see, e.g., [3, 4]); it has also been reported ([5], Section 1) that isotonic regression is more prone to overfitting than Platt\u2019s scaling [6] when data is scarce. The advantage of Venn\u2013Abers predictors is that they are a special case of Venn predictors ([7], Chapter 6), and so ([7], Theorem 6.6) are always well-calibrated (cf. Proposition 1 below). They can be considered to be a regularized version of the procedure used by [8], which helps them resist overfitting.\nThe main desiderata for Venn (and related conformal, [7], Chapter 2) predictors are validity, predictive efficiency, and computational efficiency. This paper introduces two computationally efficient versions of Venn\u2013Abers predictors, which we refer to as inductive Venn\u2013Abers predictors (IVAPs) and cross-Venn\u2013Abers predictors (CVAPs). 
The ways in which they achieve the three desiderata are:\n\n\u2022 Validity (in the form of perfect calibration) is satisfied by IVAPs automatically, and the experimental results reported in this paper suggest that it is inherited by CVAPs.\n\u2022 Predictive efficiency is determined by the predictive efficiency of the underlying learning algorithms (so that the full arsenal of methods of modern machine learning can be brought to bear on the prediction problem at hand).\n\u2022 Computational efficiency is, again, determined by the computational efficiency of the underlying algorithm; the computational overhead of extracting probabilistic predictions consists of sorting (which takes time O(n log n), where n is the number of observations) and other computations taking time O(n).\n\nAn advantage of Venn prediction over conformal prediction, which also enjoys validity guarantees, is that Venn predictors output probabilities rather than p-values, and probabilities, in the spirit of Bayesian decision theory, can be easily combined with utilities to produce optimal decisions.\n\nIn Sections 2 and 3 we discuss IVAPs and CVAPs, respectively. Section 4 is devoted to minimax ways of merging imprecise probabilities into precise probabilities and thus making IVAPs and CVAPs precise probabilistic predictors.\nIn this paper we concentrate on binary classification problems, in which the objects to be classified are labelled as 0 or 1. Most machine learning algorithms are scoring algorithms, in that they output a real-valued score for each test object, which is then compared with a threshold to arrive at a categorical prediction, 0 or 1. As precise probabilistic predictors, IVAPs and CVAPs are ways of converting the scores for test objects into numbers in the range [0, 1] that can serve as probabilities, or calibrating the scores. 
In Section 5 we briefly discuss two existing calibration methods, Platt\u2019s [6] and the method [8] based on isotonic regression. Section 6 is devoted to experimental comparisons and shows that CVAPs consistently outperform the two existing methods (more extensive experimental studies can be found in [9]).\n\n2 Inductive Venn\u2013Abers predictors (IVAPs)\n\nIn this paper we consider data sequences (usually loosely referred to as sets) consisting of observations z = (x, y), each observation consisting of an object x and a label y \u2208 {0, 1}; we only consider binary labels. We are given a training set whose size will be denoted l.\nThis section introduces inductive Venn\u2013Abers predictors. Our main concern is how to implement them efficiently, but as a function, an IVAP is defined in terms of a scoring algorithm (see the last paragraph of the previous section) as follows:\n\n\u2022 Divide the training set of size l into two subsets, the proper training set of size m and the calibration set of size k, so that l = m + k.\n\u2022 Train the scoring algorithm on the proper training set.\n\u2022 Find the scores s1, . . . , sk of the calibration objects x1, . . . , xk.\n\u2022 When a new test object x arrives, compute its score s. Fit isotonic regression to (s1, y1), . . . , (sk, yk), (s, 0) obtaining a function f0. Fit isotonic regression to (s1, y1), . . . , (sk, yk), (s, 1) obtaining a function f1. The multiprobability prediction for the label y of x is the pair (p0, p1) := (f0(s), f1(s)) (intuitively, the prediction is that the probability that y = 1 is either f0(s) or f1(s)).\n\nNotice that the multiprobability prediction (p0, p1) output by an IVAP always satisfies p0 < p1, and so p0 and p1 can be interpreted as the lower and upper probabilities, respectively; in practice, they are close to each other for large training sets.\nFirst we state formally the property of validity of IVAPs (adapting the approach of [1] to IVAPs). A random variable P taking values in [0, 1] is perfectly calibrated (as a predictor) for a random variable Y taking values in {0, 1} if E(Y | P ) = P a.s. A selector is a random variable taking values in {0, 1}. As a general rule, in this paper random variables are denoted by capital letters (e.g., X are random objects and Y are random labels).\nProposition 1. Let (P0, P1) be an IVAP\u2019s prediction for X based on a training sequence (X1, Y1), . . . , (Xl, Yl). There is a selector S such that PS is perfectly calibrated for Y provided the random observations (X1, Y1), . . . , (Xl, Yl), (X, Y ) are i.i.d.\n\nOur next proposition concerns the computational efficiency of IVAPs; Proposition 1 will be proved later in this section while Proposition 2 is proved in [9].\nProposition 2. Given the scores s1, . . . , sk of the calibration objects, the prediction rule for computing the IVAP\u2019s predictions can be computed in time O(k log k) and space O(k). Its application to each test object takes time O(log k). Given the sorted scores of the calibration objects, the prediction rule can be computed in time and space O(k).\n\nProofs of both statements rely on the geometric representation of isotonic regression as the slope of the GCM (greatest convex minorant) of the CSD (cumulative sum diagram): see [10], pages 9\u201313 (especially Theorem 1.1). 
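
As an illustration of the definition of an IVAP given earlier in this section, the following minimal Python sketch computes the pair (p0, p1) by brute force, refitting isotonic regression twice for each test score with the pool-adjacent-violators algorithm. It is not the efficient procedure of Proposition 2, and the function names isotonic_fit and ivap_predict are hypothetical names chosen for this sketch, not taken from the authors' code.

```python
from collections import defaultdict


def isotonic_fit(points):
    # Fit increasing isotonic regression to (score, label) pairs using
    # the pool-adjacent-violators algorithm (PAVA).
    # Returns a dict mapping each distinct score to its fitted value.
    groups = defaultdict(list)
    for s, y in points:
        groups[s].append(y)
    blocks = []  # each block: [sum of labels, weight, list of scores]
    for s in sorted(groups):
        ys = groups[s]
        blocks.append([sum(ys), len(ys), [s]])
        # Merge while the previous block mean is at least the current
        # block mean (means compared by cross-multiplication).
        while len(blocks) > 1 and (
            blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]
        ):
            total, weight, scores = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += weight
            blocks[-1][2].extend(scores)
    fitted = {}
    for total, weight, scores in blocks:
        for s in scores:
            fitted[s] = total / weight
    return fitted


def ivap_predict(cal_scores, cal_labels, s):
    # Multiprobability prediction (p0, p1) for a test object with score s:
    # the test score is added once with label 0 and once with label 1.
    cal = list(zip(cal_scores, cal_labels))
    p0 = isotonic_fit(cal + [(s, 0)])[s]
    p1 = isotonic_fit(cal + [(s, 1)])[s]
    return p0, p1
```

For calibration scores 0.1, 0.2, 0.3, 0.4 with labels 0, 1, 0, 1 and test score 0.25, this sketch gives (p0, p1) = (1/3, 2/3). Each call costs O(k log k) because of the sort, which is why the precomputation described in the rest of this section matters for large calibration sets.
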
To make our exposition more self-contained, we define both GCM and CSD below.\n\nFirst we explain how to fit isotonic regression to (s1, y1), . . . , (sk, yk) (without necessarily assuming that si are the calibration scores and yi are the calibration labels, which will be needed to cover the use of isotonic regression in IVAPs). We start from sorting all scores s1, . . . , sk in the increasing order and removing the duplicates. (This is the most computationally expensive step in our calibration procedure, O(k log k) in the worst case.) Let k' \u2264 k be the number of distinct elements among s1, . . . , sk, i.e., the cardinality of the set {s1, . . . , sk}. Define s'_j, j = 1, . . . , k', to be the jth smallest element of {s1, . . . , sk}, so that s'_1 < s'_2 < \u00b7\u00b7\u00b7 < s'_{k'}. Define w_j := |{i = 1, . . . , k : si = s'_j}| to be the number of times s'_j occurs among s1, . . . , sk. Finally, define\n\ny'_j := (1/w_j) \u2211_{i=1,...,k: si=s'_j} yi\n\nto be the average label corresponding to si = s'_j.\nThe CSD of (s1, y1), . . . , (sk, yk) is the set of points\n\nP_i := (\u2211_{j=1}^{i} w_j, \u2211_{j=1}^{i} y'_j w_j), i = 0, 1, . . . , k';\n\nin particular, P_0 = (0, 0). The GCM is the greatest convex minorant of the CSD. The value at s'_i, i = 1, . . . , k', of the isotonic regression fitted to (s1, y1), . . . , (sk, yk) is defined to be the slope of the GCM between \u2211_{j=1}^{i\u22121} w_j and \u2211_{j=1}^{i} w_j; the values at other s are somewhat arbitrary (namely, the value at s \u2208 (s'_i, s'_{i+1}) can be set to anything between the left and right slopes of the GCM at \u2211_{j=1}^{i} w_j) but are never needed in this paper (unlike in the standard use of isotonic regression in machine learning, [8]): e.g., f1(s) is the value of the isotonic regression fitted to a sequence that already contains (s, 1).\n\nProof of Proposition 1. Set S := Y . The statement of the proposition even holds conditionally on knowing the values of (X1, Y1), . . . , (Xm, Ym) and the multiset \u27c5(Xm+1, Ym+1), . . . , (Xl, Yl), (X, Y )\u27c6; this knowledge allows us to compute the scores \u27c5s1, . . . , sk, s\u27c6 of the calibration objects Xm+1, . . . , Xl and the test object X. The only remaining randomness is over the equiprobable permutations of (Xm+1, Ym+1), . . . , (Xl, Yl), (X, Y ); in particular, (s, Y ) is drawn randomly from the multiset \u27c5(s1, Ym+1), . . . , (sk, Yl), (s, Y )\u27c6. It remains to notice that, according to the GCM construction, the average label of the calibration and test observations corresponding to a given value of PS is equal to PS.\n\nThe idea behind computing the pair (f0(s), f1(s)) efficiently is to pre-compute two vectors F^0 and F^1 storing f0(s) and f1(s), respectively, for all possible values of s. Let k' and s'_i be as defined above in the case where s1, . . . , sk are the calibration scores and y1, . . . , yk are the corresponding labels. The vectors F^0 and F^1 are of length k', and for all i = 1, . . . , k' and both \u03b5 \u2208 {0, 1}, F^\u03b5_i is the value of f\u03b5(s) when s = s'_i. Therefore, for all i = 1, . . . 
, k':\n\n\u2022 F^1_i is also the value of f1(s) when s is just to the left of s'_i;\n\u2022 F^0_i is also the value of f0(s) when s is just to the right of s'_i.\n\nSince f0 and f1 can change their values only at the points s'_i, the vectors F^0 and F^1 uniquely determine the functions f0 and f1, respectively. For details of computing F^0 and F^1, see [9].\nRemark. There are several algorithms for performing isotonic regression on a partially, rather than linearly, ordered set: see, e.g., [10], Section 2.3 (although one of the algorithms described in that section, the Minimax Order Algorithm, was later shown to be defective [11, 12]). Therefore, IVAPs (and CVAPs below) can be defined in the situation where scores take values only in a partially ordered set; moreover, Proposition 1 will continue to hold. (For the reader familiar with the notion of Venn predictors we could also add that Venn\u2013Abers predictors will continue to be Venn predictors, which follows from the isotonic regression being the average of the original function over certain equivalence classes.) The importance of partially ordered scores stems from the fact that they enable us to benefit from a possible \u201csynergy\u201d between two or more prediction algorithms [13]. Suppose, e.g., that one prediction algorithm outputs (scalar) scores s^1_1, . . . , s^1_k for the calibration objects x1, . . . , xk and another outputs s^2_1, . . . , s^2_k for the same calibration objects; we would like to use both sets of scores. We could merge the two sets of scores into composite vector scores, si := (s^1_i, s^2_i), i = 1, . . . , k, and then classify a new object x as described earlier using its composite score s := (s^1, s^2), where s^1 and s^2 are the scalar scores computed by the two algorithms and the partial order between composite scores is component-wise. Preliminary results reported in [13] in a related context suggest that the resulting predictor can outperform predictors based on the individual scalar scores. However, we will not pursue this idea further in this paper.\n\n3 Cross Venn\u2013Abers predictors (CVAPs)\n\nA CVAP is just a combination of K IVAPs, where K is the parameter of the algorithm. It is described as Algorithm 1, where IVAP(A, B, x) stands for the output of IVAP applied to A as proper training set, B as calibration set, and x as test object, and GM stands for geometric mean (so that GM(p1) is the geometric mean of p^1_1, . . . , p^K_1 and GM(1 \u2212 p0) is the geometric mean of 1 \u2212 p^1_0, . . . , 1 \u2212 p^K_0). The folds should be of approximately equal size, and usually the training set is split into folds at random (although we choose contiguous folds in Section 6 to facilitate reproducibility). One way to obtain a random assignment of the training observations to folds (see line 1) is to start from a regular array in which the first l1 observations are assigned to fold 1, the following l2 observations are assigned to fold 2, up to the last lK observations which are assigned to fold K, where |lk \u2212 l/K| < 1 for all k, and then to apply a random permutation. Remember that the procedure RANDOMIZE-IN-PLACE ([14], Section 5.3) can do the last step in time O(l). See the next section for a justification of the expression GM(p1)/(GM(1 \u2212 p0) + GM(p1)) used for merging the IVAPs\u2019 outputs.\n\nAlgorithm 1 CVAP(T, x) // cross-Venn\u2013Abers predictor for training set T\n1: split the training set T into K folds T1, . . . , TK\n2: for k \u2208 {1, . . . , K}\n3:     (p^k_0, p^k_1) := IVAP(T \\ Tk, Tk, x)\n4: return GM(p1)/(GM(1 \u2212 p0) + GM(p1))\n\n4 Making probability predictions out of multiprobability ones\n\nIn CVAP (Algorithm 1) we merge the K multiprobability predictions output by K IVAPs. 
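
The merging step in line 4 of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation; the names geometric_mean and merge_log_minimax are hypothetical, and the sketch assumes that the K pairs (p0, p1) have already been produced by the K IVAPs.

```python
import math


def geometric_mean(values):
    # Geometric mean of non-negative numbers; zero if any value is zero.
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def merge_log_minimax(pairs):
    # pairs: list of (p0, p1) multiprobability predictions, one per fold.
    # Returns GM(p1) / (GM(1 - p0) + GM(p1)).
    # Assumption of this sketch: the two geometric means are not both
    # zero; otherwise the division below raises ZeroDivisionError.
    gm_upper = geometric_mean([p1 for _, p1 in pairs])
    gm_lower_compl = geometric_mean([1 - p0 for p0, _ in pairs])
    return gm_upper / (gm_lower_compl + gm_upper)
```

With a single pair the formula reduces to p1/(1 - p0 + p1); for example, the pair (0.2, 0.4) is merged into 0.4/1.2 = 1/3.
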
In this section we design a minimax way for merging them, essentially following [1]. For the log-loss function the result is especially simple, GM(p1)/(GM(1 \u2212 p0) + GM(p1)).\nLet us check that GM(p1)/(GM(1 \u2212 p0) + GM(p1)) is indeed the minimax expression under log loss. Suppose the pairs of lower and upper probabilities to be merged are (p^1_0, p^1_1), . . . , (p^K_0, p^K_1) and the merged probability is p. The extra cumulative loss suffered by p over the correct members p^1_1, . . . , p^K_1 of the pairs when the true label is 1 is\n\nlog(p^1_1/p) + \u00b7\u00b7\u00b7 + log(p^K_1/p), (1)\n\nand the extra cumulative loss of p over the correct members of the pairs when the true label is 0 is\n\nlog((1 \u2212 p^1_0)/(1 \u2212 p)) + \u00b7\u00b7\u00b7 + log((1 \u2212 p^K_0)/(1 \u2212 p)). (2)\n\nEqualizing the two expressions we obtain\n\n(p^1_1 \u00b7\u00b7\u00b7 p^K_1)/p^K = ((1 \u2212 p^1_0) \u00b7\u00b7\u00b7 (1 \u2212 p^K_0))/(1 \u2212 p)^K,\n\nwhich gives the required minimax expression for the merged probability (since (1) is decreasing and (2) is increasing in p).\nFor the computations in the case of the Brier loss function, see [9].\nThe argument above (\u201cconditioned\u201d on the proper training set) is also applicable to IVAP, in which case we need to set K := 1; the probability predictor obtained from an IVAP by replacing (p0, p1) with p := p1/(1 \u2212 p0 + p1) will be referred to as the log-minimax IVAP. (And CVAP is log-minimax by definition.)\n\n5 Comparison with other calibration methods\n\nThe two alternative calibration methods that we consider in this paper are Platt\u2019s [6] and isotonic regression [8].\n\n5.1 Platt\u2019s method\n\nPlatt\u2019s [6] method uses sigmoids to calibrate the scores. 
Platt uses a regularization procedure ensuring that the predictions of his method are always in the range\n\n(1/(k_\u2212 + 2), (k_+ + 1)/(k_+ + 2)),\n\nwhere k_\u2212 is the number of calibration observations labelled 0 and k_+ is the number of calibration observations labelled 1. It is interesting that the predictions output by the log-minimax IVAP are in the same range (except that the end-points are now allowed): see [9].\n\n5.2 Isotonic regression\n\nThere are two standard uses of isotonic regression: we can train the scoring algorithm using what we call a proper training set, and then use the scores of the observations in a disjoint calibration (also called validation) set for calibrating the scores of test objects (as in [5]); alternatively, we can train the scoring algorithm on the full training set and also use the full training set for calibration (it appears that this was done in [8]). In both cases, however, we can expect to get an infinite log loss when the test set becomes large enough (see [9]).\nThe presence of regularization is an advantage of Platt\u2019s method: e.g., it never suffers an infinite loss when using the log loss function. 
There is no standard method of regularization for isotonic regression, and we do not apply one.\n\n6 Empirical studies\n\nThe main loss function (cf., e.g., [15]) that we use in our empirical studies is the log loss\n\n\u03bb_log(p, y) := \u2212 log p if y = 1, and \u03bb_log(p, y) := \u2212 log(1 \u2212 p) if y = 0, (3)\n\nwhere log is binary logarithm, p \u2208 [0, 1] is a probability prediction, and y \u2208 {0, 1} is the true label. Another popular loss function is the Brier loss\n\n\u03bb_Br(p, y) := 4(y \u2212 p)^2. (4)\n\nWe choose the coefficient 4 in front of (y \u2212 p)^2 in (4) and the base 2 of the logarithm in (3) in order for the minimax no-information predictor that always predicts p := 1/2 to suffer loss 1. An advantage of the Brier loss function is that it still makes it possible to compare the quality of prediction in cases when prediction algorithms (such as isotonic regression) give a categorical but wrong prediction (and so are simply regarded as infinitely bad when using log loss).\nThe loss of a probability predictor on a test set will be measured by the arithmetic average of the losses it suffers on the test set, namely, by the mean log loss (MLL) and the mean Brier loss (MBL)\n\nMLL := (1/n) \u2211_{i=1}^{n} \u03bb_log(pi, yi), MBL := (1/n) \u2211_{i=1}^{n} \u03bb_Br(pi, yi), (5)\n\nwhere yi are the test labels and pi are the probability predictions for them. 
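
The loss functions (3) and (4) and the averages (5) translate directly into code. The Python sketch below is illustrative only; the function names log_loss, brier_loss, and mean_loss are hypothetical.

```python
import math


def log_loss(p, y):
    # Log loss (3) with binary logarithm; infinite when the prediction
    # is categorical but wrong (p = 0 with y = 1, or p = 1 with y = 0).
    q = p if y == 1 else 1 - p
    return math.inf if q == 0 else -math.log2(q)


def brier_loss(p, y):
    # Brier loss (4), scaled by 4 so that p = 1/2 always suffers loss 1.
    return 4 * (y - p) ** 2


def mean_loss(loss, predictions, labels):
    # Arithmetic average of the losses over a test set, as in (5).
    total = sum(loss(p, y) for p, y in zip(predictions, labels))
    return total / len(labels)
```

Both losses equal 1 for the no-information prediction p = 1/2, and a single categorical but wrong prediction makes the mean log loss infinite while the mean Brier loss stays bounded by 4.
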
We will not be checking directly whether various calibration methods produce well-calibrated predictions, since it is well known that lack of calibration increases the loss as measured by loss functions such as log loss and Brier loss (see, e.g., [16] for the most standard decomposition of the latter into the sum of the calibration error and refinement error).\nIn this section we compare log-minimax IVAPs (i.e., IVAPs whose outputs are replaced by probability predictions, as explained in Section 4) and CVAPs with Platt\u2019s method [6] and the standard method [8] based on isotonic regression; the latter two will be referred to as \u201cPlatt\u201d and \u201cIsotonic\u201d in our tables and figures. For both IVAPs and CVAPs we use the log-minimax procedure (the Brier-minimax procedure leads to virtually identical empirical results). We use the same underlying algorithms as in [1], namely J48 decision trees (abbreviated to \u201cJ48\u201d), J48 decision trees with bagging (\u201cJ48 bagging\u201d), logistic regression (sometimes abbreviated to \u201clogistic\u201d), naive Bayes, neural networks, and support vector machines (SVM), as implemented in Weka [17] (University of Waikato, New Zealand). The underlying algorithms (except for SVM) produce scores in the interval [0, 1], which can be used directly as probability predictions (referred to as \u201cUnderlying\u201d in our tables and figures) or can be calibrated using the methods of [6, 8] or the methods proposed in this paper (\u201cIVAP\u201d or \u201cCVAP\u201d in the tables and figures).\nFor illustrating our results in this paper we use the adult data set available from the UCI repository [18] (this is the main data set used in [6] and one of the data sets used in [8]); however, the picture that we observe is typical for other data sets as well (cf. [9]). 
We use the original split of the data set into a training set of Ntrain = 32,561 observations and a test set of Ntest = 16,281 observations. The results of applying the four calibration methods (plus the vacuous one, corresponding to just using the underlying algorithm) to the six underlying algorithms for this data set are shown in Figure 1. The six top plots report results for the log loss (namely, MLL, as defined in (5)) and the six bottom plots for the Brier loss (namely, MBL). The underlying algorithms are given in the titles of the plots and the calibration methods are represented by different line styles, as explained in the legends. The horizontal axis is labelled by the ratio of the size of the proper training set to that of the calibration set (except for the label all, which will be explained later); in particular, in the case of CVAPs it is labelled by the number K \u2212 1, one less than the number of folds. In the case of CVAPs, the training set is split into K equal (or as close to being equal as possible) contiguous folds: the first \u2308Ntrain/K\u2309 training observations are included in the first fold, the next \u2308Ntrain/K\u2309 (or \u230aNtrain/K\u230b) in the second fold, etc. (first \u2308\u00b7\u2309 and then \u230a\u00b7\u230b is used unless Ntrain is divisible by K). In the case of the other calibration methods, we used the first \u2308((K \u2212 1)/K) Ntrain\u2309 training observations as the proper training set (used for training the scoring algorithm) and the rest of the training observations are used as the calibration set.\nIn the case of log loss, isotonic regression often suffers infinite losses, which is indicated by the absence of the round marker for isotonic regression; e.g., only one of the log losses for SVM is finite. 
We are not trying to use ad hoc solutions, such as clipping predictions to the interval [\u03b5, 1 \u2212 \u03b5] for a small \u03b5 > 0, since we are also using the bounded Brier loss function. The CVAP lines tend to be at the bottom in all plots; experiments with other data sets also confirm this.\nThe column all in the plots of Figure 1 refers to using the full training set as both the proper training set and calibration set. (In our official definition of IVAP we require that the last two sets be disjoint, but in this section we continue to refer to IVAPs modified in this way simply as IVAPs; in [1], such prediction algorithms were referred to as SVAPs, simplified Venn\u2013Abers predictors.) Using the full training set as both the proper training set and calibration set might appear naive (and is never used in the extensive empirical study [5]), but it often leads to good empirical results on larger data sets. However, it can also lead to very poor results, as in the case of \u201cJ48 bagging\u201d (for IVAP, Platt, and Isotonic), the underlying algorithm that achieves the best performance in Figure 1.\nA natural question is whether CVAPs perform better than the alternative calibration methods in Figure 1 (and our other experiments) because of applying cross-over (in moving from IVAP to CVAP) or because of the extra regularization used in IVAPs. The first reason is undoubtedly important for both loss functions and the second for the log loss function. The second reason plays a smaller role for Brier loss for relatively large data sets (in the lower half of Figure 1 the curves for Isotonic and IVAP are very close to each other), but IVAPs are consistently better for smaller data sets even when using Brier loss. 
In Tables 1 and 2 we apply the four calibration methods and six underlying algorithms to a much smaller training set, namely to the first 5,000 observations of the adult data set as the new training set, following [5]; the first 4,000 training observations are used as the proper training set, the following 1,000 training observations as the calibration set, and all other observations (the remaining training and all test observations) are used as the new test set. The results are shown in Table 1 for log loss and Table 2 for Brier loss. They are consistently better for IVAP than for IR (isotonic regression). Results for nine very small data sets are given in Tables 1 and 2 of [1], where the results for IVAP (with the full training set used as both proper training and calibration sets, labelled \u201cSVA\u201d in the tables in [1]) are consistently (in 52 cases out of the 54 using Brier loss) better, usually significantly better, than for isotonic regression (referred to as DIR in the tables in [1]).\n\nFigure 1: The log and Brier losses of the four calibration methods applied to the six prediction algorithms on the adult data set. The numbers on the horizontal axis are ratios m/k of the size of the proper training set to the size of the calibration set; in the case of CVAPs, they can also be expressed as K \u2212 1, where K is the number of folds (therefore, column 4 corresponds to the standard choice of 5 folds in the method of cross-validation). Missing curves or points on curves mean that the corresponding values either are too big and would squeeze unacceptably the interesting parts of the plot if shown or are infinite (such as many results for isotonic regression under log loss).\n\nThe following information might help the reader in reproducing our results (in addition to our code being publicly available [9]). 
For each of the standard prediction algorithms within Weka that we use, we optimise the parameters by minimising the Brier loss on the calibration set, apart from the column labelled all. (We cannot use the log loss since it is often infinite in the case of isotonic regression.) We then use the trained algorithm to generate the scores for the calibration and test sets, which allows us to compute probability predictions using Platt\u2019s method, isotonic regression, IVAP, and CVAP. All the scores apart from SVM are already in the [0, 1] range and can be used as probability predictions. Most of the parameters are set to their default values, and the only parameters that are optimised are C (pruning confidence) for J48 and J48 bagging, R (ridge) for logistic regression, L (learning rate) and M (momentum) for neural networks (MultilayerPerceptron), and C (complexity constant) for SVM (SMO, with the linear kernel); naive Bayes does not involve any parameters. Notice that none of these parameters are \u201chyperparameters\u201d, in that they do not control the flexibility of the fitted prediction rule directly; this allows us to optimize the parameters on the training set for the all column. In the case of CVAPs, we optimise the parameters by minimising the cumulative Brier loss over all folds (so that the same parameters are used for all folds). To apply Platt\u2019s method to calibrate the scores generated by the underlying algorithms we use logistic regression, namely the function mnrfit within MATLAB\u2019s Statistics toolbox. For isotonic regression calibration we use the implementation of the PAVA in the R package fdrtool (namely, the function monoreg).\nFor further experimental results, see [9].\n\nTable 1: The log loss for the four calibration methods and six underlying algorithms for a small subset of the adult data set\n\nalgorithm | Platt | IR | IVAP | CVAP\nJ48 | 0.5226 | \u221e | 0.5117 | 0.5102\nJ48 bagging | 0.4949 | \u221e | 0.4733 | 0.4602\nlogistic | 0.5111 | \u221e | 0.4981 | 0.4948\nnaive Bayes | 0.5534 | \u221e | 0.4839 | 0.4747\nneural networks | 0.5175 | \u221e | 0.5023 | 0.4805\nSVM | 0.5221 | \u221e | 0.5015 | 0.4997\n\nTable 2: The analogue of Table 1 for the Brier loss\n\nalgorithm | Platt | IR | IVAP | CVAP\nJ48 | 0.4463 | 0.4378 | 0.4370 | 0.4368\nJ48 bagging | 0.4225 | 0.4153 | 0.4123 | 0.3990\nlogistic | 0.4470 | 0.4417 | 0.4377 | 0.4342\nnaive Bayes | 0.4670 | 0.4329 | 0.4311 | 0.4227\nneural networks | 0.4525 | 0.4574 | 0.4440 | 0.4234\nSVM | 0.4550 | 0.4450 | 0.4408 | 0.4375\n\n7 Conclusion\n\nThis paper introduces two new computationally efficient algorithms for probabilistic prediction, IVAP, which can be regarded as a regularised form of the calibration method based on isotonic regression, and CVAP, which is built on top of IVAP using the idea of cross-validation. 
Whereas\nIVAPs are automatically perfectly calibrated, the advantage of CVAPs is in their good empirical\nperformance.\nThis paper does not study empirically upper and lower probabilities produced by IVAPs and CVAPs,\nwhereas the distance between them provides information about the reliability of the merged proba-\nbility prediction. Finding interesting ways of using this extra information is one of the directions of\nfurther research.\n\nAcknowledgments\n\nWe are grateful to the conference reviewers for numerous helpful comments and observations, to\nVladimir Vapnik for sharing his ideas about exploiting synergy between different learning algo-\nrithms, and to participants in the conference Machine Learning: Prospects and Applications (Octo-\nber 2015, Berlin) and NIPS 2015 (Montr\u00b4eal, Canada) for their questions and comments. The \ufb01rst\nauthor has been partially supported by EPSRC (grant EP/K033344/1) and AFOSR (grant \u201cSemantic\nCompletions\u201d). The second and third authors are grateful to their home institutions for funding their\ntrips to Montr\u00b4eal.\n\n8\n\n\fReferences\n[1] Vladimir Vovk and Ivan Petej. Venn\u2013Abers predictors. In Nevin L. Zhang and Jin Tian, editors,\nProceedings of the Thirtieth Conference on Uncertainty in Arti\ufb01cial Intelligence, pages 829\u2013\n838, Corvallis, OR, 2014. AUAI Press.\n\n[2] Miriam Ayer, H. Daniel Brunk, George M. Ewing, W. T. Reid, and Edward Silverman. An\nempirical distribution function for sampling with incomplete information. Annals of Mathe-\nmatical Statistics, 26:641\u2013647, 1955.\n\n[3] Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. Smooth isotonic regres-\nsion: a new method to calibrate predictive models. AMIA Summits on Translational Science\nProceedings, 2011:16\u201320, 2011.\n\n[4] Antonis Lambrou, Harris Papadopoulos, Ilia Nouretdinov, and Alex Gammerman. 
Reliable probability estimates based on support vector machines for large multiclass datasets. In Lazaros Iliadis, Ilias Maglogiannis, Harris Papadopoulos, Kostas Karatzas, and Spyros Sioutas, editors, Proceedings of the AIAI 2012 Workshop on Conformal Prediction and its Applications, volume 382 of IFIP Advances in Information and Communication Technology, pages 182\u2013191, Berlin, 2012. Springer.\n\n[5] Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the Twenty Third International Conference on Machine Learning, pages 161\u2013168, New York, 2006. ACM.\n\n[6] John C. Platt. Probabilities for SV machines. In Alexander J. Smola, Peter L. Bartlett, Bernhard Sch\u00f6lkopf, and Dale Schuurmans, editors, Advances in Large Margin Classifiers, pages 61\u201374. MIT Press, 2000.\n\n[7] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, New York, 2005.\n\n[8] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Carla E. Brodley and Andrea P. Danyluk, editors, Proceedings of the Eighteenth International Conference on Machine Learning, pages 609\u2013616, San Francisco, CA, 2001. Morgan Kaufmann.\n\n[9] Vladimir Vovk, Ivan Petej, and Valentina Fedorova. Large-scale probabilistic predictors with and without guarantees of validity. Technical Report arXiv:1511.00213 [cs.LG], arXiv.org e-Print archive, November 2015. A full version of this paper.\n\n[10] Richard E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. Daniel Brunk. Statistical Inference under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, London, 1972.\n\n[11] Chu-In Charles Lee. The Min-Max algorithm and isotonic regression. Annals of Statistics, 11:467\u2013477, 1983.\n\n[12] Gordon D. Murray. Nonconvergence of the minimax order algorithm.
Biometrika, 70:490\u2013491, 1983.\n\n[13] Vladimir N. Vapnik. Intelligent learning: Similarity control and knowledge transfer. Talk at the 2015 Yandex School of Data Analysis Conference Machine Learning: Prospects and Applications, 6 October 2015, Berlin.\n\n[14] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, third edition, 2009.\n\n[15] Vladimir Vovk. The fundamental nature of the log loss function. In Lev D. Beklemishev, Andreas Blass, Nachum Dershowitz, Berndt Finkbeiner, and Wolfram Schulte, editors, Fields of Logic and Computation II: Essays Dedicated to Yuri Gurevich on the Occasion of His 75th Birthday, volume 9300 of Lecture Notes in Computer Science, pages 307\u2013318, Cham, 2015. Springer.\n\n[16] Allan H. Murphy. A new vector partition of the probability score. Journal of Applied Meteorology, 12:595\u2013600, 1973.\n\n[17] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11:10\u201318, 2011.\n\n[18] A. Frank and A. Asuncion. UCI machine learning repository, 2015.", "award": [], "sourceid": 585, "authors": [{"given_name": "Vladimir", "family_name": "Vovk", "institution": "Royal Holloway, Univ of London"}, {"given_name": "Ivan", "family_name": "Petej", "institution": "Royal Holloway, University of London"}, {"given_name": "Valentina", "family_name": "Fedorova", "institution": "Yandex"}]}