{"title": "Self-calibrating Probability Forecasting", "book": "Advances in Neural Information Processing Systems", "page_first": 1133, "page_last": 1140, "abstract": "", "full_text": "Self-calibrating Probability Forecasting\n\nVladimir Vovk\n\nComputer Learning Research Centre\n\nDepartment of Computer Science\n\nRoyal Holloway, University of London\n\nEgham, Surrey TW20 0EX, UK\n\nvovk@cs.rhul.ac.uk\n\nGlenn Shafer\n\nRutgers School of Business\nNewark and New Brunswick\n\n180 University Avenue\nNewark, NJ 07102, USA\n\ngshafer@andromeda.rutgers.edu\n\nIlia Nouretdinov\n\nComputer Learning Research Centre\n\nDepartment of Computer Science\n\nRoyal Holloway, University of London\n\nEgham, Surrey TW20 0EX, UK\n\nilia@cs.rhul.ac.uk\n\nAbstract\n\nIn the problem of probability forecasting the learner\u2019s goal is to output,\ngiven a training set and a new object, a suitable probability measure on\nthe possible values of the new object\u2019s label. An on-line algorithm for\nprobability forecasting is said to be well-calibrated if the probabilities\nit outputs agree with the observed frequencies. We give a natural non-\nasymptotic formalization of the notion of well-calibratedness, which we\nthen study under the assumption of randomness (the object/label pairs\nare independent and identically distributed). It turns out that, although\nno probability forecasting algorithm is automatically well-calibrated in\nour sense, there exists a wide class of algorithms for \u201cmultiprobability\nforecasting\u201d (such algorithms are allowed to output a set, ideally very\nnarrow, of probability measures) which satisfy this property; we call the\nalgorithms in this class \u201cVenn probability machines\u201d. 
Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.\n\n1 Introduction\n\nWe are interested in the on-line version of the problem of probability forecasting: we observe pairs of objects and labels sequentially, and after observing the nth object xn the goal is to give a probability measure pn for its label; as soon as pn is output, the label yn of xn is disclosed and can be used for computing future probability forecasts. A good review of early work in this area is Dawid [1]. In this introductory section we will assume that yn \u2208 {0, 1}; we can then take pn to be a real number from the interval [0, 1] (the probability that yn = 1 given xn); our exposition here will be very informal.\n\nThe standard view ([1], pp. 213\u2013216) is that the quality of probability forecasting systems has two components: \u201creliability\u201d and \u201cresolution\u201d. At the crudest level, reliability requires that the forecasting system should not lie, and resolution requires that it should say something useful. To be slightly more precise, consider the \ufb01rst n forecasts pi and the actual labels yi. The most basic test is to compare the overall average forecast probability pn := n\u22121 \u2211_{i=1}^{n} pi with the overall relative frequency yn := n\u22121 \u2211_{i=1}^{n} yi of 1s among the yi. If pn \u2248 yn, the forecasts are \u201cunbiased in the large\u201d. A more re\ufb01ned test would look at the subset of i for which pi is close to a given value p\u2217, and compare the relative frequency of yi = 1 in this subset, say yn(p\u2217), with p\u2217. If\n\nyn(p\u2217) \u2248 p\u2217 for all p\u2217, (1)\n\nthe forecasts are \u201cunbiased in the small\u201d, \u201creliable\u201d, \u201cvalid\u201d, or \u201cwell-calibrated\u201d; in later sections, we will use \u201cwell-calibrated\u201d, or just \u201ccalibrated\u201d, as a technical term. Forecasting systems that pass this test at least get the frequencies right; in this sense they do not lie.\n\nIt is easy to see that there are reliable forecasting systems that are virtually useless. For example, the de\ufb01nition of reliability does not require that the forecasting system pay any attention to the objects xi. In another popular example, the labels follow the pattern\n\nyi = 1 if i is odd, yi = 0 otherwise.\n\nThe forecasts pi = 0.5 are reliable, at least asymptotically (0.5 is the right relative frequency), but not as useful as p1 = 1, p2 = 0, . . . ; the \u201cresolution\u201d (which we do not de\ufb01ne here) of the latter forecasts is better.\n\nIn this paper we construct forecasting systems that are automatically reliable. To achieve this, we allow our prediction algorithms to output sets of probability measures Pn instead of single measures pn; typically the sets Pn will be small (see \u00a75).\n\nThis paper develops the approach of [2\u20134], which show that it is possible to produce valid, asymptotically optimal, and practically useful p-values; the p-values can then be used for region prediction. Disadvantages of p-values, however, are that their interpretation is less direct than that of probabilities and that they are easy to confuse with probabilities; some authors have even objected to any use of p-values (see, e.g., [5]). 
In this paper we use the methodology developed in the previous papers to produce valid probabilities rather than p-values.\n\nAll proofs are omitted; they can be found in [6].\n\n2 Probability forecasting and calibration\n\nWith this section we begin the rigorous exposition. Let P(Y) be the set of all probability measures on a measurable space Y. We use the following protocol in this paper:\n\nMULTIPROBABILITY FORECASTING\nPlayers: Reality, Forecaster\nProtocol:\nFOR n = 1, 2, . . . :\n  Reality announces xn \u2208 X.\n  Forecaster announces Pn \u2286 P(Y).\n  Reality announces yn \u2208 Y.\n\nIn this protocol, Reality generates examples zn = (xn, yn) \u2208 Z := X \u00d7 Y consisting of two parts, objects xn and labels yn. After seeing the object xn, Forecaster is required to output a prediction for the label yn. The usual probability forecasting protocol requires that Forecaster output a probability measure; we relax this requirement by allowing him to output a family of probability measures (and we are interested in the case where the families Pn become smaller and smaller as n grows). It can be shown (we omit the proof and even the precise statement) that it is impossible to achieve automatic well-calibratedness, in our \ufb01nitary sense, in the probability forecasting protocol.\n\nIn this paper we make the simplifying assumption that the label space Y is \ufb01nite; in many informal explanations it will be assumed binary, Y = {0, 1}. To avoid unnecessary technicalities, we will also assume that the families Pn chosen by Forecaster are \ufb01nite and have no more than K elements; they will be represented by a list of length K (elements in the list can repeat). A probability machine is a measurable strategy for Forecaster in our protocol, where at each step he is required to output a sequence of K probability measures.\n\nThe problem of calibration is usually treated in an asymptotic framework. 
Typical asymptotic results, however, do not say anything about \ufb01nite data sequences; therefore, in this paper we will only be interested in the non-asymptotic notion of calibration. All needed formal de\ufb01nitions will be given, but space limitations prevent us from including detailed explanations and examples, which can be found in [6].\n\nLet us \ufb01rst limit the duration of the game, replacing n = 1, 2, . . . in the multiprobability forecasting protocol by n = 1, . . . , N for a \ufb01nite horizon N. It is clear that, regardless of formalization, we cannot guarantee that miscalibration, in the sense of (1) being violated, will never happen: for typical probability measures, everything can happen, perhaps with a small probability. The idea of our de\ufb01nition is: a prediction algorithm is well-calibrated if any evidence of miscalibration translates into evidence against the assumption of randomness. Therefore, we \ufb01rst need to de\ufb01ne ways of testing calibration and randomness; this will be done following [7].\n\nA game N-martingale is a function M on sequences of the form x1, p1, y1, . . . , xn, pn, yn, where n = 0, . . . , N, xi \u2208 X, pi \u2208 P(Y), and yi \u2208 Y, that satis\ufb01es\n\nM(x1, p1, y1, . . . , xn\u22121, pn\u22121, yn\u22121) = \u222b_Y M(x1, p1, y1, . . . , xn, pn, y) pn(dy)\n\nfor all x1, p1, y1, . . . , xn, pn, n = 1, . . . , N. A calibration N-martingale is a nonnegative game N-martingale that is invariant under permutations:\n\nM(x1, p1, y1, . . . , xN, pN, yN) = M(x\u03c0(1), p\u03c0(1), y\u03c0(1), . . . , x\u03c0(N), p\u03c0(N), y\u03c0(N))\n\nfor any x1, p1, y1, . . . , xN, pN, yN and any permutation \u03c0 : {1, . . . , N} \u2192 {1, . . . , N}. To cover the multiprobability forecasting protocol, we extend the domain of de\ufb01nition of a calibration N-martingale M from sequences of the form x1, p1, y1, . . . , xn, pn, yn, where p1, . . . , pn are single probability measures on Y, to sequences of the form x1, P1, y1, . . . , xn, Pn, yn, where P1, . . . , Pn are sets of probability measures on Y, by\n\nM(x1, P1, y1, . . . , xn, Pn, yn) := inf_{p1\u2208P1,...,pn\u2208Pn} M(x1, p1, y1, . . . , xn, pn, yn).\n\nA QN-martingale, where Q is a probability measure on Z, is a function S on sequences of the form x1, y1, . . . , xn, yn, where n = 0, . . . , N, xi \u2208 X, and yi \u2208 Y, that satis\ufb01es\n\nS(x1, y1, . . . , xn\u22121, yn\u22121) = \u222b_Z S(x1, y1, . . . , xn\u22121, yn\u22121, x, y) Q(dx, dy)\n\nfor all x1, y1, . . . , xn\u22121, yn\u22121, n = 1, . . . , N.\n\nIf a nonnegative QN-martingale S starts with S(\u25a1) = 1 and ends with S(x1, y1, . . . , yN) very large, then we may reject Q as the probability measure generating individual examples (xn, yn). This interpretation is supported by Doob\u2019s inequality. Analogously, if a game N-martingale M starts with M(\u25a1) = 1 and ends with M(x1, P1, y1, . . . , yN) very large, then we may reject the hypothesis that each Pn contains the true probability measure for yn. If M is a calibration N-martingale, this event is interpreted as evidence of miscalibration. (The restriction to calibration N-martingales is motivated by the fact that (1) is invariant under permutations.)\n\nWe call a probability machine F N-calibrated if for any probability measure Q on Z and any nonnegative calibration N-martingale M with M(\u25a1) = 1, there exists a QN-martingale S with S(\u25a1) = 1 such that\n\nM(x1, F(x1), y1, . . . , xN, F(x1, y1, . . . , xN), yN) \u2264 S(x1, y1, . . . , xN, yN)\n\nfor all x1, y1, . . . , xN, yN. We say that F is \ufb01nitarily calibrated if it is N-calibrated for each N.\n\n3 Self-calibrating probability forecasting\n\nNow we will describe a general algorithm for multiprobability forecasting. Let N be the set of all positive integers. 
A sequence of measurable functions An : Zn \u2192 Nn, n = 1, 2, . . . , is called a taxonomy if, for any n \u2208 N, any permutation \u03c0 of {1, . . . , n}, any (z1, . . . , zn) \u2208 Zn, and any (\u03b11, . . . , \u03b1n) \u2208 Nn,\n\n(\u03b11, . . . , \u03b1n) = An(z1, . . . , zn) =\u21d2 (\u03b1\u03c0(1), . . . , \u03b1\u03c0(n)) = An(z\u03c0(1), . . . , z\u03c0(n)).\n\nIn other words,\n\nAn : (z1, . . . , zn) \u21a6 (\u03b11, . . . , \u03b1n) (2)\n\nis a taxonomy if every \u03b1i is determined by the bag1 \u27e8z1, . . . , zn\u27e9 and zi. We let |B| stand for the number of elements in a set B. The Venn probability machine associated with (An) is the probability machine which outputs the following K = |Y| probability measures py, y \u2208 Y, at the nth step: complement the new object xn by the postulated label y; consider the division of \u27e8z1, . . . , zn\u27e9, where zn is understood (only for the purpose of this de\ufb01nition) to be (xn, y), into groups (formally, bags) according to the values of An (i.e., zi and zj are assigned to the same group if and only if \u03b1i = \u03b1j, where the \u03b1s are de\ufb01ned by (2)); \ufb01nd the empirical distribution py \u2208 P(Y) of the labels in the group G containing the nth example zn = (xn, y):\n\npy({y\u2032}) := |{(x\u2217, y\u2217) \u2208 G : y\u2217 = y\u2032}| / |G|.\n\nA Venn probability machine (VPM) is the Venn probability machine associated with some taxonomy.\n\nTheorem 1 Any Venn probability machine is \ufb01nitarily calibrated.\n\n1By \u201cbag\u201d we mean a collection of elements, not necessarily distinct. \u201cBag\u201d and \u201cmultiset\u201d are synonymous, but we prefer the former term in order not to overload the pre\ufb01x \u201cmulti\u201d.\n\nIt is clear that VPM depends on the taxonomy only through the way it splits the examples z1, . . . 
, zn into groups; therefore, we may specify only the latter when constructing speci\ufb01c VPMs.\n\nRemark The notion of VPM is a version of the Transductive Con\ufb01dence Machine (TCM) introduced in [8] and [9], and Theorem 1 is a version of Theorem 1 in [2].\n\n4 Discussion of the Venn probability machine\n\nIn this somewhat informal section we will discuss the intuitions behind VPM, considering only the binary case Y = {0, 1} and considering the probability forecasts pi to be elements of [0, 1] rather than P({0, 1}), as in \u00a71. We start with the almost trivial Bernoulli case, where the objects xi are absent,2 and our goal is to predict, at each step n = 1, 2, . . . , the new label yn given the previous labels y1, . . . , yn\u22121. The most naive probability forecast is pn = k/(n \u2212 1), where k is the number of 1s among the \ufb01rst n \u2212 1 labels. (Often \u201cregularized\u201d forms of k/(n \u2212 1), such as Laplace\u2019s rule of succession (k + 1)/(n + 1), are used.)\n\nIn the Bernoulli case there is only one natural VPM: the multiprobability forecast for yn is {k/n, (k + 1)/n}. Indeed, since there are no objects xn, it is natural to take the one-element taxonomy An at each step, and this produces the VPM Pn = {k/n, (k + 1)/n}. It is clear that the diameter 1/n of Pn for this VPM is the smallest achievable. (By the diameter of a set we mean the supremum of distances between its points.)\n\nNow let us consider the case where the xn are present. The probability forecast k/(n \u2212 1) for yn will usually be too crude, since the known population z1, . . . , zn\u22121 may be very heterogeneous. A reasonable statistical forecast would take into account only objects xi that are similar, in a suitable sense, to xn. A simple modi\ufb01cation of the Bernoulli forecast k/(n \u2212 1) is as follows:\n\n1. Split the available objects x1, . . . , xn into a number of groups.\n\n2. 
Output k\u2032/n\u2032 as the predicted probability that yn = 1, where n\u2032 is the number of objects among x1, . . . , xn\u22121 in the same group as xn and k\u2032 is the number of objects among those n\u2032 that are labeled as 1.\n\nAt the \ufb01rst stage, a delicate balance has to be struck between two contradictory goals: the groups should be as large as possible (to have a reasonable sample size for estimating probabilities), and the groups should be as homogeneous as possible. This problem is sometimes referred to as the \u201creference class problem\u201d; according to K\u0131l\u0131n\u00e7 [10], John Venn was the \ufb01rst to formulate and analyze this problem with due philosophical depth.\n\nThe procedure offered in this paper is a simple modi\ufb01cation of the standard procedure described in the previous paragraph:\n\n0. Consider the two possible completions of the known data\n\n(z1, . . . , zn\u22121, xn) = ((x1, y1), . . . , (xn\u22121, yn\u22121), xn):\n\nin one (called the 0-completion) xn is assigned label 0, and in the other (called the 1-completion) xn is assigned label 1.\n\n1. In each completion, split all examples z1, . . . , zn\u22121, (xn, y) into a number of groups, so that the split does not depend on the order of examples (y = 0 for the 0-completion and y = 1 for the 1-completion).\n\n2Formally, this corresponds in our protocol to the situation where |X| = 1, and so the xn, although nominally present, do not carry any information.\n\n2. In each completion, output k\u2032/n\u2032 as the predicted probability that yn = 1, where n\u2032 is the number of examples among z1, . . . , zn\u22121, (xn, y) in the same group as (xn, y) and k\u2032 is the number of examples among those n\u2032 that are labeled as 1.\n\nIn this way, we will have not one but two predicted probabilities that yn = 1; but in practically interesting cases we can hope that these probabilities will be close to each other (see the next section).\n\nVenn\u2019s reference class problem reappears in our procedure as the problem of avoiding over- and under\ufb01tting. A taxonomy with too many groups means over\ufb01tting; it is punished by the large diameter of the multiprobability forecast (importantly, this is visible, unlike in the standard approaches). Too few groups means under\ufb01tting (and poor resolution).\n\nImportant advantages of our procedure over the naive procedure are: our procedure is self-calibrating; there exists an asymptotically optimal VPM (see \u00a76); and we can use labels in splitting examples into groups (this will be used in the next section).\n\n5 Experiments\n\nIn this section, we will report the results for a natural taxonomy applied to the well-known USPS data set of hand-written digits; this taxonomy is inspired by the 1-Nearest Neighbor algorithm. First we describe the taxonomy, and then the way in which we report the results for the VPM associated with this taxonomy.\n\nSince the data set is relatively small (9298 examples in total), we have to use a crude taxonomy: two examples are assigned to the same group if their nearest neighbors have the same label; therefore, the taxonomy consists of 10 groups. The distance between two examples is de\ufb01ned as the Euclidean distance between their objects (which are 16 \u00d7 16 matrices of pixels, represented as points in R^256).\n\nThe algorithm processes the nth object xn as follows. First it creates the 10 \u00d7 10 matrix A whose entry Ai,j, i, j = 0, . . . 
, 9, is computed by assigning i to xn as label and \ufb01nding the fraction of examples labeled j among the examples in the bag \u27e8z1, . . . , zn\u22121, (xn, i)\u27e9 belonging to the same group as (xn, i). The quality of a column of this matrix is its minimum entry. Choose a column (called the best column) with the highest quality; let the best column be jbest. Output jbest as the prediction and output\n\n[min_{i=0,...,9} Ai,jbest , max_{i=0,...,9} Ai,jbest]\n\nas the interval for the probability that this prediction is correct. If the latter interval is [a, b], the complementary interval [1 \u2212 b, 1 \u2212 a] is called the error probability interval. In Figure 1 we show the following three curves: the cumulative error curve En := \u2211_{i=1}^{n} erri, where erri = 1 if an error (in the sense jbest \u2260 yi) is made at step i and erri = 0 otherwise; the cumulative lower error probability curve Ln := \u2211_{i=1}^{n} li; and the cumulative upper error probability curve Un := \u2211_{i=1}^{n} ui, where [li, ui] is the error probability interval output by the algorithm for the label yi. The values En, Ln, and Un are plotted against n. The plot con\ufb01rms that the error probability intervals are calibrated.\n\n6 Universal Venn probability machine\n\nThe following result asserts the existence of a universal VPM. Such a VPM can be constructed quite easily using the histogram approach to probability estimation [11].\n\nFigure 1: On-line performance of the 1-Nearest Neighbor VPM on the USPS data set (9298 hand-written digits, randomly permuted). The dashed line shows the cumulative number of errors En and the solid ones the cumulative upper and lower error probability curves Un and Ln. The mean error EN/N is 0.0425 and the mean probability interval (1/N)[LN, UN] is [0.0407, 0.0419], where N = 9298 is the size of the data set. 
This \ufb01gure is not signi\ufb01cantly affected by statistical variation (due to the random choice of the permutation of the data set).\n\nTheorem 2 Suppose X is a Borel space. There exists a VPM such that, if the examples are generated from Q\u221e,\n\nsup_{p\u2208Pn} \u03c1(Q(\u00b7 | xn), p) \u2192 0 (n \u2192 \u221e)\n\nin probability, where \u03c1 is the variation distance, Q(\u00b7 | xn) is a \ufb01xed version of the regular conditional probabilities for yn given xn, and Pn are the multiprobabilities produced by the VPM.\n\nThis theorem shows that not only are all VPMs reliable but some of them also have asymptotically optimal resolution. The version of this result for p-values was proved in [4].\n\n7 Comparisons\n\nIn this section we brie\ufb02y and informally compare this paper\u2019s approach to standard approaches in machine learning.\n\nThe two most important approaches to the analysis of machine-learning algorithms are Bayesian learning theory and PAC theory (the recent mixture, the PAC-Bayesian theory, is part of PAC theory in its assumptions). This paper is in a way intermediate between Bayesian learning (no empirical justi\ufb01cation for probabilities is required) and PAC learning (the goal is to \ufb01nd or bound the true probability of error, not just to output calibrated probabilities). An important difference of our approach from the PAC approach is that we are interested in the conditional probabilities for the label given the new object, whereas PAC theory (even in its \u201cdata-dependent\u201d version, as in [12\u201314]) tries to estimate the unconditional probability of error.\n\nAcknowledgments\n\nWe are grateful to Phil Dawid for a useful discussion and to the anonymous referees for suggestions for improvement. This work was partially supported by EPSRC (grant GR/R46670/01), BBSRC (grant 111/BIO14428), and EU (grant IST-1999-10226).\n\nReferences\n\n[1] A. 
Philip Dawid. Probability forecasting. In Samuel Kotz, Norman L. Johnson, and Campbell B. Read, editors, Encyclopedia of Statistical Sciences, volume 7, pages 210\u2013218. Wiley, New York, 1986.\n\n[2] Vladimir Vovk. On-line Con\ufb01dence Machines are well-calibrated. In Proceedings of the Forty Third Annual Symposium on Foundations of Computer Science, pages 187\u2013196, Los Alamitos, CA, 2002. IEEE Computer Society.\n\n[3] Vladimir Vovk, Ilia Nouretdinov, and Alex Gammerman. Testing exchangeability on-line. In Tom Fawcett and Nina Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 768\u2013775, Menlo Park, CA, 2003. AAAI Press.\n\n[4] Vladimir Vovk. Universal well-calibrated algorithm for on-line classi\ufb01cation. In Bernhard Sch\u00f6lkopf and Manfred K. Warmuth, editors, Learning Theory and Kernel Machines: Sixteenth Annual Conference on Learning Theory and Seventh Kernel Workshop, volume 2777 of Lecture Notes in Arti\ufb01cial Intelligence, pages 358\u2013372, Berlin, 2003. Springer.\n\n[5] James O. Berger and Mohan Delampady. Testing precise hypotheses (with discussion). Statistical Science, 2:317\u2013352, 1987.\n\n[6] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, New York, to appear.\n\n[7] Glenn Shafer and Vladimir Vovk. Probability and Finance: It\u2019s Only a Game! Wiley, New York, 2001.\n\n[8] Craig Saunders, Alex Gammerman, and Vladimir Vovk. Transduction with con\ufb01dence and credibility. In Proceedings of the Sixteenth International Joint Conference on Arti\ufb01cial Intelligence, pages 722\u2013726, 1999.\n\n[9] Vladimir Vovk, Alex Gammerman, and Craig Saunders. Machine-learning applications of algorithmic randomness. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 444\u2013453, San Francisco, CA, 1999. Morgan Kaufmann.\n\n[10] Berna E. K\u0131l\u0131n\u00e7. 
The reception of John Venn\u2019s philosophy of probability. In Vincent F. Hendricks, Stig Andur Pedersen, and Klaus Frovin J\u00f8rgensen, editors, Probability Theory: Philosophy, Recent History and Relations to Science, pages 97\u2013121. Kluwer, Dordrecht, 2001.\n\n[11] Luc Devroye, L\u00e1szl\u00f3 Gy\u00f6r\ufb01, and G\u00e1bor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.\n\n[12] Nick Littlestone and Manfred K. Warmuth. Relating data compression and learnability. Technical report, University of California, Santa Cruz, 1986.\n\n[13] John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44:1926\u20131940, 1998.\n\n[14] David A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230\u2013234, New York, 1998. Association for Computing Machinery.\n", "award": [], "sourceid": 2462, "authors": [{"given_name": "Vladimir", "family_name": "Vovk", "institution": null}, {"given_name": "Glenn", "family_name": "Shafer", "institution": null}, {"given_name": "Ilia", "family_name": "Nouretdinov", "institution": null}]}