{"title": "PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier", "book": "Advances in Neural Information Processing Systems", "page_first": 769, "page_last": 776, "abstract": null, "full_text": "PAC-Bayes Bounds for the Risk of the Majority Vote\n\nand the Variance of the Gibbs Classi\ufb01er\n\nAlexandre Lacasse, Franc\u00b8ois Laviolette and Mario Marchand\n\nD\u00b4epartement IFT-GLO\n\nUniversit\u00b4e Laval\nQu\u00b4ebec, Canada\n\nFirstname.Secondname@ift.ulaval.ca\n\nPascal Germain\n\nD\u00b4epartement IFT-GLO\n\nUniversit\u00b4e Laval Qu\u00b4ebec, Canada\n\nNicolas Usunier\n\nLaboratoire d\u2019informatique de Paris 6\n\nUniversit\u00b4e Pierre et Marie Curie, Paris, France\n\nPascal.Germain.1@ulaval.ca\n\nNicolas.Usunier@lip6.fr\n\nAbstract\n\nWe propose new PAC-Bayes bounds for the risk of the weighted majority vote that\ndepend on the mean and variance of the error of its associated Gibbs classi\ufb01er. We\nshow that these bounds can be smaller than the risk of the Gibbs classi\ufb01er and can\nbe arbitrarily close to zero even if the risk of the Gibbs classi\ufb01er is close to 1/2.\nMoreover, we show that these bounds can be uniformly estimated on the training\ndata for all possible posteriors Q. Moreover, they can be improved by using a\nlarge sample of unlabelled data.\n\n1 Introduction\n\nThe PAC-Bayes approach, initiated by [1], aims at providing PAC guarantees to \u201cBayesian-like\u201d\nlearning algorithms. Within this approach, we consider a prior1 distribution P over a space of\nclassi\ufb01ers that characterizes our prior belief about good classi\ufb01ers (before the observation of the\ndata) and a posterior distribution Q (over the same space of classi\ufb01ers) that takes into account the\nadditional information provided by the training data. 
A remarkable result, known as the "PAC-Bayes theorem", provides a risk bound for the Q-weighted majority vote by bounding the risk of an associated stochastic classifier called the Gibbs classifier. Previous bounds showed that one can de-randomize back to the Majority Vote classifier, but only at the cost of a worse risk bound. Naively, one would expect the de-randomized classifier to perform better. Indeed, it is well known that voting can dramatically improve performance when the "community" of classifiers tends to compensate for the individual errors. The actual PAC-Bayes framework is currently unable to evaluate whether or not this compensation occurs. Consequently, this framework cannot currently help in producing highly accurate voted combinations of classifiers.

In this paper, we present new PAC-Bayes bounds on the risk of the Majority Vote classifier based on the estimation of the mean and variance of the errors of the associated Gibbs classifier. These bounds allow us to prove that a sufficient condition for an accurate combination is that (1) the error of the Gibbs classifier is less than half and (2) the mean pairwise covariance of the errors of the classifiers appearing in the vote is small. In general, the bound allows us to detect when the voted combination provably outperforms its associated Gibbs classifier.

¹Priors have been used for many years in statistics. The priors in this paper have only indirect links with the Bayesian priors. We will nevertheless use this language, since it comes from previous work.

2 Basic Definitions

We consider binary classification problems where the input space X consists of an arbitrary subset of R^n and the output space Y = {-1, +1}. An example z def= (x, y) is an input-output pair where x ∈ X and y ∈ Y.
Throughout the paper, we adopt the PAC setting where each example z is drawn according to a fixed, but unknown, probability distribution D on X × Y.

We consider learning algorithms that work in a fixed hypothesis space H of binary classifiers (defined without reference to the training data). The risk R(h) of any classifier h: X → Y is defined as the probability that h misclassifies an example drawn according to D:

$$R(h) \overset{\text{def}}{=} \Pr_{(x,y)\sim D}\big(h(x) \neq y\big) = \mathop{\mathbb{E}}_{(x,y)\sim D} I(h(x) \neq y),$$

where I(a) = 1 if predicate a is true and 0 otherwise.

Given a training set S, m will always represent its number of examples. Moreover, if S = ⟨z_1, ..., z_m⟩, the empirical risk R_S(h) on S, of any classifier h, is defined according to:

$$R_S(h) \overset{\text{def}}{=} \frac{1}{m}\sum_{i=1}^m I(h(x_i) \neq y_i).$$

After observing the training set S, the task of the learner is to choose a posterior distribution Q over H such that the Q-weighted Majority Vote classifier BQ will have the smallest possible risk. On any input example x, the output BQ(x) of the Majority Vote classifier BQ (also called the Bayes classifier) is given by:

$$B_Q(x) \overset{\text{def}}{=} \operatorname{sgn}\Big[\mathop{\mathbb{E}}_{h\sim Q} h(x)\Big],$$

where sgn(s) = +1 if the real number s > 0 and sgn(s) = -1 otherwise. The output of the deterministic Majority Vote classifier BQ is thus closely related to the output of a stochastic classifier called the Gibbs classifier. To classify an input example x, the Gibbs classifier GQ chooses a (deterministic) classifier h randomly according to Q to classify x.
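The distinction between BQ and GQ can be sketched in a few lines of Python. This is a toy illustration with hypothetical votes and weights, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical votes h(x) in {-1, +1} for 5 classifiers on 8 inputs.
votes = rng.choice([-1, 1], size=(5, 8))
q = np.array([0.4, 0.3, 0.1, 0.1, 0.1])   # posterior Q over the classifiers

# Majority vote B_Q(x) = sgn(E_{h~Q} h(x)), with sgn(s) = -1 for s <= 0.
margin = q @ votes
b_q = np.where(margin > 0, 1, -1)

# The Gibbs classifier G_Q instead samples one classifier according to Q.
h = rng.choice(len(q), p=q)
g_q = votes[h]
```

Note that BQ is deterministic given Q, while GQ is stochastic: two calls with fresh randomness may predict differently on the same x.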
The true risk R(GQ) and the empirical risk R_S(GQ) of the Gibbs classifier are thus given by:

$$R(G_Q) \overset{\text{def}}{=} \mathop{\mathbb{E}}_{h\sim Q} R(h) = \mathop{\mathbb{E}}_{h\sim Q}\ \mathop{\mathbb{E}}_{(x,y)\sim D} I(h(x) \neq y) \qquad (1)$$

$$R_S(G_Q) \overset{\text{def}}{=} \mathop{\mathbb{E}}_{h\sim Q} R_S(h) = \mathop{\mathbb{E}}_{h\sim Q} \frac{1}{m}\sum_{i=1}^m I(h(x_i) \neq y_i). \qquad (2)$$

The PAC-Bayes theorem gives a tight risk bound for the Gibbs classifier GQ that depends on how far the chosen posterior Q is from a prior P that must be chosen before observing the data. The PAC-Bayes theorem was first proposed by [2]. The bound presented here can be found in [3].

Theorem 1 (PAC-Bayes Theorem) For any prior distribution P over H, and any δ ∈ ]0, 1], we have

$$\Pr_{S\sim D^m}\left(\forall\, Q \text{ over } H:\ \mathrm{kl}(R_S(G_Q)\,\|\,R(G_Q)) \leq \frac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta}\right]\right) \geq 1 - \delta,$$

where KL(Q‖P) is the Kullback-Leibler divergence between Q and P:

$$\mathrm{KL}(Q\|P) \overset{\text{def}}{=} \mathop{\mathbb{E}}_{h\sim Q} \ln\frac{Q(h)}{P(h)},$$

and where kl(q‖p) is the Kullback-Leibler divergence between the Bernoulli distributions with probability of success q and probability of success p: kl(q‖p) def= q ln(q/p) + (1-q) ln((1-q)/(1-p)).

This theorem has recently been generalized by [4] to the sample-compression setting. In this paper, however, we restrict ourselves to the more common case where the set H of classifiers is defined without reference to the training data.

A bound given for the risk of Gibbs classifiers can straightforwardly be turned into a bound for the risk of Majority Vote classifiers. Indeed, whenever BQ misclassifies x, at least half of the classifiers (under measure Q) misclassify x. It follows that the error rate of GQ is at least half the error rate of BQ. Hence R(BQ) ≤ 2R(GQ).
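Theorem 1 is typically applied by numerically inverting kl(·‖·): given the empirical Gibbs risk and the right-hand side, one searches for the largest true risk consistent with the inequality. The following sketch does this by bisection; the function names and tolerances are our own choices, not the authors' implementation:

```python
import math

def kl(q, p):
    """Binary KL divergence kl(q||p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_sup(emp_risk, rhs):
    """Largest r >= emp_risk such that kl(emp_risk||r) <= rhs, by bisection."""
    lo, hi = emp_risk, 1.0
    for _ in range(80):
        mid = (lo + hi) / 2
        if kl(emp_risk, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_bound(emp_gibbs_risk, kl_qp, m, delta=0.05):
    """Theorem 1: upper bound on R(G_Q) holding with probability >= 1 - delta."""
    rhs = (kl_qp + math.log((m + 1) / delta)) / m
    return kl_sup(emp_gibbs_risk, rhs)
```

As expected, the bound tightens toward the empirical risk as the sample size m grows and loosens as KL(Q‖P) grows.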
A method to decrease the R(BQ)/R(GQ) ratio to 1 + ε (for some small positive ε) has been proposed by [5] for large-margin classifiers. For a suitably chosen prior and posterior, [5] have also shown that R_S(GQ) is small when the corresponding Majority Vote classifier BQ achieves a large separating margin on the training data. Consequently, the PAC-Bayes theorem yields a tight risk bound for large-margin classifiers.

Even if we can imagine situations where R(BQ) > R(GQ), they are rarely encountered in practice. In fact, situations where R(BQ) is much smaller than R(GQ) seem to occur much more often. For example, consider the extreme case where the true label y of x is 1 iff E_{h~Q} h(x) > 1/2. In this case R(BQ) = 0 whereas R(GQ) can be as high as 1/2 - ε for some arbitrarily small ε. The situations where R(BQ) is much smaller than R(GQ) are not captured by the PAC-Bayes theorem.

In the next section, we provide a bound on R(BQ) that depends on R(GQ) and other properties that can be estimated from the training data. This bound can be arbitrarily close to 0 even for a large R(GQ), as long as R(GQ) < 1/2 and as long as we have a sufficiently large population of classifiers whose errors are sufficiently "uncorrelated".

3 A Bound on R(BQ) that Can Be Much Smaller than R(GQ)

All of our relations between R(BQ) and R(GQ) arise by considering the Q-weight WQ(x, y) of classifiers making errors on example (x, y):

$$W_Q(x, y) \overset{\text{def}}{=} \mathop{\mathbb{E}}_{h\sim Q} I(h(x) \neq y).$$

Clearly, we have:

$$\Pr_{(x,y)\sim D}\big(W_Q(x,y) > 1/2\big) \;\leq\; R(B_Q) \;\leq\; \Pr_{(x,y)\sim D}\big(W_Q(x,y) \geq 1/2\big).$$

Hence, Pr_{(x,y)~D}(WQ(x,y) ≥ 1/2) gives a very tight upper bound on R(BQ).
Moreover,

$$\mathop{\mathbb{E}}_{(x,y)\sim D} W_Q(x,y) = \mathop{\mathbb{E}}_{(x,y)\sim D}\ \mathop{\mathbb{E}}_{h\sim Q} I(h(x)\neq y) = R(G_Q) \qquad (3)$$

and

$$\begin{aligned}
\operatorname{Var}_{(x,y)\sim D}(W_Q) &= \mathop{\mathbb{E}}_{(x,y)\sim D}(W_Q)^2 - \Big(\mathop{\mathbb{E}}_{(x,y)\sim D} W_Q\Big)^2 && (4)\\
&= \mathop{\mathbb{E}}_{h_1\sim Q}\ \mathop{\mathbb{E}}_{h_2\sim Q}\ \Big(\mathop{\mathbb{E}}_{(x,y)\sim D} I(h_1(x)\neq y)\, I(h_2(x)\neq y)\Big) - R^2(G_Q) && (5)\\
&= \mathop{\mathbb{E}}_{h_1\sim Q}\ \mathop{\mathbb{E}}_{h_2\sim Q}\ \Big(\mathop{\mathbb{E}}_{(x,y)\sim D} I(h_1(x)\neq y)\, I(h_2(x)\neq y) - R(h_1)R(h_2)\Big)\\
&\overset{\text{def}}{=} \mathop{\mathbb{E}}_{h_1\sim Q}\ \mathop{\mathbb{E}}_{h_2\sim Q}\ \mathrm{cov}_{\mathrm{err}}(h_1, h_2), && (6)
\end{aligned}$$

where cov_err(h1, h2) denotes the covariance of the errors of h1 and h2 on examples drawn from D. The next theorem is therefore a direct consequence of the one-sided Chebychev (or Cantelli-Chebychev) inequality [6]: Pr(WQ ≥ a + E(WQ)) ≤ Var(WQ)/(Var(WQ) + a²) for any a > 0.

Theorem 2 For any distribution Q over a class of classifiers, if R(GQ) ≤ 1/2 then we have

$$R(B_Q) \;\leq\; \frac{\operatorname{Var}_{(x,y)\sim D}(W_Q)}{\operatorname{Var}_{(x,y)\sim D}(W_Q) + \big(1/2 - R(G_Q)\big)^2} \;=\; \frac{\operatorname{Var}_{(x,y)\sim D}(1 - 2W_Q)}{\mathop{\mathbb{E}}_{(x,y)\sim D}(1 - 2W_Q)^2} \;\overset{\text{def}}{=}\; C_Q.$$

We will always use here the first form of CQ. However, note that 1 - 2WQ = Σ_{h∈H} Q(h) y h(x) is just the margin of the Q-convex combination realized on (x, y). Hence, the second form of CQ is simply the variance of the margin divided by its second moment!

The looser two-sided Chebychev inequality was used in [7] to bound the risk of random forests. However, the one-sided bound CQ is much tighter. For example, the two-sided bound in [7] diverges when R(GQ) → 1/2, whereas CQ ≤ 1 whenever R(GQ) ≤ 1/2.
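Theorem 2 can be illustrated numerically. In the following sketch, the error matrix and the posterior weights are hypothetical toy data, and `c_q_bound` is our own name for a function that estimates R(GQ), Var(WQ) and CQ from a sample:

```python
import numpy as np

def c_q_bound(errors, weights):
    """One-sided Chebychev bound C_Q of Theorem 2, estimated on a sample.

    errors:  (n_voters, n_examples) 0/1 matrix, errors[j, i] = I(h_j(x_i) != y_i)
    weights: posterior Q over the voters (non-negative, sums to 1)
    """
    w_q = weights @ errors              # W_Q(x_i, y_i) for each example
    gibbs_risk = w_q.mean()             # estimate of R(G_Q)
    variance = w_q.var()                # estimate of Var(W_Q)
    assert gibbs_risk <= 0.5, "Theorem 2 requires R(G_Q) <= 1/2"
    return variance / (variance + (0.5 - gibbs_risk) ** 2)

# Toy ensemble: 3 voters whose errors exactly compensate each other.
errors = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0]])
weights = np.array([1/3, 1/3, 1/3])
```

On this toy ensemble WQ is constant (1/3 on every example), so Var(WQ) = 0 and CQ = 0 even though the Gibbs risk is 1/3: exactly the compensation effect the theorem is designed to capture.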
In fact, as explained in [8], the one-sided Chebychev bound is the tightest possible upper bound for any random variable that is based only on its expectation and variance.

The next result shows that, when the number of voters tends to infinity (and the weight of each voter tends to zero), the variance of WQ will tend to 0 provided that the average covariance of the errors of all pairs of distinct voters is ≤ 0. In particular, the variance will always tend to 0 if the errors of the voters are pairwise independent.

Proposition 3 For any countable class H of classifiers and any distribution Q over H, we have

$$\operatorname{Var}_{(x,y)\sim D}(W_Q) \;\leq\; \frac{1}{4}\sum_{h\in H} Q^2(h) \;+\; \sum_{h_1\in H}\ \sum_{\substack{h_2\in H:\\ h_2\neq h_1}} Q(h_1)Q(h_2)\,\mathrm{cov}_{\mathrm{err}}(h_1,h_2).$$

The proof is straightforward and is left to the reader. The key observation that comes out of this result is that Σ_{h∈H} Q²(h) is usually much smaller than one. Consider, for example, the case where Q is uniform on H with |H| = n. Then Σ_{h∈H} Q²(h) = 1/n. Moreover, if cov_err(h1, h2) ≤ 0 for each pair of distinct classifiers in H, then Var(WQ) ≤ 1/(4n). Hence, in these cases, we have that CQ ∈ O(1/n) whenever 1/2 - R(GQ) is larger than some positive constant independent of n. Thus, even when R(GQ) is large, we see that R(BQ) can be arbitrarily close to 0 as we increase the number of classifiers having non-positive pairwise covariance of their errors.

To further motivate the use of CQ, we have investigated, on several UCI binary classification data sets, how R(GQ), Var(WQ) and CQ are respectively related to R(BQ). The results of Figure 1 have been obtained with the Adaboost [9] algorithm used with "decision stumps" as weak learners. Each data set was split in two halves: one used for training and the other for testing. In the chart relating R(GQ) and R(BQ), we see that we almost always have R(BQ) < R(GQ). There is, however, no clear correlation between R(BQ) and R(GQ).
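Returning to Proposition 3, the variance decomposition behind it is easy to check empirically. The following sketch uses hypothetical toy data with independently drawn voter errors, so the empirical pairwise error covariances are near zero and the inequality holds on the sample:

```python
import numpy as np

rng = np.random.default_rng(42)
n_voters, n_examples = 6, 2000

# Hypothetical ensemble: 0/1 error indicators drawn independently per voter.
errors = (rng.random((n_voters, n_examples)) < 0.4).astype(float)
q = np.full(n_voters, 1.0 / n_voters)     # uniform posterior Q

w_q = q @ errors                          # W_Q on each example
var_wq = w_q.var()

# Right-hand side of Proposition 3, with empirical error covariances.
cov = np.cov(errors, bias=True)           # (n_voters, n_voters), rows = voters
diag_term = 0.25 * np.sum(q ** 2)         # (1/4) * sum_h Q(h)^2
off_diag = np.sum(np.outer(q, q) * cov) - np.sum(q ** 2 * np.diag(cov))
bound = diag_term + off_diag

assert var_wq <= bound + 1e-9
```

The check works because Var(WQ) = qᵀ Σ q for the empirical error covariance matrix Σ, and each voter's Bernoulli error variance is at most 1/4, which is exactly the diagonal term of the proposition.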
We also see no clear correlation between R(BQ) and Var(WQ) in the second chart. In contrast, the chart of CQ vs R(BQ) shows a strong correlation. Indeed, it is almost a linear relation!

Figure 1: Relation, on various data sets, between R(BQ) and R(GQ), Var(WQ), and CQ. [Three scatter plots over the UCI data sets breast-cancer, breast-w, credit-g, hepatitis, ionosphere, kr-vs-kp, labor, mushroom, sick, sonar and vote, with test values of R(GQ), Var(WQ) and CQ on the horizontal axes and R(BQ) on the vertical axis.]

4 New PAC-Bayes Theorems

A uniform estimate of CQ can be obtained if we have uniform upper bounds on R(GQ) and on the variance of WQ. While the original PAC-Bayes theorem provides an upper bound on R(GQ) that holds uniformly for all posteriors Q, obtaining such bounds for the variance of a random variable is still an issue.
To achieve this goal, we will have to generalize the PAC-Bayes theorem to expectations over pairs of classifiers, since E(W_Q²) is fundamentally such an expectation.

Definition 4 For any probability distribution Q over H, we define the expected joint error (eQ), the expected joint success (sQ), and the expected disagreement (dQ) as

$$\begin{aligned}
e_Q &\overset{\text{def}}{=} \mathop{\mathbb{E}}_{h_1\sim Q}\ \mathop{\mathbb{E}}_{h_2\sim Q}\ \mathop{\mathbb{E}}_{(x,y)\sim D}\ I(h_1(x)\neq y)\,I(h_2(x)\neq y)\\
s_Q &\overset{\text{def}}{=} \mathop{\mathbb{E}}_{h_1\sim Q}\ \mathop{\mathbb{E}}_{h_2\sim Q}\ \mathop{\mathbb{E}}_{(x,y)\sim D}\ I(h_1(x)= y)\,I(h_2(x)= y)\\
d_Q &\overset{\text{def}}{=} \mathop{\mathbb{E}}_{h_1\sim Q}\ \mathop{\mathbb{E}}_{h_2\sim Q}\ \mathop{\mathbb{E}}_{(x,y)\sim D}\ I(h_1(x)\neq h_2(x)).
\end{aligned}$$

The empirical estimates, over a training set S = ⟨z_1, ..., z_m⟩, of these expectations are defined as usual, i.e., ê_Q def= E_{h_1~Q} E_{h_2~Q} (1/m) Σ_{i=1}^m I(h_1(x_i)≠y_i) I(h_2(x_i)≠y_i), etc.

It is easy to see that

$$e_Q = \mathop{\mathbb{E}}_{(x,y)\sim D} W_Q^2,\qquad s_Q = \mathop{\mathbb{E}}_{(x,y)\sim D}(1-W_Q)^2,\qquad d_Q = \mathop{\mathbb{E}}_{(x,y)\sim D} 2W_Q(1-W_Q). \qquad (7)$$

Thus, we have eQ + sQ + dQ = 1 and 2eQ + dQ = 2R(GQ).
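The identities eQ + sQ + dQ = 1 and 2eQ + dQ = 2R(GQ) can be verified numerically. In this sketch the error matrix and the posterior are again hypothetical toy values, and `joint_quantities` is our own helper name:

```python
import numpy as np

def joint_quantities(errors, weights):
    """Empirical e_Q, s_Q, d_Q of Definition 4, computed through W_Q.

    errors:  (n_voters, n_examples) 0/1 error-indicator matrix
    weights: posterior Q over the voters
    """
    w_q = weights @ errors
    e_q = np.mean(w_q ** 2)              # expected joint error
    s_q = np.mean((1 - w_q) ** 2)        # expected joint success
    d_q = np.mean(2 * w_q * (1 - w_q))   # expected disagreement
    return e_q, s_q, d_q

errors = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
weights = np.array([0.5, 0.25, 0.25])
e_q, s_q, d_q = joint_quantities(errors, weights)
gibbs_risk = (weights @ errors).mean()
```

Both identities hold exactly on any sample, since they are algebraic consequences of eQ = E W², sQ = E(1-W)² and dQ = E 2W(1-W).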
This implies

$$R(G_Q) = e_Q + \frac{1}{2}\, d_Q = \frac{1}{2}\,(1 + e_Q - s_Q) \qquad (8)$$

$$\operatorname{Var}(W_Q) = e_Q - \big(R(G_Q)\big)^2 = e_Q - \Big(e_Q + \frac{1}{2}\, d_Q\Big)^2 = e_Q - \frac{1}{4}\,(1 + e_Q - s_Q)^2. \qquad (9)$$

Moreover, in this new setting, the denominator of CQ can elegantly be rewritten as

$$\operatorname{Var}(W_Q) + \big(1/2 - R(G_Q)\big)^2 = 1/4 - d_Q/2. \qquad (10)$$

The next theorem can be used to bound separately either eQ, sQ or dQ.

Theorem 5 For any prior distribution P over H, and any δ ∈ ]0, 1], we have:

$$\Pr_{S\sim D^m}\left(\forall\, Q \text{ over } H:\ \mathrm{kl}(\hat{\alpha}_Q\,\|\,\alpha_Q) \leq \frac{1}{m}\left[2\cdot\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta}\right]\right) \geq 1 - \delta,$$

where αQ can be either eQ, sQ or dQ.

In contrast with Theorem 5, the next theorem will enable us to bound Var(WQ) directly, by bounding any pair of expectations among eQ, sQ and dQ.

Theorem 6 For any prior distribution P over H, and any δ ∈ ]0, 1], we have:

$$\Pr_{S\sim D^m}\left(\forall\, Q \text{ over } H:\ \mathrm{kl}(\hat{\alpha}_Q, \hat{\beta}_Q\,\|\,\alpha_Q, \beta_Q) \leq \frac{1}{m}\left[2\cdot\mathrm{KL}(Q\|P) + \ln\frac{(m+1)(m+2)}{2\delta}\right]\right) \geq 1 - \delta,$$

where αQ and βQ can be any two distinct choices among eQ, sQ and dQ, and where

$$\mathrm{kl}(q_1, q_2\,\|\,p_1, p_2) \overset{\text{def}}{=} q_1 \ln\frac{q_1}{p_1} + q_2 \ln\frac{q_2}{p_2} + (1 - q_1 - q_2) \ln\frac{1 - q_1 - q_2}{1 - p_1 - p_2}$$

is the Kullback-Leibler divergence between the distributions of two trivalent random variables Y_q and Y_p with P(Y_q = a) = q_1, P(Y_q = b) = q_2 and P(Y_q = c) = 1 - q_1 - q_2 (and similarly for Y_p).

The proof of Theorem 5 can be seen as a special case of Theorem 1.
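The trivalent KL divergence of Theorem 6 is straightforward to compute. A minimal sketch (the name `kl_trivalent` is ours; only the q = 0 limit is handled, which suffices for valid trivalent distributions with non-zero reference probabilities):

```python
import math

def kl_trivalent(q1, q2, p1, p2):
    """KL divergence of Theorem 6 between the trivalent distributions
    (q1, q2, 1-q1-q2) and (p1, p2, 1-p1-p2)."""
    def term(q, p):
        # Uses the convention 0*log(0/p) = 0.
        return 0.0 if q == 0 else q * math.log(q / p)
    return (term(q1, p1) + term(q2, p2)
            + term(1 - q1 - q2, 1 - p1 - p2))
```

When the second component vanishes (q2 = p2 = 0), this reduces to the binary kl(q1‖p1) of Theorem 1, which is why Theorem 5 can be seen as a special case.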
The proof of Theorem 6 essentially follows the proof of Theorem 1 given in [4], except that it is based on a trinomial distribution instead of a binomial one².

²For the proofs of these theorems, see the long version of the paper at http://www.ift.ulaval.ca/~laviolette/Publications/publications.html.

5 PAC-Bayes Bounds for Var(WQ) and R(BQ)

From the two theorems of the preceding section, one can easily derive several PAC-Bayes bounds on the variance of WQ and, therefore, on the risk of the majority vote. Since CQ is a quotient, an upper bound on CQ will degrade rapidly if the bounds on the numerator and the denominator are not tight, especially for majority votes obtained by boosting algorithms, where both the numerator and the denominator tend to be small. For this reason, we will derive more than one PAC-Bayes bound for the majority vote, and compare their accuracy. First, we need the following notations, which are related to Theorems 1, 5 and 6. Given any prior distribution P over H,

$$\begin{aligned}
\mathcal{R}^\delta_{Q,S} &\overset{\text{def}}{=} \Big\{ r : \mathrm{kl}(R_S(G_Q)\,\|\,r) \leq \tfrac{1}{m}\big[\mathrm{KL}(Q\|P) + \ln\tfrac{m+1}{\delta}\big] \Big\},\\
\mathcal{E}^\delta_{Q,S} &\overset{\text{def}}{=} \Big\{ e : \mathrm{kl}(\hat{e}_Q\,\|\,e) \leq \tfrac{1}{m}\big[2\cdot\mathrm{KL}(Q\|P) + \ln\tfrac{m+1}{\delta}\big] \Big\},\\
\mathcal{D}^\delta_{Q,S} &\overset{\text{def}}{=} \Big\{ d : \mathrm{kl}(\hat{d}_Q\,\|\,d) \leq \tfrac{1}{m}\big[2\cdot\mathrm{KL}(Q\|P) + \ln\tfrac{m+1}{\delta}\big] \Big\},\\
\mathcal{A}^\delta_{Q,S} &\overset{\text{def}}{=} \Big\{ (e,s) : \mathrm{kl}(\hat{e}_Q,\hat{s}_Q\,\|\,e,s) \leq \tfrac{1}{m}\big[2\cdot\mathrm{KL}(Q\|P) + \ln\tfrac{(m+1)(m+2)}{\delta}\big] \Big\}.
\end{aligned}$$

Since v/(v+a) = 1/(1+a/v), it follows from Theorem 2 that an upper bound on both Var(WQ) and R(GQ) will give an upper bound on CQ, and hence on R(BQ).
Hence, a first bound can be obtained, from Equation 9, by suitably applying Theorem 5 (with αQ = eQ) and Theorem 1.

PAC-Bound 1 For any prior distribution P over H, and any δ ∈ ]0, 1], we have

$$\Pr_{S\sim D^m}\left(\forall\, Q \text{ over } H:\ \operatorname{Var}_{(x,y)\sim D} W_Q \leq \sup \mathcal{E}^{\delta/2}_{Q,S} - \big(\inf \mathcal{R}^{\delta/2}_{Q,S}\big)^2 \right) \geq 1 - \delta,$$

$$\Pr_{S\sim D^m}\left(\forall\, Q \text{ over } H:\ R(B_Q) \leq \frac{\sup \mathcal{E}^{\delta/2}_{Q,S} - \big(\inf \mathcal{R}^{\delta/2}_{Q,S}\big)^2}{\sup \mathcal{E}^{\delta/2}_{Q,S} - \big(\inf \mathcal{R}^{\delta/2}_{Q,S}\big)^2 + \big(\tfrac{1}{2} - \sup \mathcal{R}^{\delta/2}_{Q,S}\big)^2}\right) \geq 1 - \delta.$$

Since Bound 1 necessitates two PAC approximations to calculate the variance, it would be better if we could obtain an upper bound for Var(WQ) directly. The following result, which is a direct consequence of Theorem 6 and Equation 9, shows how this can be done.

PAC-Bound 2 For any prior distribution P over H, and any δ ∈ ]0, 1], we have

$$\Pr_{S\sim D^m}\left(\forall\, Q \text{ over } H:\ \operatorname{Var}_{(x,y)\sim D} W_Q \leq \sup_{(e,s)\in \mathcal{A}^{\delta}_{Q,S}} \Big\{ e - \tfrac{1}{4}\,(1 + e - s)^2 \Big\}\right) \geq 1 - \delta,$$

$$\Pr_{S\sim D^m}\left(\forall\, Q \text{ over } H:\ R(B_Q) \leq \frac{\sup_{(e,s)\in \mathcal{A}^{\delta/2}_{Q,S}} \big\{ e - \tfrac{1}{4}(1 + e - s)^2 \big\}}{\sup_{(e,s)\in \mathcal{A}^{\delta/2}_{Q,S}} \big\{ e - \tfrac{1}{4}(1 + e - s)^2 \big\} + \big(\tfrac{1}{2} - \sup \mathcal{R}^{\delta/2}_{Q,S}\big)^2}\right) \geq 1 - \delta.$$

As illustrated in Figure 2, Bound 2 is generally tighter than Bound 1.
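PAC-Bound 1 can be assembled by bisection over the kl balls defining the sets R and E. The following is a sketch under our own naming and tolerances, not the authors' implementation; it assumes the supremum of R^{δ/2} stays below 1/2, as Theorem 2 requires:

```python
import math

def kl(q, p):
    """Binary KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_extremum(emp, rhs, upper=True):
    """sup (or inf) of the interval {p : kl(emp||p) <= rhs}, by bisection."""
    lo, hi = (emp, 1.0) if upper else (0.0, emp)
    for _ in range(80):
        mid = (lo + hi) / 2
        inside = kl(emp, mid) <= rhs
        if upper:
            lo, hi = (mid, hi) if inside else (lo, mid)
        else:
            lo, hi = (lo, mid) if inside else (mid, hi)
    return lo if upper else hi

def pac_bound_1(emp_gibbs, emp_e, kl_qp, m, delta=0.05):
    """Sketch of PAC-Bound 1 on R(B_Q), combining Theorems 1 and 5."""
    rhs_r = (kl_qp + math.log((m + 1) / (delta / 2))) / m
    rhs_e = (2 * kl_qp + math.log((m + 1) / (delta / 2))) / m
    sup_r = kl_extremum(emp_gibbs, rhs_r, upper=True)
    inf_r = kl_extremum(emp_gibbs, rhs_r, upper=False)
    sup_e = kl_extremum(emp_e, rhs_e, upper=True)
    var_bound = sup_e - inf_r ** 2
    return var_bound / (var_bound + (0.5 - sup_r) ** 2)
```

As with Theorem 1, the resulting bound tightens as m grows, since all three kl balls shrink toward the empirical values.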
This gain is principally due to the fact that the values of e and s that are used to bound the variance are tied together inside the kl() and have to trade off their values (e "tries to be" as large as possible and s as small as possible). Because of this tradeoff, e is generally not an upper bound of eQ, and s not a lower bound of sQ.

In the semi-supervised framework, we can achieve better results, because the labels of the examples do not affect the value of dQ (see Definition 4). Hence, in the presence of a large amount of unlabelled data, one can use Theorem 5 to obtain very accurate upper and lower bounds of dQ. This, combined with an upper bound of eQ, still computed via Theorem 5 but on the labelled data, gives rise to the following semi-supervised upper bound³ of Var(WQ). The bound on R(BQ) then follows from Theorem 2 and Equation 10.

PAC-Bound 3 (semi-supervised bound) For any prior distribution P over H, and any δ ∈ ]0, 1]:

$$\Pr_{\substack{S\sim D^m\\ S'\sim D^{m'}_{\text{unlabelled}}}}\left(\forall\, Q \text{ over } H:\ \operatorname{Var}_{(x,y)\sim D} W_Q \leq \sup \mathcal{E}^{\delta}_{Q,S} - \Big(\sup \mathcal{E}^{\delta}_{Q,S} + \tfrac{1}{2}\cdot \inf \mathcal{D}^{\delta}_{Q,S'}\Big)^2 \right) \geq 1 - \delta,$$

$$\Pr_{\substack{S\sim D^m\\ S'\sim D^{m'}_{\text{unlabelled}}}}\left(\forall\, Q \text{ over } H:\ R(B_Q) \leq \frac{\sup \mathcal{E}^{\delta}_{Q,S} - \Big(\sup \mathcal{E}^{\delta}_{Q,S} + \tfrac{1}{2}\cdot \inf \mathcal{D}^{\delta}_{Q,S'}\Big)^2}{1/4 - \tfrac{1}{2}\cdot \sup \mathcal{D}^{\delta}_{Q,S'}}\right) \geq 1 - \delta.$$

We see, on the left part of Figure 2, that Bound 2 on Var(WQ) is much tighter than Bound 1. We can also see that, by using unlabelled data⁴ to estimate dQ, Bound 3 provides another significant improvement. These numerical results were obtained by using Adaboost [9] with decision stumps on the Mushroom UCI data set (which contains 8124 examples).
This data set was randomly split into two halves: one for training and one for testing.

Figure 2: Bounds on Var(WQ) (left) and bounds on R(BQ) (right).

As illustrated by Figure 2, Bounds 2 and 3 are (respectively for the supervised and semi-supervised frameworks) very tight upper bounds on the variance. Unfortunately, they do not lead to tight upper bounds on R(BQ). Indeed, one can see in Figure 2 that after T = 8, all the bounds degrade even though the true value of CQ (on which they are based) continues to decrease. This drawback is due to the fact that, when the value of dQ tends to 1/2, the denominator of CQ tends to 0. Hence, if dQ is close to 1/2, Var(WQ) must be small as well. Thus, any slack in the bound of Var(WQ) has a multiplicative effect on each of the three proposed PAC-bounds on R(BQ). Unfortunately, boosting algorithms tend to construct majority votes with an expected disagreement dQ just slightly under 1/2. Based on the next proposition, we will show that this drawback is, in a sense, unavoidable.

Proposition 7 (Inapproachability result) Let Q be any distribution over a class of classifiers, and let B < 1 be any upper bound of CQ which holds with confidence 1 - δ. If R(GQ) < 1/2 then

$$\frac{1}{2} - \sqrt{(1 - B)\,\big(1/4 - d_Q/2\big)}$$

is an upper bound of R(GQ) which holds with confidence 1 - δ.

³It follows, from an easy calculation, that a lower bound of dQ, together with an upper bound of eQ, gives rise to an upper bound of eQ - (eQ + dQ/2)². By Equation 9, we then obtain an upper bound of Var(WQ).

⁴The UCI database (used here) does not have any unlabeled examples.
To simulate the extreme case where we have an infinite amount of unlabeled data, we simply used the empirical value of dQ computed on the testing set.

[Figure 2 legend: PAC-Bounds 1, 2 and 3 on Var(WQ) together with Var(WQ) on the test set (left); PAC-Bounds 1, 2 and 3 on R(BQ), the bound 2R(GQ) from Theorem 1, and CQ and R(BQ) on the test set (right), as a function of the number of boosting rounds T, for T = 1 to 41.]

For the data set used in Figure 2, Proposition 7, together with Bound 3 on R(BQ) (viewed as a bound on CQ), gives a PAC-bound on R(GQ) which is just slightly lower (≈ 0.5%) than the classical PAC-Bayes bound on R(GQ) given by Theorem 1. Since any bound better than Bound 3 for CQ would continue to improve the bound on R(GQ), it seems unlikely that such a better bound exists. Moreover, this drawback should occur for any bound on the majority vote that only considers the Gibbs risk and the variance of WQ because, as already explained, CQ is the tightest possible bound on R(BQ) that is based only on E(WQ) and Var(WQ). Hence, to improve our results in the situation where dQ is close to 1/2, one will have to consider higher moments. However, it is not clear that this will lead to a better bound on R(BQ) because, even if Theorem 5 generalizes to higher moments, its tightness then degrades. Indeed, for the kth moment, the factor 2 that multiplies KL(Q‖P) in Theorem 5 grows to k. However, it might be possible to overcome this degradation by using a generalization of Theorem 6, as we have done in this paper to obtain our tightest supervised bound for the variance (Bound 2). Indeed, if we evaluate the tightness of that bound on the variance (w.r.t. its value on the test set), and compare it with the tightness of the bound on R(GQ) given by Theorem 1, we find that both accuracies are at about 3%.
This is to be contrasted with the tightness of Bound 1, and seems to indicate that we have prevented degradation even though the variance involves both the first and the second moment of WQ, whereas the Gibbs risk involves only the first moment.

6 Conclusion

We have derived a risk bound for the weighted majority vote that depends on the mean and variance of the error of its associated Gibbs classifier (Theorem 2). The proposed bound is based on the one-sided Chebychev inequality, which is the tightest inequality for any real-valued random variable given only the expectation and the variance. As shown in Figure 1, this bound seems to have a strong predictive power on the risk of the majority vote.

We have also shown that the original PAC-Bayes theorem, together with new ones, can be used to obtain high-confidence estimates of this new risk bound that hold uniformly for all posterior distributions. Moreover, the new PAC-Bayes theorems give rise to the first uniform bounds on the variance of the Gibbs risk (more precisely, the variance of the associated random variable WQ). Even if there are arguments showing that bounds on higher moments of WQ should be looser, we have empirically found that one of the proposed bounds (Bound 2) does not show any sign of degradation in comparison with the classical PAC-Bayes bound on R(GQ) (which involves only the first moment). Surprisingly, there is even an improvement for Bound 3 in the semi-supervised framework. This also opens up the possibility that the generalization of Theorem 2 to higher moments be applicable to real data. Such generalizations might overcome the main drawback of our approach, namely, the fact that the PAC-bounds based on Theorem 2 degrade when the expected disagreement dQ is close to 1/2.

Acknowledgments: Work supported by NSERC Discovery grants 262067 and 0122405.

References

[1] David McAllester. Some PAC-Bayesian theorems.
Machine Learning, 37:355-363, 1999.

[2] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5-21, 2003.

[3] David McAllester. Simplified PAC-Bayesian margin bounds. Proceedings of the 16th Annual Conference on Learning Theory, Lecture Notes in Artificial Intelligence, 2777:203-215, 2003.

[4] François Laviolette and Mario Marchand. PAC-Bayes risk bounds for sample-compressed Gibbs classifiers. Proc. of the 22nd International Conference on Machine Learning (ICML 2005), pages 481-488, 2005.

[5] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 423-430. MIT Press, Cambridge, MA, 2003.

[6] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer Verlag, New York, NY, 1996.

[7] Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

[8] Dimitris Bertsimas and Ioana Popescu. Optimal inequalities in probability theory: A convex optimization approach. SIAM J. on Optimization, 15(3):780-804, 2005.

[9] Robert E. Schapire and Yoram Singer. Improved boosting using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.
", "award": [], "sourceid": 2959, "authors": [{"given_name": "Alexandre", "family_name": "Lacasse", "institution": null}, {"given_name": "François", "family_name": "Laviolette", "institution": null}, {"given_name": "Mario", "family_name": "Marchand", "institution": null}, {"given_name": "Pascal", "family_name": "Germain", "institution": null}, {"given_name": "Nicolas", "family_name": "Usunier", "institution": null}]}