{"title": "Consistency of weighted majority votes", "book": "Advances in Neural Information Processing Systems", "page_first": 3446, "page_last": 3454, "abstract": "We revisit from a statistical learning perspective the classical decision-theoretic problem of weighted expert voting. In particular, we examine the consistency (both asymptotic and finitary) of the optimal Nitzan-Paroush weighted majority and related rules. In the case of known expert competence levels, we give sharp error estimates for the optimal rule. When the competence levels are unknown, they must be empirically estimated. We provide frequentist and Bayesian analyses for this situation. Some of our proof techniques are non-standard and may be of independent interest. The bounds we derive are nearly optimal, and several challenging open problems are posed. Experimental results are provided to illustrate the theory.", "full_text": "Consistency of weighted majority votes\n\nDaniel Berend Computer Science Department and Mathematics Department\n\nBen Gurion University\n\nBeer Sheva, Israel berend@cs.bgu.ac.il\n\nAryeh Kontorovich\n\nComputer Science Department Ben Gurion University\n\nBeer Sheva, Israel karyeh@cs.bgu.ac.il\n\nAbstract\n\nWe revisit from a statistical learning perspective the classical decision-theoretic\nproblem of weighted expert voting.\nIn particular, we examine the consistency\n(both asymptotic and \ufb01nitary) of the optimal Nitzan-Paroush weighted majority\nand related rules. In the case of known expert competence levels, we give sharp\nerror estimates for the optimal rule. When the competence levels are unknown,\nthey must be empirically estimated. We provide frequentist and Bayesian analyses\nfor this situation. Some of our proof techniques are non-standard and may be\nof independent interest. 
The bounds we derive are nearly optimal, and several challenging open problems are posed.\n\n1 Introduction\n\nImagine independently consulting a small set of medical experts for the purpose of reaching a binary decision (e.g., whether to perform some operation). Each doctor has some “reputation”, which can be modeled as his probability of giving the right advice. The problem of weighting the input of several experts arises in many situations and is of considerable theoretical and practical importance. The rigorous study of majority vote has its roots in the work of Condorcet [1]. By the 70s, the field of decision theory was actively exploring various voting rules (see [2] and the references therein). A typical setting is as follows. An agent is tasked with predicting some random variable Y ∈ {±1} based on input X_i ∈ {±1} from each of n experts. Each expert X_i has a competence level p_i ∈ (0, 1), which is the probability of making a correct prediction: P(X_i = Y) = p_i. Two simplifying assumptions are commonly made:\n\n(i) Independence: The random variables {X_i : i ∈ [n]} are mutually independent conditioned on the truth Y.\n(ii) Unbiased truth: P(Y = +1) = P(Y = −1) = 1/2.\n\nWe will discuss these assumptions below in greater detail; for now, let us just take them as given. (Since the bias of Y can be easily estimated from data, only the independence assumption is truly restrictive.) A decision rule is a mapping f : {±1}^n → {±1} from the n expert inputs to the agent’s final decision. Our quantity of interest throughout the paper will be the agent’s probability of error,\n\nP(f(X) ≠ Y).  (1)\n\nA decision rule f is optimal if it minimizes the quantity in (1) over all possible decision rules. It was shown in [2] that, when Assumptions (i)–(ii) hold and the true competences p_i are known, the optimal decision rule is obtained by an appropriately weighted majority vote:\n\nf^OPT(x) = sign(∑_{i=1}^n w_i x_i),  (2)\n\nwhere the weights w_i are given by\n\nw_i = log(p_i / (1 − p_i)),  i ∈ [n].  (3)\n\nThus, w_i is the log-odds of expert i being correct — and the voting rule in (2), also known as naive Bayes [3], may be seen as a simple consequence of the Neyman-Pearson lemma [4].\n\nMain results. The formula in (2) raises immediate questions, which apparently have not previously been addressed. The first one has to do with the consistency of the Nitzan-Paroush optimal rule: under what conditions does the probability of error decay to zero and at what rate? In Section 3, we show that the probability of error is controlled by the committee potential Φ, defined by\n\nΦ = ∑_{i=1}^n (p_i − 1/2) w_i = ∑_{i=1}^n (p_i − 1/2) log(p_i / (1 − p_i)).  (4)\n\nMore precisely, we prove in Theorem 1 that log P(f^OPT(X) ≠ Y) ≍ −Φ, where ≍ denotes equivalence up to universal multiplicative constants.\nAnother issue not addressed by the Nitzan-Paroush result is how to handle the case where the competences p_i are not known exactly but rather estimated empirically by p̂_i. We present two solutions to this problem: a frequentist and a Bayesian one. As we show in Section 4, the frequentist approach does not admit an optimal empirical decision rule. Instead, we analyze empirical decision rules in various settings: high-confidence (i.e., |p̂_i − p_i| ≪ 1) vs. low-confidence, adaptive vs. nonadaptive. The low-confidence regime requires no additional assumptions, but gives weaker guarantees (Theorem 5). 
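As a quick illustration of the known-competence setting above (our own sketch with invented names, not part of the paper), the code below implements the optimal rule (2)-(3) and the committee potential (4), and checks a Monte Carlo estimate of the error probability against exp(−Φ/2), the constant from Theorem 1(i) below.

```python
import math
import random

def np_weights(p):
    # Nitzan-Paroush weights w_i = log(p_i / (1 - p_i)), eq. (3)
    return [math.log(pi / (1 - pi)) for pi in p]

def committee_potential(p):
    # Phi = sum_i (p_i - 1/2) w_i, eq. (4)
    return sum((pi - 0.5) * wi for pi, wi in zip(p, np_weights(p)))

def f_opt(x, w):
    # weighted majority vote, eq. (2); a tie (sum exactly 0) counts as an error
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else (-1 if s < 0 else 0)

def error_rate(p, trials=20000, seed=0):
    # Monte Carlo estimate of P(f_opt(X) != Y) under assumptions (i)-(ii)
    rng = random.Random(seed)
    w = np_weights(p)
    errs = 0
    for _ in range(trials):
        y = rng.choice([-1, 1])  # unbiased truth, assumption (ii)
        x = [y if rng.random() < pi else -y for pi in p]  # independent experts, (i)
        errs += f_opt(x, w) != y
    return errs / trials

p = [0.9, 0.8, 0.7, 0.6, 0.55]
phi = committee_potential(p)
assert error_rate(p) <= math.exp(-phi / 2)  # upper bound of Theorem 1(i)
```

The simulation setup (five experts with the listed competences) is ours; any committee with Φ large enough gives a rapidly vanishing error.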
In the high-confidence regime, the adaptive approach produces error estimates in terms of the empirical p̂_i's (Theorem 7), while the nonadaptive approach yields a bound in terms of the unknown p_i's, which still leads to useful asymptotics (Theorem 6). The Bayesian solution sidesteps the various cases above, as it admits a simple, provably optimal empirical decision rule (Section 5). Unfortunately, we are unable to compute (or even nontrivially estimate) the probability of error induced by this rule; this is posed as a challenging open problem.\n\n2 Related work\n\nMachine learning theory typically clusters weighted majority [5, 6] within the framework of online algorithms; see [7] for a modern treatment. Since the online setting is considerably more adversarial than ours, we obtain very different weighted majority rules and consistency guarantees. The weights w_i in (2) bear a striking similarity to the Adaboost update rule [8, 9]. However, the latter assumes weak learners with access to labeled examples, while in our setting the experts are “static”. Still, we do not rule out a possible deeper connection between the Nitzan-Paroush decision rule and boosting.\nIn what began as the influential Dawid-Skene model [10] and is now known as crowdsourcing, one attempts to extract accurate predictions by pooling a large number of experts, typically without the benefit of being able to test any given expert’s competence level. Still, under mild assumptions it is possible to efficiently recover the expert competences to a high accuracy and to aggregate them effectively [11]. Error bounds for the oracle MAP rule were obtained in this model by [12] and minimax rates were given in [13].\nA recent line of work [14, 15, 16] has developed a PAC-Bayesian theory for the majority vote of simple classifiers. 
This approach facilitates data-dependent bounds and is even flexible enough to capture some simple dependencies among the classifiers — though, again, the latter are learners as opposed to our experts. Even more recently, experts with adversarial noise have been considered [17], and efficient algorithms for computing optimal expert weights (without error analysis) were given [18]. More directly related to the present work are the papers of [19], which characterizes the consistency of the simple majority rule, and [20, 21, 22], which analyze various models of dependence among the experts.\n\n3 Known competences\n\nIn this section we assume that the expert competences p_i are known and analyze the consistency of the Nitzan-Paroush optimal decision rule (2). Our main result here is that the probability of error P(f^OPT(X) ≠ Y) is small if and only if the committee potential Φ is large.\nTheorem 1. Suppose that the experts X = (X_1, . . . , X_n) satisfy Assumptions (i)-(ii) and f^OPT : {±1}^n → {±1} is the Nitzan-Paroush optimal decision rule. Then\n\n(i) P(f^OPT(X) ≠ Y) ≤ exp(−Φ/2).\n(ii) P(f^OPT(X) ≠ Y) ≥ 3 / (8[1 + exp(2Φ + 4√Φ)]).\n\nAs we show in the full paper [27], the upper and lower bounds are both asymptotically tight. The remainder of this section is devoted to proving Theorem 1.\n\n3.1 Proof of Theorem 1(i)\n\nDefine the {0, 1}-indicator variables\n\nξ_i = 1_{X_i = Y},  (5)\n\ncorresponding to the event that the ith expert is correct. 
A mistake f^OPT(X) ≠ Y occurs precisely when¹ the sum of the correct experts’ weights fails to exceed half the total mass:\n\nP(f^OPT(X) ≠ Y) = P(∑_i w_i ξ_i ≤ (1/2) ∑_i w_i).  (6)\n\nSince E ξ_i = p_i, we may rewrite the probability in (6) as\n\nP(∑_i w_i ξ_i − E[∑_i w_i ξ_i] ≤ −∑_i (p_i − 1/2) w_i).  (7)\n\nA standard tool for estimating such sum deviation probabilities is Hoeffding’s inequality. Applied to (7), it yields the bound\n\nP(f^OPT(X) ≠ Y) ≤ exp( −2[∑_i (p_i − 1/2) w_i]² / ∑_i w_i² ),  (8)\n\nwhich is far too crude for our purposes. Indeed, consider a finite committee of highly competent experts with p_i’s arbitrarily close to 1 and X_1 the most competent of all. Raising X_1’s competence sufficiently far above his peers will cause both the numerator and the denominator in the exponent to be dominated by w_1², making the right-hand side of (8) bounded away from zero. The inability of Hoeffding’s inequality to guarantee consistency even in such a felicitous setting is an instance of its generally poor applicability to highly heterogeneous sums, a phenomenon explored in some depth in [23]. Bernstein’s and Bennett’s inequalities suffer from a similar weakness (see ibid.). Fortunately, an inequality of Kearns and Saul [24] is sufficiently sharp to yield the desired estimate: For all p ∈ [0, 1] and all t ∈ ℝ,\n\n(1 − p)e^{−tp} + p e^{t(1−p)} ≤ exp( (1 − 2p) / (4 log((1 − p)/p)) · t² ).  (9)\n\nRemark. 
The Kearns-Saul inequality (9) may be seen as a distribution-dependent refinement of Hoeffding’s (which bounds the left-hand side of (9) by e^{t²/8}), and is not nearly as straightforward to prove. An elementary rigorous proof is given in [25]. Following up, [26] gave a “soft” proof based on transportation and information-theoretic techniques.\n\nPut θ_i = ξ_i − p_i, substitute into (6), and apply Markov’s inequality:\n\nP(f^OPT(X) ≠ Y) = P(−∑_i w_i θ_i ≥ Φ) ≤ e^{−tΦ} E exp(−t ∑_i w_i θ_i).  (10)\n\nNow\n\nE e^{−t w_i θ_i} = p_i e^{−(1−p_i) w_i t} + (1 − p_i) e^{p_i w_i t} ≤ exp( (2p_i − 1) / (4 log(p_i/(1 − p_i))) · w_i² t² ) = exp( (1/2)(p_i − 1/2) w_i t² ),  (11)\n\nwhere the inequality follows from (9). By independence,\n\nE exp(−t ∑_i w_i θ_i) = ∏_i E e^{−t w_i θ_i} ≤ exp( (t²/2) ∑_i (p_i − 1/2) w_i ) = exp(Φt²/2),\n\nand hence P(f^OPT(X) ≠ Y) ≤ exp(Φt²/2 − Φt). Choosing t = 1 yields the bound in Theorem 1(i).\n\n¹ Without loss of generality, ties are considered to be errors.\n\n3.2 Proof of Theorem 1(ii)\n\nDefine the {±1}-indicator variables\n\nη_i = 2 · 1_{X_i = Y} − 1,  (12)\n\ncorresponding to the event that the ith expert is correct, and put q_i = 1 − p_i. The shorthand w · η = ∑_{i=1}^n w_i η_i will be convenient. We will need some simple lemmata, whose proofs are deferred to the journal version [27].\nLemma 2.\n\nP(f^OPT(X) = Y) = (1/2) ∑_{η ∈ {±1}^n} max{P(η), P(−η)}  and  P(f^OPT(X) ≠ Y) = (1/2) ∑_{η ∈ {±1}^n} min{P(η), P(−η)},\n\nwhere P(η) = ∏_{i : η_i = 1} p_i ∏_{i : η_i = −1} q_i.\nLemma 3. Suppose that s, s′ ∈ (0, ∞)^m satisfy ∑_{i=1}^m (s_i + s′_i) ≥ a and R^{−1} ≤ s_i/s′_i ≤ R, i ∈ [m], for some R < ∞. Then ∑_{i=1}^m min{s_i, s′_i} ≥ a/(1 + R).\nLemma 4. Define the function F : (0, 1) → ℝ by\n\nF(x) = x(1 − x) log(x/(1 − x)) / (2x − 1).\n\nThen sup_{0<x<1} F(x) = 1/2.\nContinuing with the main proof, observe that\n\nE[w · η] = ∑_{i=1}^n (p_i − q_i) w_i = 2Φ  (13)\n\nand Var[w · η] = 4 ∑_{i=1}^n p_i q_i w_i². By Lemma 4, p_i q_i w_i² ≤ (1/2)(p_i − q_i) w_i, and hence\n\nVar[w · η] ≤ 4Φ.  (14)\n\nDefine the segment I ⊂ ℝ by\n\nI = [2Φ − 4√Φ, 2Φ + 4√Φ].  (15)\n\nChebyshev’s inequality together with (13) and (14) implies that\n\nP(w · η ∈ I) ≥ 3/4.  (16)\n\nConsider an atom η ∈ {±1}^n for which w · η ∈ I. The proof of Lemma 2 shows that\n\nP(η)/P(−η) = exp(w · η) ≤ exp(2Φ + 4√Φ),  (17)\n\nwhere the inequality follows from (15). Lemma 2 further implies that\n\nP(f^OPT(X) ≠ Y) ≥ (1/2) ∑_{η : w·η ∈ I} min{P(η), P(−η)} ≥ (1/2) · (3/4)/(1 + exp(2Φ + 4√Φ)) = 3 / (8[1 + exp(2Φ + 4√Φ)]),\n\nwhere the second inequality follows from Lemma 3, (16) and (17). This completes the proof.\n\n4 Unknown competences: frequentist\n\nOur goal in this section is to obtain, insofar as possible, analogues of Theorem 1 for unknown expert competences. 
When the p_i's are unknown, they must be estimated empirically before any useful weighted majority vote can be applied. There are various ways to model partial knowledge of expert competences [28, 29]. Perhaps the simplest scenario for estimating the p_i's is to assume that the ith expert has been queried independently m_i times, out of which he gave the correct prediction k_i times. Taking the {m_i} to be fixed, define the committee profile by k = (k_1, . . . , k_n); this is the aggregate of the agent’s empirical knowledge of the experts’ performance. An empirical decision rule f̂ : (x, k) ↦ {±1} makes a final decision based on the expert inputs x together with the committee profile. Analogously to (1), the probability of a mistake is\n\nP(f̂(X, K) ≠ Y).  (18)\n\nNote that now the committee profile is an additional source of randomness. Here we run into our first difficulty: unlike the probability in (1), which is minimized by the Nitzan-Paroush rule, the agent cannot formulate an optimal decision rule f̂ in advance without knowing the p_i's. This is because no decision rule is optimal uniformly over the range of possible p_i's. Our approach will be to consider weighted majority decision rules of the form\n\nf̂(x, k) = sign(∑_{i=1}^n ŵ(k_i) x_i)  (19)\n\nand to analyze their consistency properties under two different regimes: low-confidence and high-confidence. These refer to the confidence intervals of the frequentist estimate of p_i, given by\n\np̂_i = k_i / m_i.  (20)\n\n4.1 Low-confidence regime\n\nIn the low-confidence regime, the sample sizes m_i may be as small as 1, and we define²\n\nŵ(k_i) = ŵ^LC_i := p̂_i − 1/2,  i ∈ [n],  (21)\n\nwhich induces the empirical decision rule f̂^LC. It remains to analyze f̂^LC’s probability of error. Recall the definition of ξ_i from (5) and observe that\n\nE[ŵ^LC_i ξ_i] = E[(p̂_i − 1/2) ξ_i] = (p_i − 1/2) p_i,  (22)\n\nsince p̂_i and ξ_i are independent. As in (6), the probability of error (18) is\n\nP(∑_{i=1}^n ŵ^LC_i ξ_i ≤ (1/2) ∑_{i=1}^n ŵ^LC_i) = P(∑_{i=1}^n Z_i ≤ 0),  (23)\n\nwhere Z_i = ŵ^LC_i (ξ_i − 1/2). Now the {Z_i} are independent random variables, E Z_i = (p_i − 1/2)² (by (22)), and each Z_i takes values in an interval of length 1/2. Hence, the standard Hoeffding bound applies:\n\nP(f̂^LC(X, K) ≠ Y) ≤ exp( −(8/n) (∑_{i=1}^n (p_i − 1/2)²)² ).  (24)\n\nWe summarize these calculations in\nTheorem 5. A sufficient condition for P(f̂^LC(X, K) ≠ Y) → 0 is (1/√n) ∑_{i=1}^n (p_i − 1/2)² → ∞.\n\n² For m_i min{p_i, q_i} ≪ 1, the estimated competences p̂_i may well take values in {0, 1}, in which case log(p̂_i/q̂_i) = ±∞. The rule in (21) is essentially a first-order Taylor approximation to w(·) about p = 1/2.\n\nSeveral remarks are in order. 
First, notice that the error bound in (24) is stated in terms of the unknown {p_i}, providing the agent with large-committee asymptotics but giving no finitary information; this limitation is inherent in the low-confidence regime. Secondly, the condition in Theorem 5 is considerably more restrictive than the consistency condition Φ → ∞ implicit in Theorem 1. Indeed, the empirical decision rule f̂^LC is incapable of exploiting a single highly competent expert in the way that f^OPT from (2) does. Our analysis could be sharpened somewhat for moderate sample sizes {m_i} by using Bernstein’s inequality to take advantage of the low variance of the p̂_i's. For sufficiently large sample sizes, however, the high-confidence regime (discussed below) begins to take over. Finally, there is one sense in which this case is “easier” to analyze than that of known {p_i}: since the summands in (23) are bounded, Hoeffding’s inequality gives nontrivial results and there is no need for more advanced tools such as the Kearns-Saul inequality (9) (which is actually inapplicable in this case).\n\n4.2 High-confidence regime\n\nIn the high-confidence regime, each estimated competence p̂_i is close to the true value p_i with high probability. To formalize this, fix some 0 < δ < 1, 0 < ε ≤ 5, and put q_i = 1 − p_i, q̂_i = 1 − p̂_i. We will set the empirical weights according to the “plug-in” Nitzan-Paroush rule\n\nŵ^HC_i := log(p̂_i / q̂_i),  i ∈ [n],  (25)\n\nwhich induces the empirical decision rule f̂^HC and raises immediate concerns about ŵ^HC_i = ±∞. We give two kinds of bounds on P(f̂^HC ≠ Y): nonadaptive and adaptive. 
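To make the two estimation regimes concrete, here is a small sketch (our own illustration with invented names, not from the paper) of the empirical competence (20) and of both weighting schemes: the always-finite low-confidence rule (21) and the plug-in rule (25), which blows up when p̂_i ∈ {0, 1}.

```python
import math

def p_hat(k, m):
    # empirical competence, eq. (20): fraction of correct answers in m queries
    return k / m

def w_lc(k, m):
    # low-confidence weight, eq. (21): first-order Taylor approximation of
    # the log-odds w(p) about p = 1/2; finite for any sample, even m = 1
    return p_hat(k, m) - 0.5

def w_hc(k, m):
    # high-confidence "plug-in" Nitzan-Paroush weight, eq. (25);
    # becomes +-infinity when the empirical estimate hits 0 or 1
    p = p_hat(k, m)
    if p == 0.0:
        return -math.inf
    if p == 1.0:
        return math.inf
    return math.log(p / (1 - p))

# an expert answering 9 of 10 queries correctly:
assert abs(w_lc(9, 10) - 0.4) < 1e-12
assert abs(w_hc(9, 10) - math.log(9)) < 1e-12
# a perfect record makes the plug-in weight infinite:
assert w_hc(5, 5) == math.inf
```

The example numbers are ours; the point is only that (21) degrades gracefully at extreme records while (25) does not.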
In the nonadaptive analysis, we show that for m_i min{p_i, q_i} ≫ 1, with high probability |w_i − ŵ^HC_i| ≪ 1, and thus a “perturbed” version of Theorem 1(i) holds (and in particular, ŵ^HC_i will be finite with high probability). In the adaptive analysis, we allow ŵ^HC_i to take on infinite values³ and show (perhaps surprisingly) that this decision rule also asymptotically achieves the rate of Theorem 1(i).\n\nNonadaptive analysis. The following result captures our analysis of the nonadaptive agent:\nTheorem 6. Let 0 < δ < 1 and 0 < ε < min{5, 2Φ/n}. If\n\nm_i min{p_i, q_i} ≥ 3(√(4ε + 1) − 1)^{−2} log(4n/δ),  i ∈ [n],  (26)\n\nthen\n\nP(f̂^HC(X, K) ≠ Y) ≤ δ + exp( −(2Φ − εn)² / (8Φ) ).  (27)\n\nRemark. For fixed {p_i} and min_{i∈[n]} m_i → ∞, we may take δ and ε arbitrarily small — and in this limiting case, the bound of Theorem 1(i) is recovered.\n\n³ When the decision rule is faced with evaluating sums involving ∞ − ∞, we automatically count this as an error.\n\nAdaptive analysis. Theorem 6 has the drawback of being nonadaptive, in that its assumptions (26) and conclusions (27) depend on the unknown {p_i} and hence cannot be evaluated by the agent (the bound in (24) is also nonadaptive⁴). In the adaptive (fully empirical) approach, all results are stated in terms of empirically observed quantities:\nTheorem 7. Choose any⁵ δ ≥ ∑_{i=1}^n 1/√m_i and let R be the event where exp(−(1/2) ∑_{i=1}^n (p̂_i − 1/2) ŵ^HC_i) ≤ δ/2. Then P(R ∩ {f̂^HC(X, K) ≠ Y}) ≤ δ.\n\nRemark 1. Our interpretation for Theorem 7 is as follows. The agent observes the committee profile K, which determines the {p̂_i, ŵ^HC_i}, and then checks whether the event R has occurred. If not, the adaptive agent refrains from making a decision (and may choose to fall back on the low-confidence approach described previously). If R does hold, however, the agent predicts Y according to f̂^HC. Observe that the event R will only occur if the empirical committee potential Φ̂ = ∑_{i=1}^n (p̂_i − 1/2) ŵ^HC_i is sufficiently large — i.e., if enough of the experts’ competences are sufficiently far from 1/2. But if this is not the case, little is lost by refraining from a high-confidence decision and defaulting to a low-confidence one, since near 1/2, the two decision procedures are very similar.\nAs explained above, there does not exist a nontrivial a priori upper bound on P(f̂^HC(X, K) ≠ Y) absent any knowledge of the p_i's. Instead, Theorem 7 bounds the probability of the agent being “fooled” by an unrepresentative committee profile.⁶ Note that we have done nothing to prevent ŵ^HC_i = ±∞, and this may indeed happen. Intuitively, there are two reasons for infinite ŵ^HC_i: (a) noisy p̂_i due to m_i being too small, or (b) the ith expert is actually highly (in)competent, which causes p̂_i ∈ {0, 1} to be likely even for large m_i. 
The 1/√m_i term in the bound insures against case (a), while in case (b), choosing infinite ŵ^HC_i causes no harm (as we show in the proof).\n\nProof of Theorem 7. We will write the probability and expectation operators with subscripts (such as K) to indicate the random variable(s) being summed over. Thus,\n\nP_{K,X,Y}(R ∩ {f̂^HC(X, K) ≠ Y}) = P_{K,η}(R ∩ {ŵ^HC · η ≤ 0}) = E_K[1_R · P_η(ŵ^HC · η ≤ 0 | K)].  (28)\n\nRecall that the random variable η ∈ {±1}^n, with probability mass function P(η) = ∏_{i:η_i=1} p_i ∏_{i:η_i=−1} q_i, is independent of K. Define the random variable η̂ ∈ {±1}^n (conditioned on K) by the probability mass function P(η̂) = ∏_{i:η̂_i=1} p̂_i ∏_{i:η̂_i=−1} q̂_i, and the set A ⊆ {±1}^n by A = {x : ŵ^HC · x ≤ 0}. Now\n\n|P_η(ŵ^HC · η ≤ 0) − P_η̂(ŵ^HC · η̂ ≤ 0)| = |P_η(A) − P_η̂(A)| ≤ max_{A ⊆ {±1}^n} |P_η(A) − P_η̂(A)| = ‖P_η − P_η̂‖_TV ≤ ∑_{i=1}^n |p_i − p̂_i| =: M,\n\nwhere the last inequality follows from a standard tensorization property of the total variation norm ‖·‖_TV, see e.g. [33, Lemma 2.2]. By Theorem 1(i), we have P_η̂(ŵ^HC · η̂ ≤ 0) ≤ exp(−(1/2) ∑_{i=1}^n (p̂_i − 1/2) ŵ^HC_i), and hence\n\nP_η(ŵ^HC · η ≤ 0) ≤ M + exp(−(1/2) ∑_{i=1}^n (p̂_i − 1/2) ŵ^HC_i).\n\nSubstituting the right-hand side above into (28), we obtain\n\nP_{K,X,Y}(R ∩ {f̂^HC(X, K) ≠ Y}) ≤ E_K[1_R · (M + exp(−(1/2) ∑_{i=1}^n (p̂_i − 1/2) ŵ^HC_i))] ≤ E_K[M] + E_K[1_R exp(−(1/2) ∑_{i=1}^n (p̂_i − 1/2) ŵ^HC_i)].\n\nBy the definition of R, the second term on the last right-hand side is upper-bounded by δ/2. To estimate M, we invoke a simple mean absolute deviation bound (cf. [34]):\n\nE_K|p_i − p̂_i| ≤ √(p_i(1 − p_i)/m_i) ≤ 1/(2√m_i),\n\nwhich finishes the proof.\n\n⁴ The term oracle was suggested by a referee for this setting.\n⁵ Actually, as the proof will show, we may take δ to be a smaller value, but with a more complex dependence on {m_i}, which simplifies to 2[1 − (1 − (2√m)^{−1})^n] for m_i ≡ m.\n⁶ These adaptive bounds are similar in spirit to empirical Bernstein methods [30, 31, 32], where the agent’s confidence depends on the empirical variance.\n\nRemark. 
The improvement mentioned in Footnote 5 is achieved via a refinement of the bound ‖P_η − P_η̂‖_TV ≤ ∑_{i=1}^n |p_i − p̂_i| to ‖P_η − P_η̂‖_TV ≤ α({|p_i − p̂_i| : i ∈ [n]}), where α(·) is the function defined in [33, Lemma 4.2].\nOpen problem. As argued in Remark 1, Theorem 7 achieves the optimal asymptotic rate in {p_i}. Can the dependence on {m_i} be improved, perhaps through a better choice of ŵ^HC?\n\n5 Unknown competences: Bayesian\n\nA shortcoming of Theorem 7 is that when condition R fails, the agent is left with no estimate of the error probability. An alternative (and in some sense cleaner) approach to handling unknown expert competences p_i is to assume a known prior distribution over the competence levels p_i. The natural choice of prior for a Bernoulli parameter is the Beta distribution, namely p_i ~ Beta(α_i, β_i) with density p_i^{α_i−1} q_i^{β_i−1} / B(α_i, β_i), where α_i, β_i > 0, q_i = 1 − p_i and B(x, y) = Γ(x)Γ(y)/Γ(x + y). Our full probabilistic model is as follows. Each of the n expert competences p_i is drawn independently from a Beta distribution with known parameters α_i, β_i. Then the ith expert, i ∈ [n], is queried independently m_i times, with k_i correct predictions and m_i − k_i incorrect ones. As before, K = (k_1, . . . , k_n) is the (random) committee profile. Absent direct knowledge of the p_i's, the agent relies on an empirical decision rule f̂ : (x, k) ↦ {±1} to produce a final decision based on the expert inputs x together with the committee profile k. A decision rule f̂^Ba is Bayes-optimal if it minimizes P(f̂(X, K) ≠ Y), which is formally identical to (18) but semantically there is a difference: the former is over the p_i in addition to (X, Y, K). 
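For concreteness, here is a minimal sketch (ours, with invented names) of this Bayesian model: a competence drawn from a Beta prior, m queries, and the posterior log-odds weight log((α_i + k_i)/(β_i + m_i − k_i)) that the next paragraph identifies as the Bayes-optimal choice.

```python
import math
import random

def w_bayes(alpha, beta, k, m):
    # posterior log-odds weight under a Beta(alpha, beta) prior
    # after k correct answers out of m queries
    return math.log((alpha + k) / (beta + m - k))

def simulate_expert(alpha, beta, m, rng):
    # draw a competence p ~ Beta(alpha, beta), then query the expert m times
    p = rng.betavariate(alpha, beta)
    k = sum(rng.random() < p for _ in range(m))
    return p, k

rng = random.Random(0)
p, k = simulate_expert(2.0, 2.0, 50, rng)
w = w_bayes(2.0, 2.0, k, 50)
# unlike the plug-in weight log(p_hat / (1 - p_hat)), this stays finite
# even on a perfect (or uniformly wrong) record:
assert math.isfinite(w) and math.isfinite(w_bayes(2.0, 2.0, 50, 50))
```

With a uniform prior (α = β = 1), for example, w_bayes(1, 1, 9, 10) = log(10/2), the smoothed analogue of the plug-in weight log(9).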
Unlike the frequentist approach, where no optimal empirical decision rule was possible, the Bayesian approach readily admits one: f̂^Ba(x, k) = sign(∑_{i=1}^n ŵ^Ba_i x_i), where\n\nŵ^Ba_i = log( (α_i + k_i) / (β_i + m_i − k_i) ).  (29)\n\nNotice that for 0 < p_i < 1, we have ŵ^Ba_i → w_i as m_i → ∞, almost surely, both in the frequentist and the Bayesian interpretations. Unfortunately, although P(f̂^Ba(X, K) ≠ Y) = P(ŵ^Ba · η ≤ 0) is a deterministic function of {α_i, β_i, m_i}, we are unable to compute it at this point, or even give a non-trivial bound. The main source of difficulty is the coupling between ŵ^Ba and η.\nOpen problem. Give a non-trivial estimate for P(f̂^Ba(X, K) ≠ Y).\n\n6 Discussion\n\nThe classic and seemingly well-understood problem of the consistency of weighted majority votes continues to reveal untapped depth and suggest challenging unresolved questions. We hope that the results and open problems presented here will stimulate future research.\n\nReferences\n[1] J.A.N. de Caritat, marquis de Condorcet. Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. AMS Chelsea Publishing Series. Chelsea Publishing Company, 1785.\n[2] S. Nitzan, J. Paroush. Optimal decision rules in uncertain dichotomous choice situations. International Economic Review, 23(2):289–297, 1982.\n[3] T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2009.\n[4] J. Neyman, E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. Royal Soc. A: Math., Phys. Eng. Sci., 231(694-706):289–337, 1933.\n[5] N. Littlestone, M. K. Warmuth. The weighted majority algorithm. In FOCS, 1989.\n[6] N. Littlestone, M. K. Warmuth. 
The weighted majority algorithm. Inf. Comput., 108(2):212–261, 1994.\n[7] N. Cesa-Bianchi, G. Lugosi. Prediction, Learning, and Games. 2006.\n[8] Y. Freund, R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119–139, 1997.\n[9] R. E. Schapire, Y. Freund. Boosting: Foundations and Algorithms. 2012.\n[10] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979.\n[11] F. Parisi, F. Strino, B. Nadler, Y. Kluger. Ranking and combining multiple predictors without labeled data. Proc. Nat. Acad. Sci., 2014+.\n[12] H. Li, B. Yu, D. Zhou. Error rate bounds in crowdsourcing models. CoRR, abs/1307.2674, 2013.\n[13] C. Gao, D. Zhou. Minimax optimal convergence rates for estimating ground truth from crowdsourced labels (arXiv:1310.5764), 2014.\n[14] A. Lacasse, F. Laviolette, M. Marchand, P. Germain, N. Usunier. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In NIPS, 2006.\n[15] F. Laviolette, M. Marchand. PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers. JMLR, 8:1461–1487, 2007.\n[16] J.-F. Roy, F. Laviolette, M. Marchand. From PAC-Bayes bounds to quadratic programs for majority votes. In ICML, 2011.\n[17] Y. Mansour, A. Rubinstein, M. Tennenholtz. Robust aggregation of experts signals, preprint 2013.\n[18] E. Eban, E. Mezuman, A. Globerson. Discrete Chebyshev classifiers. In ICML (2), 2014.\n[19] D. Berend, J. Paroush. When is Condorcet’s jury theorem valid? Soc. Choice Welfare, 15(4):481–488, 1998.\n[20] P. J. Boland, F. Proschan, Y. L. Tong. Modelling dependence in simple and indirect majority systems. J. Appl. Probab., 26(1):81–88, 1989.\n[21] D. Berend, L. Sapir. 
Monotonicity in Condorcet’s jury theorem with dependent voters. Social Choice and Welfare, 28(3):507–528, 2007.\n[22] D. P. Helmbold and P. M. Long. On the necessity of irrelevant variables. JMLR, 13:2145–2170, 2012.\n[23] D. A. McAllester, L. E. Ortiz. Concentration inequalities for the missing mass and for histogram rule error. JMLR, 4:895–911, 2003.\n[24] M. J. Kearns, L. K. Saul. Large deviation methods for approximate probabilistic inference. In UAI, 1998.\n[25] D. Berend, A. Kontorovich. On the concentration of the missing mass. Electron. Commun. Probab., 18(3):1–7, 2013.\n[26] M. Raginsky, I. Sason. Concentration of measure inequalities in information theory, communications and coding. Foundations and Trends in Communications and Information Theory, 10(1-2):1–247, 2013.\n[27] D. Berend, A. Kontorovich. A finite-sample analysis of the naive Bayes classifier. Preprint, 2014.\n[28] E. Baharad, J. Goldberger, M. Koppel, S. Nitzan. Distilling the wisdom of crowds: weighted aggregation of decisions on multiple issues. Autonomous Agents and Multi-Agent Systems, 22(1):31–42, 2011.\n[29] E. Baharad, J. Goldberger, M. Koppel, S. Nitzan. Beyond Condorcet: optimal aggregation rules using voting records. Theory and Decision, 72(1):113–130, 2012.\n[30] J.-Y. Audibert, R. Munos, C. Szepesvári. Tuning bandit algorithms in stochastic environments. In ALT, 2007.\n[31] V. Mnih, C. Szepesvári, J.-Y. Audibert. Empirical Bernstein stopping. In ICML, 2008.\n[32] A. Maurer, M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, 2009.\n[33] A. Kontorovich. Obtaining measure concentration from Markov contraction. Markov Proc. Rel. Fields, 4:613–638, 2012.\n[34] D. Berend, A. Kontorovich. 
A sharp estimate of the binomial mean absolute deviation with applications. Statistics & Probability Letters, 83(4):1254–1259, 2013.", "award": [], "sourceid": 1801, "authors": [{"given_name": "Daniel", "family_name": "Berend", "institution": "Ben Gurion University"}, {"given_name": "Aryeh", "family_name": "Kontorovich", "institution": "Ben Gurion University"}]}