Streaming Bayesian Inference for Crowdsourced Classification
Advances in Neural Information Processing Systems, pages 12782-12792.

Edoardo Manino
University of Southampton
E.Manino@soton.ac.uk

Long Tran-Thanh
University of Southampton
l.tran-thanh@soton.ac.uk

Nicholas R. Jennings
Imperial College, London
n.jennings@imperial.ac.uk

Abstract

A key challenge in crowdsourcing is inferring the ground truth from noisy and unreliable data. To do so, existing approaches rely on collecting redundant information from the crowd, and aggregating it with some probabilistic method. However, oftentimes such methods are computationally inefficient, are restricted to some specific settings, or lack theoretical guarantees. In this paper, we revisit the problem of binary classification from crowdsourced data. 
Specifically, we propose Streaming Bayesian Inference for Crowdsourcing (SBIC), a new algorithm that does not suffer from any of these limitations. First, SBIC has low complexity and can be used in a real-time online setting. Second, SBIC has the same accuracy as the best state-of-the-art algorithms in all settings. Third, SBIC has provable asymptotic guarantees both in the online and offline settings.

1 Introduction

Crowdsourcing works by collecting the annotations of large groups of human workers, typically through an online platform like Amazon Mechanical Turk1 or Figure Eight.2 On one hand, this paradigm can help process high volumes of small tasks that are currently difficult to automate at an affordable price [Snow et al., 2008]. On the other hand, the open nature of the crowdsourcing process gives no guarantees on the quality of the data we collect. Leaving aside malicious attempts at thwarting the result of the crowdsourcing process [Downs et al., 2010], even well-intentioned crowdworkers can report incorrect answers [Ipeirotis et al., 2010].

Thus, the success of a crowdsourcing project relies on our ability to reconstruct the ground-truth from the noisy data we collect. This challenge has attracted the attention of the research community, which has explored a number of algorithmic solutions. Some authors focus on probabilistic inference on graphical models, including the early work of Dawid and Skene [1979] on EM estimation, variational inference [Welinder and Perona, 2010; Liu et al., 2012] and belief propagation [Karger et al., 2014]. These techniques are stable in most settings and easy to generalise to more complex models (e.g. [Kim and Ghahramani, 2012]), but generally require several passes over the entire dataset to converge and lack theoretical guarantees. 
In contrast, other authors have turned to tensor factorisation [Dalvi et al., 2013; Zhang et al., 2016] and the method of moments [Bonald and Combes, 2017]. This choice yields algorithms with tractable theoretical behaviour, but the assumptions required to do so restrict them to a limited number of settings.

1 www.mturk.com
2 www.figure-eight.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

At the same time, there have been several calls to focus on how we sample the data from the crowd, rather than how we aggregate it [Welinder and Perona, 2010; Barowy et al., 2012; Simpson and Roberts, 2014; Manino et al., 2018]. All of these end up recommending some form of adaptive strategy, which samples more data on the tasks where the crowd is disagreeing the most. Employing one of these strategies improves the final accuracy of the crowdsourcing system, but requires the ability to work in an online setting. Thus, in order to perform crowdsourcing effectively, our algorithms must be computationally efficient.

In this paper, we address these research challenges on the problem of binary classification from crowdsourced data, and make the following contributions to the state of the art:

• We introduce Streaming Bayesian Inference for Crowdsourcing (SBIC), a new algorithm based on approximate variational Bayes. This algorithm comes in two variants.
• The first, Fast SBIC, has similar computational complexity to the quick majority rule, but delivers more than an order of magnitude higher predictive accuracy.
• The second, Sorted SBIC, is more computationally intensive, but delivers state-of-the-art predictive accuracy in all settings.
• We quantify the asymptotic performance of SBIC in both the offline and online settings analytically. 
Our theoretical bounds closely match the empirical performance of SBIC.

The paper is structured in the following way. In Section 2 we introduce the most popular model of crowdsourced classification and the existing aggregation methods. In Section 3 we present the SBIC algorithm in its two variants. In Section 4 we compute its asymptotic accuracy. In Section 5 we compare its performance with the state of the art on both synthetic and real-world datasets. In Section 6 we conclude and outline possible future work.

2 Preliminaries

Existing works in crowdsourced classification are mostly built around the celebrated Dawid-Skene model [Dawid and Skene, 1979]. In this paper we adopt its binary, or one-coin, variant, which has received considerable attention from the crowdsourcing community [Liu et al., 2012; Karger et al., 2014; Bonald and Combes, 2017; Manino et al., 2018]. The reason for this is that it allows us to study the fundamental properties of the crowdsourcing process without dealing with the peculiarities of more complex scenarios. Furthermore, generalising to the multi-class case is usually straightforward (e.g. [Gao et al., 2016]).

2.1 The one-coin Dawid-Skene model

According to this model, the objective is to infer the binary ground-truth class yi ∈ {±1} of a set of tasks M, with i ∈ M. To do so, we can interact with the crowd of workers N, and ask them to submit a set of labels X = {xij}, where j ∈ N is the worker's index. We have no control on the availability of the workers, and we assume that we interact with them in sequential fashion. Thus, at each time step t a single worker j = a(t) becomes available, gets assigned to a task i and provides the label xij = ±1 in exchange for a unitary payment. We assume that we can collect an average of R ≤ |N| labels per task, for a total budget of T = R|M| labels. 
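To make this data-collection process concrete, the following sketch simulates the one-coin Dawid-Skene generative model. The worker accuracies pj (introduced formally below) are drawn from a Beta prior; the parameter values, the random seed and all function names are illustrative assumptions of ours, not part of the paper.

```python
# A minimal sketch of the one-coin Dawid-Skene generative model: fixed
# ground-truth classes y_i in {-1, +1}, worker accuracies p_j drawn from a
# Beta(alpha, beta) prior, and noisy labels x_ij that agree with y_i with
# probability p_j. All concrete parameter values are illustrative.
import random

def simulate_one_coin(num_tasks, num_workers, labels_per_worker,
                      alpha=4.0, beta=3.0, q=0.5, seed=0):
    rng = random.Random(seed)
    # Ground-truth classes y_i, each +1 with prior probability q.
    y = [1 if rng.random() < q else -1 for _ in range(num_tasks)]
    # Each worker has a fixed accuracy p_j ~ Beta(alpha, beta).
    p = [rng.betavariate(alpha, beta) for _ in range(num_workers)]
    labels = {}  # (task i, worker j) -> observed label x_ij
    for j in range(num_workers):
        # Each worker labels a few distinct tasks (no task labelled twice).
        tasks = rng.sample(range(num_tasks), min(labels_per_worker, num_tasks))
        for i in tasks:
            correct = rng.random() < p[j]
            labels[(i, j)] = y[i] if correct else -y[i]
    return y, p, labels

y, p, labels = simulate_one_coin(num_tasks=5, num_workers=10, labels_per_worker=3)
```

Data generated this way is what the sampling policies and aggregation methods below operate on.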
With slight abuse of notation, we set xij = 0 for any missing task-worker pairs, so that we can treat X as a matrix when needed. On a similar note, we use Mj to denote the set of tasks executed by worker j, and Ni for the set of workers on task i. Furthermore, we use the superscript t (e.g. X^t) to denote the information visible up to time t.

A key feature of the one-coin Dawid-Skene model is that each worker has a fixed probability P(xij = yi) = pj of submitting a correct label. That means that the workers behave like independent random variables (conditioned on the ground-truth yi), and their accuracy pj remains stable over time and across different tasks.

2.2 Sampling the data

When interacting with the crowd, we need to decide which tasks to allocate the incoming workers to. The sampling policy π we use to make these allocations has a considerable impact on the final accuracy of our predictions, as demonstrated by Manino et al. [2018]. The existing literature provides us with the following two main options.

Uniform Sampling (UNI). This policy allocates the same number of workers |Ni| ≈ T/|M| to each task i (rounded to the closest integer). The existing literature does not usually specify how this policy is implemented in practice (e.g. [Karger et al., 2014; Manino et al., 2018]). In this paper we assume a round-robin implementation, where we ensure that no worker is asked to label the same task twice:

    π_uni(t) = argmin_{i ∉ M^t_{a(t)}} { |N^t_i| }    (1)

where M^t_{a(t)} is the set of tasks labelled by the currently available worker j = a(t) so far.

Uncertainty Sampling (US). A number of policies proposed in the literature are adaptive, in that they base their decisions on the data collected up to time t [Welinder et al., 2010; Barowy et al., 2012; Simpson and Roberts, 2014]. 
In this paper we focus on the most common of them, which consists of greedily choosing the task with the largest uncertainty at each time step t. More formally, assume that we have a way to estimate the posterior probability on the ground-truth y given the current data X^t. Then, we can select the task to label as follows:

    π_us(t) = argmin_{i ∉ M^t_{a(t)}} { max_{ℓ ∈ {±1}} P(yi = ℓ | X^t) }    (2)

Compared to uniform sampling, this second policy is provably better [Manino et al., 2018]. However, it can only be implemented in an online setting, when we have estimates of the posterior on y at every t. Producing such estimates in real time is an open challenge. Current approaches are based on simple heuristics like the majority voting rule [Barowy et al., 2012].

We study the theoretical and empirical performance of SBIC under these two policies in Sections 4 and 5 respectively.

2.3 Aggregating the data

Given a (partial) dataset X^t as input, there exist several methods in the literature to form a prediction ŷ over the ground-truth classes y of the tasks. The simplest is the aforementioned majority voting rule (MAJ), which forms its predictions as ŷi = sign{Σ_{j ∈ Ni} xij}, where ties are broken at random. Alternatively, we can resort to Bayesian methods, which infer the value of the latent variables y and p by estimating their posterior probability P(y, p | X, θ) given the observed data X and prior θ. In this regard, Liu et al. [2012] propose an approximate variational mean-field algorithm (AMF) and show its similarity to the original expectation-maximisation (EM) algorithm of Dawid and Skene [1979]. Conversely, Karger et al. [2014] propose a belief-propagation algorithm (KOS) on a spammer-hammer prior, and show its connection to matrix factorisation. 
Both these algorithms require several iterations on the whole dataset X to converge to their final predictions. As another option, we can directly estimate the value of the posterior by Monte Carlo sampling (MC) [Kim and Ghahramani, 2012], even though this is usually more expensive computationally than the former two techniques.

Finally, there have been attempts at applying the frequentist approach to crowdsourcing [Dalvi et al., 2013; Zhang et al., 2016; Bonald and Combes, 2017]. The resulting algorithms have tractable theoretical properties, but put strong constraints on the rank and sparsity of the task-worker matrix X, which limit their range of applicability. For completeness, we include in our experiments of Section 5 the Triangular Estimation algorithm (TE) recently proposed in [Bonald and Combes, 2017].

3 The SBIC algorithm

In this section we introduce Streaming Bayesian Inference for Crowdsourcing (SBIC) and discuss the ideas behind it. Then, we present two variants of this method, which we call Fast SBIC and Sorted SBIC. These prioritise two different goals: namely, computational speed and predictive accuracy.

Figure 1: A graphical representation of the SBIC algorithm.

The overarching goal in Bayesian inference is estimating the posterior probability P(y, p | X^t, θ) on the latent variables y and p given the data we observed so far X^t and the prior θ. With this piece of information, we can form our current predictions ŷ^t on the task classes by looking at the marginal probability over each yi as follows:

    ŷ^t_i = argmax_{ℓ ∈ {±1}} { P(yi = ℓ | X^t, θ) }    (3)

Unfortunately the marginal in Equation 3 is computationally intractable in general. In fact, just summing P(y, p | X^t, θ) over all vectors y that contain a specific yi has exponential time complexity in |M|. 
To overcome this issue, we turn to a mean-field variational approximation, as done before in [Liu et al., 2012; Kim and Ghahramani, 2012]. This allows us to factorise the posterior as follows:

    P(y, p | X^t, θ) ≈ Π_{i ∈ M} μ^t_i(yi) · Π_{j ∈ N} ν^t_j(pj)    (4)

where the factors μ^t_i correspond to each task i and the factors ν^t_j to each worker j.

Our work diverges from the standard variational mean-field paradigm [Murphy, 2012] in that we use a novel method to optimise the factors μ^t and ν^t. Previous work minimises the Kullback-Leibler (KL) divergence between the two sides of Equation 4 by running an expensive coordinate descent algorithm with multiple passes over the whole dataset X^t [Liu et al., 2012; Kim and Ghahramani, 2012]. Instead, we aim at achieving a similar result by taking a single optimisation step after observing each new data point. This yields quicker updates to μ^t and ν^t, thus allowing us to run our algorithm online.

More specifically, the core ideas of the SBIC algorithm are the following. First, assume that the prior on the worker accuracy is pj ∼ Beta(α, β). This assumption is standard in Bayesian statistics, since the Beta distribution is the conjugate prior of a Bernoulli-distributed random variable [Murphy, 2012]. Second, initialise the task factors μ^0 to their respective prior P(yi = +1) = q, that is μ^0_i(+1) = q and μ^0_i(−1) = 1 − q for all i ∈ M.³ Then, upon observing a new label at time t, update the factor ν^t_j corresponding to the currently available worker j = a(t) only. 
Thanks to the properties of the KL divergence, ν^t_j is still Beta-distributed:

    ν^t_j(pj) ∼ Beta( Σ_{i ∈ M^{t−1}_j} μ^{t−1}_i(xij) + α,  Σ_{i ∈ M^{t−1}_j} μ^{t−1}_i(−xij) + β )    (5)

where M^{t−1}_j is the set of tasks labelled by worker j up to time t − 1. Next, we update the factor μ_i corresponding to the task we observed the new label xij on:

    μ^t_i(yi) ∝ μ^{t−1}_i(yi) · p̄^t_j          if xij = yi
    μ^t_i(yi) ∝ μ^{t−1}_i(yi) · (1 − p̄^t_j)    if xij ≠ yi

    where  p̄^t_j = ( Σ_{i ∈ M^{t−1}_j} μ^{t−1}_i(xij) + α ) / ( |M^{t−1}_j| + α + β )    (6)

Finally, we can inspect the factors μ^t and form our predictions on the task classes as ŷ^t_i = argmax_{ℓ ∈ {±1}} { μ^t_i(ℓ) }. Note that we set p̄^t_j = E_{pj}{ν^t_j} in Equation 6. An exact optimisation step would require p̄^t_j = exp( E_{pj}{log(ν^t_j)} ) instead. However, the first-order approximation we use has a negligible impact on the accuracy of the inference, as demonstrated in [Liu et al., 2012].

We summarise the high-level behaviour of the SBIC algorithm in the explanatory sketch of Figure 1. There, it is easy to see that SBIC falls under the umbrella of the Streaming Variational Bayes framework [Broderick et al., 2013]: in fact, at each time step t we trust our current approximations μ^t and ν^t to be close to the exact posterior, and we use their values to inform the next local updates. From another point of view, SBIC is a form of constrained variational inference, where the constraints are implicit in the local steps we make in Equations 5 and 6, as opposed to an explicit alteration of the KL objective. Finally, the sequential nature of the SBIC algorithm means that its output is deeply influenced by the order in which we process the dataset in X. By altering its ordering, we can optimise SBIC for different applications, as we show in the next two Sections 3.1 and 3.2.

³ Exact knowledge of α, β and q is not necessary in practice. See Section 5 for examples.

Algorithm 1 Fast SBIC
Input: dataset X, availability a, policy π, prior θ
Output: final predictions ŷ^T
1: z^0_i = log(q/(1 − q)), ∀i ∈ M
2: for t = 1 to T do
3:   i ← π(t)
4:   j ← a(t)
5:   p̄^t_j ← ( Σ_{h ∈ M^{t−1}_j} sig(x_hj · z^{t−1}_h) + α ) / ( |M^{t−1}_j| + α + β )
6:   z^t_i ← z^{t−1}_i + x_ij · log( p̄^t_j / (1 − p̄^t_j) )
7:   z^t_{i′} ← z^{t−1}_{i′}, ∀i′ ≠ i
8: return ŷ^T_i = sign(z^T_i), ∀i ∈ M

3.1 Fast SBIC

Recall that crowdsourcing benefits from an online approach, since it allows the deployment of an adaptive sampling strategy which can greatly improve the predictive accuracy (see Section 2.2). Thus, our main goal here is computational speed, which we achieve by keeping the natural ordering of the set X unaltered.

We call the resulting algorithm Fast SBIC, and show its pseudocode in Algorithm 1. There, we use the following computational tricks. First, we express the value of each factor μ^t_i in terms of its log-odds. 
Accordingly, Equation 6 becomes:

    z^t_i = log( μ^t_i(+1) / μ^t_i(−1) ) = z^{t−1}_i + x_ij · log( p̄^t_j / (1 − p̄^t_j) ),   where  z^0_i = log( q / (1 − q) )    (7)

This has both the advantage of converting the chain of products into a summation, and removing the need of normalising the factors μ^t_i. Second, we can use the current log-odds z^t to compute the worker accuracy estimate as follows:

    p̄^t_j = ( Σ_{i ∈ M^{t−1}_j} sig(x_ij · z^{t−1}_i) + α ) / ( |M^{t−1}_j| + α + β ),   where  sig(z^{t−1}_i) ≡ 1 / (1 + exp(−z^{t−1}_i)) = μ^{t−1}_i(+1)    (8)

Thanks to the additive nature of Equation 7, we can quickly update the log-odds z^t as we observe new labels. More in detail, in Line 1 of Algorithm 1 we set z^0_i to its prior value. Then, for every new label x_ij, we estimate the mean accuracy of worker j given the current value of z^{t−1} (see Line 5), and add its contribution to the log-odds on task i (see Line 6). In the end (Line 8), we compute the final predictions by selecting the maximum-a-posteriori class ŷ^T_i = sign(z^T_i).

This algorithm runs in O(T L) time, where L = max_j(|Mj|) is the maximum number of labels per worker. This makes it particularly efficient in an online setting, e.g. under an adaptive collection strategy, since it takes only O(L) operations to update its estimates after observing a new label. 
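The update loop described above can be sketched in a few lines of Python. This is our illustrative reading of Algorithm 1, not the authors' reference implementation; the stream format and all names are assumptions, and we use the generic prior α = 2, β = 1 from Section 5 as the default.

```python
# A minimal sketch of Fast SBIC: labels arrive as a stream of
# (task i, worker j, label x_ij) triples. We keep one log-odds value per
# task (Eq. 7) and estimate each worker's accuracy from the labels it gave
# so far (Eq. 8). Names and defaults are illustrative assumptions.
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def fast_sbic(stream, num_tasks, alpha=2.0, beta=1.0, q=0.5):
    z = [math.log(q / (1.0 - q))] * num_tasks  # log-odds, one per task
    history = {}                               # worker j -> list of (h, x_hj)
    for i, j, x in stream:
        past = history.setdefault(j, [])
        # Worker accuracy estimate from the worker's past labels (Eq. 8).
        pbar = (sum(sig(xh * z[h]) for h, xh in past) + alpha) \
               / (len(past) + alpha + beta)
        # Additive log-odds update for the current task (Eq. 7).
        z[i] += x * math.log(pbar / (1.0 - pbar))
        past.append((i, x))
    return [1 if zi >= 0 else -1 for zi in z], z
```

For instance, on the stream [(0, 0, +1), (0, 1, +1), (1, 0, -1)] the sketch predicts +1 for task 0 and -1 for task 1, since both workers agree on task 0 and the (now more trusted) first worker votes -1 on task 1.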
In Section 5 we show that its computational speed is on par with the simple majority voting scheme.

3.2 Sorted SBIC

In an offline setting, or when more computational resources are available, we have the opportunity of trading off some of the computational speed of Fast SBIC in exchange for better predictive accuracy. We can do so by running multiple copies of the algorithm in parallel, and presenting them the labels in X in different orders. We show the implementation of this idea in Algorithm 2, which we call Sorted SBIC.

The intuition behind the algorithm is the following. When running Fast SBIC, the estimates μ̂^t and ν̂^t are very close to their prior in the first rounds. As time passes, two things change. First, we have more information, since we observe more data points. Second, we run more updates on each factor μ^t_i and ν^t_j. 

Algorithm 2 Sorted SBIC
Input: dataset X, availability a, policy π, prior θ
Output: final predictions ŷ^T
1: s^k_i = log(q/(1 − q)), ∀i ∈ M, ∀k ∈ M
2: for t = 1 to T do
3:   i ← π(t)
4:   j ← a(t)
5:   for all k ∈ M : k ≠ i do
6:     p̄^k_j ← ( Σ_{h ∈ M^t_j\k} sig(x_hj · s^k_h) + α ) / ( |M^t_j\k| + α + β )
7:     s^k_i ← s^k_i + x_ij · log( p̄^k_j / (1 − p̄^k_j) )
8: z^t_i = log(q/(1 − q)), ∀i ∈ M
9: for u = 1 to t do
10:   i ← π(u)
11:   j ← a(u)
12:   p̄^i_j ← ( Σ_{h ∈ M^t_j\i} sig(x_hj · s^i_h) + α ) / ( |M^t_j\i| + α + β )
13:   z^t_i ← z^t_i + x_ij · log( p̄^i_j / (1 − p̄^i_j) )
14: return ŷ^T_i = sign(z^T_i), ∀i ∈ M
Because of these, the estimates μ̂^t and ν̂^t become closer and closer to their ground-truth values. As a result, we get more accurate predictions on a specific task i when the corresponding subset of labels is processed towards the end of the collection process (t ≈ T), rather than the beginning (t ≈ 0).

We exploit this property in Sorted SBIC by keeping a separate view of the log-odds s^k for each task k ∈ M (see Line 1). Then, every time we observe a new label x_ij we update the views for all tasks k except the one we observed the label on (see Lines 5-7). We skip it because we want to process the corresponding label x_ij at the end. Note that in Line 6 we compute a different estimate p̄^k_j for each task k ≠ i. This is because we are implicitly running |M| copies of Fast SBIC, and each copy can only see the corresponding information stored in s^k.

Finally, we need to process all the labels we skipped. If we are running Sorted SBIC offline, we only need to do so once at the end of the collection process. Conversely, in an online setting we need to repeat the same procedure at each time step t. Lines 8-13 contain the corresponding pseudocode. Notice how we compute the estimates p̄^i_j by looking at all the tasks M^t_j labelled by worker j except for task i itself. This is because we skipped the corresponding label x_ij in the past, and we are processing it right now.

The implementation of Sorted SBIC presented in Algorithm 2 runs in O(|M| T L) time, which is a factor |M| slower than Fast SBIC, since we are running |M| copies of it in parallel. By sharing the views s^k across different tasks, we can reduce the complexity to O(log(|M|) T L). However, this is only possible if the algorithm is run in an offline setting, where the whole dataset X is known in advance. This additional time complexity comes with improved predictive accuracy. 
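The per-task-view idea can be sketched as follows. This is our simplified, naive O(|M|·T·L) reading of Algorithm 2 (in the online pass we estimate each worker from its strictly past labels, a minor simplification); stream format, defaults and names are illustrative assumptions.

```python
# A minimal sketch of Sorted SBIC: one view s[k] of the log-odds per task k.
# Online pass: every label updates all views except the one of its own task.
# Final pass: each task's own labels are replayed on its own (matured) view.
# This is a naive O(|M| * T * L) sketch; names are illustrative.
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def sorted_sbic(stream, num_tasks, alpha=2.0, beta=1.0, q=0.5):
    z0 = math.log(q / (1.0 - q))
    s = [[z0] * num_tasks for _ in range(num_tasks)]  # s[k][h]: view k, task h
    by_worker = {}                                    # j -> list of (h, x_hj)
    for i, j, x in stream:
        past = by_worker.setdefault(j, [])
        for k in range(num_tasks):
            if k == i:
                continue  # skip the view of the task we just observed
            hist = [(h, xh) for h, xh in past if h != k]
            pbar = (sum(sig(xh * s[k][h]) for h, xh in hist) + alpha) \
                   / (len(hist) + alpha + beta)
            s[k][i] += x * math.log(pbar / (1.0 - pbar))
        past.append((i, x))
    # Final pass: replay the skipped labels, task by task, on their own view.
    z = [z0] * num_tasks
    for i, j, x in stream:
        hist = [(h, xh) for h, xh in by_worker[j] if h != i]
        pbar = (sum(sig(xh * s[i][h]) for h, xh in hist) + alpha) \
               / (len(hist) + alpha + beta)
        z[i] += x * math.log(pbar / (1.0 - pbar))
    return [1 if zi >= 0 else -1 for zi in z]
```

On the small stream [(0, 0, +1), (0, 1, +1), (1, 0, -1), (1, 1, -1)] the sketch agrees with the unanimous crowd and predicts +1 for task 0 and -1 for task 1.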
In Sections 4 and 5 we quantify such improvement both theoretically and empirically.

4 Theoretical analysis

In this section we study the predictive performance of SBIC from the theoretical perspective. As is the norm in the crowdsourcing literature, we establish an exponential relationship between the probability of an error and the average number of labels per task R = T/|M| in the form P(ŷi ≠ yi) ≤ exp(−cR + o(1)). Computing the constant c is not trivial, as its value depends not only on the properties of the crowd and the aggregation algorithm, but also the collection policy π we use (see Section 2.2). In this regard, previous results are either very conservative [Karger et al., 2014; Manino et al., 2018], or assume a large number of labels per worker so that the estimates of p are close to their ground-truth value [Gao et al., 2016].

Here, we take a different approach and provide exponential bounds that are both close to the empirical performance of SBIC, and valid for any number of labels per worker. We achieve this by focusing on the asymptotic case, where we assume that the predictions of SBIC are converging to the ground-truth after observing a large enough number of labels. More formally:

Definition 1. For any small ε > 0, define t′ as the minimum size of the dataset X, such that μ^t_i(yi) ≥ 1 − ε for any task i ∈ M with high probability.

For any larger dataset, when t ≥ t′ the term μ^t_i(xij) is very close to the indicator I(xij = yi). As a consequence, we can replace the worker accuracy estimates in Equation 6 with p̄^t_j = (k^t_j + α)/(|M^t_j| + α + β), where k^t_j is the number of correct answers. With this in mind, we can establish the following bound on the performance of SBIC under the UNI policy:

Theorem 1. For a crowd of workers with accuracy pj ∼ Beta(α, β), L labels per worker, R labels per task, the probability of an error under the UNI policy is bounded by:

    P(ŷi ≠ yi) ≤ exp( −R · log(1/F(L, α, β)) + o(1) ),   for all i ∈ M    (9)

where F(L, α, β) depends on the variant of SBIC we use. For Sorted SBIC we have:

    F_sorted(L, α, β) = Σ_{k=0}^{L̄} P(k | L̄, α, β) · 2 · sqrt( ((k + α)/(L̄ + α + β)) · ((L̄ − k + β)/(L̄ + α + β)) )    (10)

where L̄ = L − 1, and the probability of observing k is:

    P(k | L̄, α, β) = C(L̄, k) · B(k + α, L̄ − k + β) / B(α, β)    (11)

For Fast SBIC we have instead:

    F_fast(L, α, β) = (1/L) · Σ_{h=1}^{L} F_sorted(h, α, β)    (12)

For reasons of space, we only present the intuition behind Theorem 1 here (the full proof is in Appendix A). First, P(k | L̄, α, β) is the probability of observing a worker with accuracy pj ∼ Beta(α, β) produce k correct labels over a total of L̄ labels. Second, the square root term converges to the corresponding term 2·sqrt(pj(1 − pj)) in [Gao et al., 2016] when the estimates p̄^t_j become close to their ground-truth value pj. Finally, the constant F_fast is averaged over L̄ ∈ [0, L − 1] as this is the number of past labels we use to form each worker's estimate p̄^t_j during the execution of Fast SBIC.

Similarly, for the US policy we have the following theorem:

Theorem 2. For a crowd of workers with accuracy pj ∼ Beta(α, β), L labels per worker, an average of R labels per task, and |M| → ∞, the probability of an error under the US policy is bounded by:

    P(ŷi ≠ yi) ≤ exp( −R · G(L, α, β) + o(1) ),   for all i ∈ M    (13)

where G(L, α, β) depends on the variant of SBIC we use. For Sorted SBIC we have:

    G_sorted(L, α, β) = Σ_{k=0}^{L̄} P(k | L̄, α, β) · log( (k + α)/(L̄ − k + β) ) · ( (k + α) − (L̄ − k + β) ) / (L̄ + α + β)    (14)

For Fast SBIC we have instead:

    G_fast(L, α, β) = (1/L) · Σ_{h=1}^{L} G_sorted(h, α, β)    (15)

A full proof of Theorem 2 is in Appendix A. Here, note that the logarithm term corresponds to the log-odds of a worker with accuracy p̄^t_j, and the right-most term is the expected value of a new label xij provided by said worker.

In practice, both variants of SBIC reach the asymptotic regime described in Definition 1 for fairly small values of R. As an example, in Figure 2 we compare our theoretical results with the empirical performance of SBIC on synthetic data. There, we can see how the slope we predict in Theorems 1 and 2 closely matches the empirical decay in prediction error of SBIC. This is in contrast with the corresponding state-of-the-art results in [Manino et al., 2018], which apply to any state-of-the-art probabilistic inference algorithm (i.e. not MAJ) but are significantly more conservative.

5 Empirical analysis

In this section we compare the empirical performance of SBIC with the state-of-the-art algorithms listed in Section 2.3. Our analysis includes synthetic data, real-world data and a discussion on time complexity. 
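Before turning to the experiments, the bound constants of Theorems 1 and 2 can be evaluated numerically. The sketch below follows our reading of Equations 10-15 as recovered from the extracted text (so treat it as illustrative, not authoritative); the Beta function is computed via log-gamma, and the Beta-binomial probability of Equation 11 appears as p_k.

```python
# A numerical sketch of the bound constants F and G (our reading of
# Eqs. 10-15). The Beta function is evaluated through log-gamma for
# numerical stability. All function names are ours.
import math

def beta_fn(a, b):
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def p_k(k, lbar, alpha, beta):
    # Beta-binomial probability of k correct labels out of lbar (Eq. 11).
    return math.comb(lbar, k) * beta_fn(k + alpha, lbar - k + beta) / beta_fn(alpha, beta)

def f_sorted(L, alpha, beta):
    lbar = L - 1
    s = lbar + alpha + beta
    return sum(p_k(k, lbar, alpha, beta)
               * 2.0 * math.sqrt((k + alpha) / s * (lbar - k + beta) / s)
               for k in range(lbar + 1))

def g_sorted(L, alpha, beta):
    lbar = L - 1
    s = lbar + alpha + beta
    return sum(p_k(k, lbar, alpha, beta)
               * math.log((k + alpha) / (lbar - k + beta))
               * ((k + alpha) - (lbar - k + beta)) / s
               for k in range(lbar + 1))

def f_fast(L, alpha, beta):
    return sum(f_sorted(h, alpha, beta) for h in range(1, L + 1)) / L

def g_fast(L, alpha, beta):
    return sum(g_sorted(h, alpha, beta) for h in range(1, L + 1)) / L
```

For a crowd biased towards the correct answer (mean accuracy above 1/2), f_sorted stays strictly below 1 and g_sorted stays positive, so both bounds decay exponentially in R.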
For reasons of space, we report the details of the algorithm implementations and experiment parameters in Appendix B.

(a) UNI sampling policy  (b) US sampling policy

Figure 2: Prediction error on synthetic data with pj ∼ Beta(α, β), q = 0.5 and L = 10. The accuracy guarantees for SBIC are represented by a dotted line in the corresponding colour.

Synthetic data. First, we run the algorithms on synthetic data. With this choice we can make sure that the assumptions of the underlying one-coin Dawid-Skene model are met. In turn, this allows us to compare the empirical performance of SBIC with the theoretical results in Section 4.

To do so, we extract workers from a distribution pj ∼ Beta(4, 3), representing a non-uniform population with large variance. Crucially, the mean of this distribution is above 1/2, thus ensuring that the crowd is biased towards the correct answer. Additionally, we set the number of tasks to |M| = 1000 and the number of labels per worker to L = 10. This represents a medium-sized crowdsourcing project with a high worker turnout. Finally, we run EM, AMF, MC and SBIC with parameters α and β matching the distribution of pj. Conversely, MAJ and KOS do not require any extra parameter. We omit the results for TE since in this setting the task-worker matrix X is too sparse for the algorithm to produce non-random predictions.

In Figures 2a and 2b we show the results obtained under the UNI and US sampling policies respectively. For reference, we also plot the bounds of Theorems 1 and 2 up to an arbitrary o(1) constant (see Section 4 for the related discussion). As expected, the performance of all algorithms under the US policy greatly improves with respect to the UNI policy. Also, notice how MAJ is consistently outperformed by the other algorithms in this setting (this is not the case on real-world data, as we show below). 
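The uncertainty-sampling policy used in these experiments reduces, in log-odds form, to picking the task whose posterior is closest to 1/2. A minimal sketch of this selection rule (Equation 2), with illustrative names of our own:

```python
# A minimal sketch of the uncertainty-sampling rule (Eq. 2) in log-odds
# form: among the tasks the available worker has not labelled yet, pick
# the one whose posterior is closest to 1/2, i.e. whose log-odds are
# smallest in absolute value. Names are illustrative.
def uncertainty_sampling(z, labelled_by_worker):
    """z: current log-odds per task; labelled_by_worker: set of task ids
    already labelled by the currently available worker."""
    candidates = [i for i in range(len(z)) if i not in labelled_by_worker]
    # max over l of P(y_i = l | X^t) is monotone in |z_i|, so minimising
    # it amounts to minimising the absolute log-odds.
    return min(candidates, key=lambda i: abs(z[i]))
```

Paired with the O(L) update of Fast SBIC, this makes the adaptive policy cheap enough to run in real time.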
Additionally, both variants of SBIC perform well, with Sorted SBIC achieving\nstate-of-the-art performance under the UNI policy and matching the computationally-expensive MC\nalgorithm under the US policy. Interestingly, Fast SBIC is asymptotically competitive as well, but\nsuffers from an almost constant performance gap (in logarithmic scale). Finally, both EM and AMF\ntend to lose their competitiveness as the number of labels per task R increases. This is due to their\ninability to form unbiased estimates of the workers\u2019 accuracy with few labels per worker. Under\nthe US policy this may lead to poor sampling behaviour, which explains the lack of improvement in\npredictive accuracy for R > 40 in Figure 2b.\n\nTime complexity. As we show in our experiments on synthetic data, all algorithms bene\ufb01t from an\nadaptive sampling strategy. However, in order to deploy such policy we need to be able to update\nour estimates in real time, and only the MAJ and Fast SBIC algorithms are capable of that. To prove\nthis point, we measure the average time the algorithms take to complete the simulations presented in\nFigure 2b, i.e. when used in conjunction with the US policy. We plot the results in Figure 3. Note\nhow Fast SBIC matches MAJ in terms of computational speed, whereas all the other algorithms are\norders of magnitude slower. This makes Fast SBIC the only viable alternative to MAJ for the online\nsetting, particularly because it can deliver superior predictive accuracy.\n\nReal-world data. Second, we consider the 5 publicly available dataset listed in Table 1, which\ncome with binary annotations and ground-truth values. For more information on the datasets see\n[Snow et al., 2008; Welinder et al., 2010; Lease and Kazai, 2011]. The performance of the algorithms\nis reported in Table 2. There we run EM, AFM, MC and SBIC with the generic prior \u03b1 = 2, \u03b2 = 1\n2 as proposed in Liu et al. [2012]. 
Additionally, we include the triangular estimation (TE) algorithm from Bonald and Combes [2017], since it outputs non-random predictions on most of the aforementioned datasets.

Figure 3: Time required to complete a single run with |M| = 1000 tasks under the US policy.

Table 1: Summary of the real-world datasets

Dataset   # Tasks   # Workers   # Labels   Avg. L   Avg. R
Birds         108          39       4212      108       39
Ducks         240          53       9600      181       40
RTE           800         164       8000       49       10
TEMP          462          76       4620       61       10
TREC          711         181       2199       12        3

Table 2: Prediction error on the real-world datasets

Dataset   MAJ     EM      AMF     KOS     MC      Fast SBIC   Sorted SBIC   TE
Birds     0.241   0.341   0.278   0.278   0.278   0.260       0.298         0.194
Ducks     0.306   0.412   0.412   0.396   0.412   0.400       0.405         0.408
RTE       0.100   0.072   0.079   0.491   0.075   0.075       0.072         0.257
TEMP      0.057   0.061   0.095   0.567   0.061   0.059       0.062         0.115
TREC      0.257   0.217   0.302   0.259   0.266   0.251       0.239         0.451

Interestingly, the MAJ algorithm performs quite well and achieves the best score on the Ducks and TEMP datasets. This confirms the practitioner's knowledge that majority voting is a robust and viable algorithm in most settings. Unsurprisingly, TE achieves its best score on the Birds dataset, which has a full task-worker matrix X. On the contrary, its predictions are almost random on the TREC dataset, which has a low number of labels per worker. Finally, both variants of SBIC match the performance of the other state-of-the-art Bayesian algorithms (EM, AMF, MC), with Sorted SBIC achieving the best score on RTE, and EM on both RTE and TREC.
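The Avg. L and Avg. R columns of Table 1 follow directly from the raw counts: Avg. L is the number of labels divided by the number of workers, and Avg. R is the number of labels divided by the number of tasks. A quick sanity check (values rounded to the nearest integer, matching the table):

```python
# (tasks, workers, labels) per dataset, as listed in Table 1.
datasets = {
    "Birds": (108, 39, 4212),
    "Ducks": (240, 53, 9600),
    "RTE":   (800, 164, 8000),
    "TEMP":  (462, 76, 4620),
    "TREC":  (711, 181, 2199),
}
for name, (tasks, workers, labels) in datasets.items():
    avg_l = round(labels / workers)  # Avg. L: labels per worker
    avg_r = round(labels / tasks)    # Avg. R: labels per task
    print(f"{name}: Avg. L = {avg_l}, Avg. R = {avg_r}")
```

For instance, Birds has 4212 labels from 39 workers over 108 tasks, giving Avg. L = 108 and Avg. R = 39, i.e. the dense task-worker matrix that favours TE in Table 2; TREC, at Avg. L = 12 and Avg. R = 3, sits at the opposite, sparse extreme.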
More importantly, Fast SBIC is always close to the other algorithms, making a strong case for its computationally efficient approach to Variational Bayes.

6 Conclusions

In this paper we proposed Streaming Bayesian Inference for Crowdsourcing, a new method to infer the ground-truth from binary crowdsourced data. This method combines strong theoretical guarantees, state-of-the-art accuracy and computational efficiency. The latter makes it the only viable alternative to majority voting when real-time decisions need to be made in an online setting. We plan to extend these techniques to the multi-class case in future work.

Acknowledgments

This research is funded by the UK Research Council project ORCHID, grant EP/I011587/1. The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton.

References

Daniel W. Barowy, Charlie Curtsinger, Emery D. Berger, and Andrew McGregor. AutoMan: A Platform for Integrating Human-based and Digital Computation. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, pages 639–654, 2012.

Thomas Bonald and Richard Combes. A Minimax Optimal Algorithm for Crowdsourcing. In Proceedings of the Thirtieth International Conference on Neural Information Processing Systems, pages 4355–4363, 2017.

Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. Streaming Variational Bayes. In Proceedings of the Twenty-Sixth International Conference on Neural Information Processing Systems, pages 1727–1735, 2013.

Nilesh Dalvi, Anirban Dasgupta, Ravi Kumar, and Vibhor Rastogi. Aggregating Crowdsourced Binary Ratings.
In Proceedings of the 22nd International Conference on World Wide Web, pages 285–294, 2013.

Alexander P. Dawid and Allan M. Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28, 1979.

Julie S. Downs, Mandy B. Holbrook, Steve Sheng, and Lorrie Faith Cranor. Are Your Participants Gaming the System? Screening Mechanical Turk Workers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2399–2402, 2010.

Chao Gao, Yu Lu, and Dengyong Zhou. Exact Exponent in Optimal Rates for Crowdsourcing. In Proceedings of the Thirty-Third International Conference on Machine Learning, pages 603–611, 2016.

Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality Management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64–67, 2010.

David R. Karger, Sewoong Oh, and Devavrat Shah. Budget-Optimal Task Allocation for Reliable Crowdsourcing Systems. Operations Research, 62(1):1–24, 2014.

Hyun-Chul Kim and Zoubin Ghahramani. Bayesian Classifier Combination. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22, pages 619–627, 2012.

Matthew Lease and Gabriella Kazai. Overview of the TREC 2011 Crowdsourcing Track. In Proceedings of TREC 2011, 2011.

Qiang Liu, Jian Peng, and Alexander Ihler. Variational Inference for Crowdsourcing. In Proceedings of the Twenty-Fifth International Conference on Neural Information Processing Systems, pages 692–700, 2012.

Edoardo Manino, Long Tran-Thanh, and Nicholas R. Jennings. On the Efficiency of Data Collection for Crowdsourced Classification. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 1568–1575, 2018.

Kevin P. Murphy.
Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

Edwin Simpson and Stephen Roberts. Bayesian Methods for Intelligent Task Assignment in Crowdsourcing Systems. In Scalable Decision Making: Uncertainty, Imperfection, Deliberation, pages 1–32. Springer, 2014.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. Cheap and Fast – but is It Good? Evaluating Non-expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263, 2008.

Peter Welinder and Pietro Perona. Online Crowdsourcing: Rating Annotators and Obtaining Cost-Effective Labels. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Workshops, pages 25–32, 2010.

Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. The Multidimensional Wisdom of Crowds. In Proceedings of the Twenty-Third International Conference on Neural Information Processing Systems, pages 1–9, 2010.

Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I. Jordan. Spectral Methods Meet EM: A Provably Optimal Algorithm for Crowdsourcing. Journal of Machine Learning Research, 17(1):3537–3580, 2016.