{"title": "Inference Aided Reinforcement Learning for Incentive Mechanism Design in Crowdsourcing", "book": "Advances in Neural Information Processing Systems", "page_first": 5507, "page_last": 5517, "abstract": "Incentive mechanisms for crowdsourcing are designed to incentivize financially self-interested workers to generate and report high-quality labels. Existing mechanisms are often developed as one-shot static solutions, assuming a certain level of knowledge about worker models (expertise levels, costs for exerting efforts, etc.). In this paper, we propose a novel inference aided reinforcement mechanism that acquires data sequentially and requires no such prior assumptions. Specifically, we first design a Gibbs sampling augmented Bayesian inference algorithm to estimate workers' labeling strategies from the collected labels at each step. Then we propose a reinforcement incentive learning (RIL) method, building on top of the above estimates, to uncover how workers respond to different payments. RIL dynamically determines the payment without accessing any ground-truth labels. We theoretically prove that RIL is able to incentivize rational workers to provide high-quality labels both at each step and in the long run. Empirical results show that our mechanism performs consistently well under both rational and non-fully rational (adaptive learning) worker models. 
Besides, the payments offered by RIL are more robust and have lower variances compared to existing one-shot mechanisms.", "full_text": "Inference Aided Reinforcement Learning for Incentive Mechanism Design in Crowdsourcing

Zehong Hu
Alibaba Group, Hangzhou, China
HUZE0004@e.ntu.edu.sg

Yitao Liang
University of California, Los Angeles
yliang@cs.ucla.edu

Jie Zhang
Nanyang Technological University
ZhangJ@ntu.edu.sg

Zhao Li
Alibaba Group, Hangzhou, China
lizhao.lz@alibaba-inc.com

Yang Liu
University of California, Santa Cruz / Harvard University
yangliu@ucsc.edu

Abstract

Incentive mechanisms for crowdsourcing are designed to incentivize financially self-interested workers to generate and report high-quality labels. Existing mechanisms are often developed as one-shot static solutions, assuming a certain level of knowledge about worker models (expertise levels, costs of exerting efforts, etc.). In this paper, we propose a novel inference aided reinforcement mechanism that learns to incentivize high-quality data sequentially and requires no such prior assumptions. Specifically, we first design a Gibbs sampling augmented Bayesian inference algorithm to estimate workers' labeling strategies from the collected labels at each step. Then we propose a reinforcement incentive learning (RIL) method, building on top of the above estimates, to uncover how workers respond to different payments. RIL dynamically determines the payment without accessing any ground-truth labels. We theoretically prove that RIL is able to incentivize rational workers to provide high-quality labels. Empirical results show that our mechanism performs consistently well under both rational and non-fully rational (adaptive learning) worker models. 
Besides, the payments offered by RIL are more robust and have lower variances compared to the existing one-shot mechanisms.

1 Introduction

The ability to quickly collect large-scale and high-quality labeled datasets is crucial for Machine Learning (ML). Among all proposed solutions, one of the most promising options is crowdsourcing [7, 19, 4, 18]. Nonetheless, it has been noted that crowdsourced data often suffers from quality issues, due to its salient feature of no monitoring and no ground-truth verification of workers' contributions. This quality control challenge has been tackled by two relatively disconnected research communities. From the more ML side, quite a few inference techniques have been developed to infer true labels from crowdsourced and potentially noisy labels [16, 11, 24, 23]. These solutions often work as one-shot, post-processing procedures facing a static set of workers, whose labeling accuracy is fixed and informative. Despite their empirical success, the aforementioned methods ignore the effects of incentives when dealing with human inputs. It has been observed both in theory and in practice that, without appropriate incentives, selfish and rational workers tend to contribute low-quality, uninformative, if not malicious, data [17, 12]. Existing inference algorithms are very vulnerable to these cases: either many more redundant labels would be needed (low-quality inputs), or the methods would simply fail to work (the case where inputs are uninformative or malicious).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

From the less ML side, the above quality control question has been studied in the context of incentive mechanism design. In particular, a family of mechanisms, jointly referred to as peer prediction, has been proposed [15, 8, 21, 2]. 
Existing peer prediction mechanisms focus on achieving incentive compatibility (IC), meaning that truthfully reporting one's private data, or reporting high-quality data, maximizes a worker's expected utility. These mechanisms achieve IC by comparing the reports of the to-be-scored worker against those of a randomly selected reference worker, bypassing the challenge of no ground-truth verification. However, we note several undesirable properties of these methods. Firstly, from a learning perspective, the collected labels contain rich information about the ground-truth labels and workers' labeling accuracy. Existing peer prediction mechanisms often rely on reported data from a small subset of reference workers, which represents only a limited share of the overall collected information. As a consequence, the mechanism designer forgoes the opportunity to leverage learning methods to generate a more credible and informative reference answer for the purpose of evaluation. Secondly, existing peer prediction mechanisms often require a certain level of prior knowledge about worker models, such as the cost of exerting effort and the labeling accuracy attained under different effort levels. Such prior knowledge is difficult to obtain in real environments. Thirdly, they often assume that workers are fully rational and always follow the utility-maximizing strategy; in practice, workers may instead adapt their strategies in a dynamic manner.
In this paper, we propose an inference-aided reinforcement mechanism, aiming to merge and extend techniques from both the inference and the incentive design communities to address the caveats that arise when they are employed alone, as discussed above. The high-level idea is as follows: we collect data in a sequential fashion. At each step, we assign workers a certain number of tasks and estimate the true labels and workers' strategies from their labels. 
Relying on the above estimates, a reinforcement learning (RL) algorithm is proposed to uncover how workers respond to different levels of offered payments. The RL algorithm determines the payments for the workers based on the information collected up to date. By doing so, our mechanism not only incentivizes (non-)rational workers to provide high-quality labels but also dynamically adjusts the payments according to workers' responses, so as to maximize the data requester's cumulative utility. Applying standard RL solutions here is challenging, due to unobservable states (workers' labeling strategies) and an unobservable reward (the aggregated label accuracy), both stemming from the lack of ground-truth labels. Leveraging standard inference methods seems a plausible solution at first sight (for estimating both the states and the reward), but we observe that existing methods tend to over-estimate the aggregated label accuracy, which would mislead the superstructure RL algorithm.
We address the above challenges and make the following contributions: (1) We propose a Gibbs sampling augmented Bayesian inference algorithm, which estimates workers' labeling strategies and the aggregated label accuracy, as most existing inference algorithms do, but significantly lowers the estimation bias of the labeling accuracy. This lays a strong foundation for constructing correct reward signals, which are extremely important if one wants to leverage reinforcement learning techniques. (2) A reinforcement incentive learning (RIL) algorithm is developed to maximize the data requester's cumulative utility by dynamically adjusting incentive levels according to workers' responses to payments. (3) We prove that our Bayesian inference algorithm and RIL algorithm are incentive compatible (IC) at each step and in the long run, respectively. 
(4) Experiments are conducted to test our mechanism, which show that it performs consistently well under different worker models. Meanwhile, compared with the state-of-the-art peer prediction solutions, our Bayesian inference aided mechanism improves the robustness and lowers the variance of payments.

2 Problem Formulation

This paper considers the following data acquisition problem via crowdsourcing: at each discrete time step t = 1, 2, ..., a data requester assigns M tasks with binary answer space {−1, +1} to N ≥ 3 candidate workers to label. Workers receive payments for submitting a label for each task. We use L_i^t(j) to denote the label worker i generates for task j at time t. For simplicity of computation, we reserve L_i^t(j) = 0 if j is not assigned to i. Furthermore, we use L and 𝓛 to denote the set of ground-truth labels and the set of all collected labels, respectively.

Figure 1: Overview of our incentive mechanism.

The generated label L_i^t(j) depends both on the latent ground-truth L(j) and on worker i's strategy, which is mainly determined by two factors: the exerted effort level (high or low) and the reporting strategy (truthful or deceitful). Accommodating the notation commonly used in reinforcement learning, we also refer to worker i's strategy as his/her internal state. At any given time, workers may, at will, adopt an arbitrary combination of effort level and reporting strategy. Specifically, we define eft_i^t ∈ [0, 1] and rpt_i^t ∈ [0, 1] as worker i's probability of exerting high effort and of reporting truthfully for task j, respectively. Furthermore, we use P_{i,H} and P_{i,L} to denote worker i's probability of observing the true label when exerting high and low effort, respectively. Correspondingly, we denote worker i's cost of exerting high and low effort by c_{i,H} and c_{i,L}, respectively. 
For the simplicity of analysis, we assume that P_{i,H} > P_{i,L} = 0.5 and c_{i,H} > c_{i,L} = 0. All the above parameters and workers' actions stay unknown to our mechanism. In other words, we regard workers as black boxes, which distinguishes our mechanism from the existing peer prediction mechanisms.
Worker i's probability of being correct (PoBC) at time t for any given task is given as

P_i^t = rpt_i^t · eft_i^t · P_{i,H} + rpt_i^t · (1 − eft_i^t) · P_{i,L}
      + (1 − rpt_i^t) · eft_i^t · (1 − P_{i,H}) + (1 − rpt_i^t) · (1 − eft_i^t) · (1 − P_{i,L})    (1)

Suppose we assign m_i^t ≤ M tasks to worker i at step t. Then, a risk-neutral worker's utility satisfies:

u_i^t = Σ_{j=1}^{M} P_i^t(j) − m_i^t · c_{i,H} · eft_i^t    (2)

where P_i^t(j) denotes our payment to worker i for task j at time t (see Section 3 for more details).
At the beginning of each step, the data requester and the workers agree on a certain payment rule, which is not changed until the next time step. The workers are self-interested and may choose their labeling and reporting strategies according to the expected utility they can get. After collecting the generated labels, the data requester infers the true labels L̃^t(j) by running a certain inference algorithm. 
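As a quick sanity check on Eqns. (1) and (2), the following Python sketch (not part of the paper; variable names such as p_high and c_high are our own shorthand) evaluates a worker's PoBC and expected utility:

```python
def pobc(rpt, eft, p_high, p_low=0.5):
    """Eqn (1): probability of being correct, given reporting/effort strategy."""
    return (rpt * eft * p_high
            + rpt * (1.0 - eft) * p_low
            + (1.0 - rpt) * eft * (1.0 - p_high)
            + (1.0 - rpt) * (1.0 - eft) * (1.0 - p_low))

def expected_utility(payments, m, c_high, eft):
    """Eqn (2): total payments received minus the cost of exerted effort."""
    return sum(payments) - m * c_high * eft

# A truthful, high-effort worker with P_H = 0.9 is correct w.p. 0.9,
# while a truthful low-effort worker falls back to P_L = 0.5.
assert abs(pobc(rpt=1.0, eft=1.0, p_high=0.9) - 0.9) < 1e-12
assert abs(pobc(rpt=1.0, eft=0.0, p_high=0.9) - 0.5) < 1e-12
```

Note how a fully untruthful worker (rpt = 0) with high effort attains PoBC 1 − P_{i,H}, i.e. deliberately wrong answers, which is why inference alone cannot be trusted without incentives.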
The aggregated label accuracy A^t and the data requester's utility r^t are defined as follows:

A^t = (1/M) Σ_{j=1}^{M} 1[L̃^t(j) = L(j)],   r^t = F(A^t) − η Σ_{i=1}^{N} Σ_{j=1}^{M} P_i^t(j)    (3)

where F(·) is a non-decreasing monotonic function mapping accuracy to utility and η > 0 is a tunable parameter balancing label quality and costs.

3 Inference-Aided Reinforcement Mechanism for Crowdsourcing

Our mechanism mainly consists of three components: the payment rule, Bayesian inference and reinforcement incentive learning (RIL); see Figure 1 for an overview, where estimated values are denoted with tildes. The payment rule computes the payment to worker i for his/her label on task j as

P_i^t(j) = a^t · [sc_i^t(j) − 0.5] + b    (4)

where a^t ∈ A denotes the scaling factor, determined by RIL at the beginning of every step t and shared by all workers, sc_i^t(j) denotes worker i's score on task j, which will be computed by the Bayesian inference algorithm, and b ≥ 0 is a constant representing the fixed base payment. The Bayesian inference algorithm is also responsible for estimating the true labels, workers' PoBCs and the aggregated label accuracy at each time step, preparing the necessary inputs to RIL. Based on these estimates, RIL seeks to maximize the cumulative utility of the data requester by optimally balancing the utility (accuracy in labels) and the payments.

3.1 Bayesian Inference

For the simplicity of notation, we omit the superscript t in this subsection. The motivation for designing our own Bayesian inference algorithm is as follows. 
We ran several preliminary experiments using popular inference algorithms, for example, EM [3, 16, 22] and Variational Inference [11, 1]. Our empirical studies reveal that those methods tend to heavily over-estimate the aggregated label accuracy when the quality of labels is low.¹ This leads to a biased estimate of the data requester's utility r^t (which cannot be observed directly), and this estimated utility is used as the reward signal in RIL, as detailed later. Since the reward signal plays the core role in guiding the reinforcement learning process, the heavy bias would severely mislead our mechanism.
To reduce the estimation bias, we develop a Bayesian inference algorithm by introducing soft Dirichlet priors both to the distribution of true labels, τ = [τ_{−1}, τ_{+1}] ∼ Dir(β_{−1}, β_{+1}), where τ_{−1} and τ_{+1} denote the fractions of label −1 and +1, respectively, and to workers' PoBCs, [P_i, 1 − P_i] ∼ Dir(α_1, α_2). 
Then, we derive the conditional distribution of the true labels given the collected labels as (see Appendix A)

P(L|𝓛) = P(L, 𝓛)/P(𝓛) ∝ B(β̂) · ∏_{i=1}^{N} B(α̂_i),

where B(x, y) = (x − 1)!(y − 1)!/(x + y − 1)! denotes the beta function, α̂_i = [α̂_{i1}, α̂_{i2}], β̂ = [β̂_{−1}, β̂_{+1}], and

α̂_{i1} = Σ_{j=1}^{M} Σ_{k∈{−1,+1}} δ_{ijk} ξ_{jk} + 2α_1 − 1,
α̂_{i2} = Σ_{j=1}^{M} Σ_{k∈{−1,+1}} δ_{ij(−k)} ξ_{jk} + 2α_2 − 1,
β̂_k = Σ_{j=1}^{M} ξ_{jk} + 2β_k − 1,

with δ_{ijk} = 1(L_i(j) = k) and ξ_{jk} = 1(L(j) = k).
Note that it is generally hard to derive an explicit formula for the posterior distribution of a specific task j's ground-truth from the conditional distribution P(L|𝓛). We thus resort to Gibbs sampling for the inference. More specifically, according to Bayes' theorem, the conditional distribution of task j's ground-truth L(j) satisfies P[L(j)|𝓛, L(−j)] ∝ P(L|𝓛), where −j denotes all tasks excluding j. Leveraging this, we generate samples of the true label vector L following Algorithm 1.

Algorithm 1 Gibbs sampling for crowdsourcing
1: Input: the collected labels 𝓛, the number of samples W
2: Output: the sample sequence S
3: S ← ∅, initialize L with the uniform distribution
4: for s = 1 to W do
5:   for j = 1 to M do
6:     L(j) ← 1 and compute x_1 = B(β̂) ∏_{i=1}^{N} B(α̂_i)
7:     L(j) ← 2 and compute x_2 = B(β̂) ∏_{i=1}^{N} B(α̂_i)
8:     L(j) ← sample from {1, 2} with P(1) = x_1/(x_1 + x_2)
9:   Append L̃ to the sample sequence S

At each step of the sampling procedure (lines 6-8), Algorithm 1 first computes P[L(j)|𝓛, L(−j)] and then generates a new sample of L(j) to replace the old one in L̃. After traversing through all tasks, Algorithm 1 generates a new sample of the true label vector L. Repeating this process W times, we get W samples, recorded in S. Here, we write the s-th sample as L̃^{(s)}. Since Gibbs sampling requires a burn-in process, we discard the first W_0 samples and calculate worker i's score on task j and PoBC as

sc_i(j) = (1/(W − W_0)) Σ_{s=W_0}^{W} 1[L̃^{(s)}(j) = L_i(j)],
P̃_i = [Σ_{s=W_0}^{W} (2α_1 − 1 + Σ_{j=1}^{M} 1(L̃^{(s)}(j) = L_i(j)))] / [(W − W_0) · (2α_1 + 2α_2 − 2 + m_i)].    (5)

Similarly, we can obtain the estimate of the true label distribution τ and then derive the log-ratio of task j, σ_j = log(P[L(j) = −1]/P[L(j) = +1]). Furthermore, we decide the true label estimate L̃(j) as −1 if σ̃_j > 0 and as +1 if σ̃_j < 0. Correspondingly, the label accuracy A is estimated as

Ã = E(A) = M^{−1} Σ_{j=1}^{M} e^{|σ̃_j|} (1 + e^{|σ̃_j|})^{−1}.    (6)

¹ See Section 5.1 for detailed experiment results and analysis.

In our Bayesian inference algorithm, workers' scores, PoBCs and the true label distribution are all estimated by comparing the true label samples with the collected labels. To prove the convergence of our algorithm, we therefore need to bound the ratio of wrong samples. We introduce n and m to denote the number of tasks whose true label in the s-th sample is correct (L̃^{(s)}(j) = L(j)) and wrong (L̃^{(s)}(j) ≠ L(j)), respectively. Formally, we have:

Lemma 1. Let P̄ = 1 − P, P̂ = max{P, P̄} and P_0 = τ_{−1}. When M ≫ 1,

E[m/M] ≲ (1 + e^δ)^{−1} (ε + e^δ)(1 + ε)^{M−1},   E[m/M]² ≲ (1 + e^δ)^{−1} (ε² + e^δ)(1 + ε)^{M−2}    (7)

where ε^{−1} = ∏_{i=0}^{N} (2P̂_i)², δ = O[∆ · log(M)] and ∆ = Σ_{i=1}^{N} [1(P_i < 0.5) − 1(P_i > 0.5)].

The proof is in Appendix B. Our main idea is to introduce a set of counts for the collected labels and then calculate E[m/M] and E[m/M]² based on the distribution of these counts. Using Lemma 1, the convergence of our Bayesian inference algorithm is stated as follows:

Theorem 1 (Convergence). When M ≫ 1 and ∏_{i=0}^{N} (2P̂_i)² ≥ M, if most workers report truthfully (i.e. ∆ < 0), then with probability at least 1 − δ ∈ (0, 1), |P̃_i − P_i| ≤ O(1/√(δM)) holds for any worker i's PoBC estimate P̃_i, as well as for the true label distribution estimate (τ̃_{−1} = P̃_0).

The convergence of P̃_i and τ̃ naturally leads to the convergence of σ̃_j and Ã, because the latter estimates are computed entirely from the former. All these convergence guarantees enable us to use the estimates computed by Bayesian inference to construct the state and reward signal in our reinforcement learning algorithm RIL. In addition, we summarize the time and space complexity of our Bayesian inference algorithm as follows.

Theorem 2 (Complexity). The time and space complexity of our Bayesian inference algorithm are O(WMN) and O(WM), respectively.

The space complexity is obvious, as the two loops in Algorithm 1 generate WM samples in total. Regarding the time complexity, the additional factor N results from the calculation of x_1 and x_2. Since our Bayesian inference algorithm only samples one variable (L), the sampling converges sufficiently fast. In practice, the number of samples W does not need to be very large; we set it to 1000 in our experiments. 
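For concreteness, the sampler of Algorithm 1 and an Eqn. (5)-style PoBC estimate can be sketched in Python as follows. This is an illustrative sketch, not the paper's code: the uniform priors (α_1 = α_2 = β = 1), the tiny synthetic label matrix, and all variable names are our own assumptions, labels are kept in {−1, +1} rather than {1, 2}, and the beta-function products are evaluated in log-space via lgamma for numerical stability.

```python
import math
import random

def log_beta(a, b):
    # log B(a, b), evaluated via lgamma to avoid under/overflow
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_score(labels, truth, a1=1.0, a2=1.0, b=1.0):
    """log of B(beta_hat) * prod_i B(alpha_hat_i) for a candidate truth vector."""
    m = len(truth)
    n_neg = sum(1 for t in truth if t == -1)
    s = log_beta(n_neg + 2 * b - 1, m - n_neg + 2 * b - 1)
    for row in labels:  # one worker's labels per row
        match = sum(1 for lab, t in zip(row, truth) if lab == t)
        s += log_beta(match + 2 * a1 - 1, m - match + 2 * a2 - 1)
    return s

def gibbs_sample(labels, n_samples=200, burn_in=50, seed=0):
    rng = random.Random(seed)
    m = len(labels[0])
    truth = [rng.choice((-1, 1)) for _ in range(m)]  # uniform initialization
    samples = []
    for _ in range(n_samples):
        for j in range(m):  # lines 5-8 of Algorithm 1
            truth[j] = -1
            x_neg = log_score(labels, truth)
            truth[j] = 1
            x_pos = log_score(labels, truth)
            p_neg = 1.0 / (1.0 + math.exp(x_pos - x_neg))
            truth[j] = -1 if rng.random() < p_neg else 1
        samples.append(list(truth))  # line 9
    return samples[burn_in:]

# Three workers label eight tasks; worker 3 disagrees on the last task.
labels = [[1, 1, -1, -1, 1, 1, 1, -1],
          [1, 1, -1, -1, 1, 1, 1, -1],
          [1, 1, -1, -1, 1, 1, 1, 1]]
samples = gibbs_sample(labels)
# Eqn (5)-style PoBC estimate for worker 1 (with alpha_1 = alpha_2 = 1):
matches = sum(sum(1 for s_j, l_j in zip(s, labels[0]) if s_j == l_j) for s in samples)
pobc_1 = (len(samples) + matches) / (len(samples) * (2 + len(labels[0])))
```

The per-sample scores sc_i(j), averaged the same way, would then feed the payment rule of Eqn. (4). Note that with fully symmetric priors the posterior also admits a label-flipped mode, which is why the theory above assumes most workers report truthfully (∆ < 0).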
Considering that neither the number of tasks nor the number of workers is very large, the computational efficiency of our Bayesian inference should be acceptable.

3.2 Reinforcement Incentive Learning

In this subsection, we formally introduce our reinforcement incentive learning (RIL) algorithm, which adjusts the scaling factor a^t to maximize the data requester's utility accumulated in the long run. To fully understand the technical background, readers are expected to be familiar with Q-values and function approximation. For readers with limited knowledge, we kindly refer them to Appendix D, where we provide background on these concepts. With some transformation, our problem can be perfectly modeled as a Markov Decision Process. To be more specific, our mechanism is the agent, and it interacts with workers (i.e. the environment); the scaling factor a^t is the action; the utility of the data requester r^t defined in Eqn. (3) is the reward. Workers' reporting strategies are the state. After receiving payments, workers may change their strategies to, for example, increase their utilities at the next step. How workers change their strategies forms the state transition kernel.
On the other hand, the reward r^t defined in Eqn. (3) cannot be used directly because the true accuracy A^t cannot be observed. Thus, we use the estimated accuracy Ã^t calculated by Eqn. (6) instead to approximate r^t as in Eqn. (8). Furthermore, to achieve better generalization across different states, it is a common approach to learn a feature-based state representation φ(s) [14, 9]. Recall that the data requester's implicit utility at time t only depends on the aggregated PoBC averaged across all workers. This observation already points to a representation design with good generalization, namely φ(s^t) = Σ_{i=1}^{N} P_i^t / N. Further recall that, when deciding the current scaling factor a^t, the data requester does not observe the latest workers' PoBCs and thus cannot directly estimate the current φ(s^t). Due to this one-step delay, we have to build our state representation using the previous observation. Since most workers would only change their internal states after receiving a new incentive, there exists some imperfect mapping function φ(s^t) ≈ f(φ(s^{t−1}), a^{t−1}). Utilizing this implicit function, we introduce the augmented state representation in RIL as ŝ^t in Eqn. (8):

r^t ≈ F(Ã^t) − η Σ_{i=1}^{N} P_i^t,   ŝ^t = ⟨φ(s^{t−1}), a^{t−1}⟩.    (8)

Since neither r^t nor s^t can be perfectly inferred, it would not be a surprise to observe some noise that cannot be directly learned in our Q-function. For most crowdsourcing problems the number of tasks M is large, so we can leverage the central limit theorem to justify modeling this noise as a Gaussian process. To be more specific, we calculate the temporal difference (TD) error as

r^t ≈ Q^π(ŝ^t, a^t) − γ E_π Q^π(ŝ^{t+1}, a^{t+1}) + ε^t    (9)

where the noise ε^t follows a Gaussian process and π = P(a|ŝ) denotes the current policy. By doing so, we gain two benefits.

Algorithm 2 Reinforcement Incentive Learning (RIL)
1: for each episode do
2:   for each step in the episode do
3:     Decide the scaling factor via the ε-greedy method:
       a^t = argmax_{a∈A} Q(ŝ^t, a) with probability 1 − ε, or a random a ∈ A with probability ε
4:     Assign tasks and collect labels from the workers
5:     Run Bayesian inference to get ŝ^{t+1} and r^t
6:     Use (ŝ^t, a^t, r^t) to update K, H and r in Eqn. (10)
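The control flow of Algorithm 2 can be sketched in Python as follows. This is a simplified illustration, not the paper's implementation: a tabular TD(0) update stands in for the Gaussian-process Q-function developed below, and toy_workers is a hypothetical environment standing in for task assignment plus Bayesian inference (lines 4-5 of Algorithm 2).

```python
import random

ACTIONS = [0.1, 1.0, 5.0, 10.0]   # candidate scaling factors a^t
EPSILON, GAMMA, LR = 0.2, 0.9, 0.1

def toy_workers(action):
    """Hypothetical environment: accuracy grows with the offered incentive."""
    accuracy = min(1.0, 0.5 + 0.05 * action)
    reward = accuracy ** 10 - 0.001 * action   # r^t = F(A^t) - eta * payments, F(A) = A^10
    return accuracy, reward

def choose_action(q, state, rng):
    """Line 3 of Algorithm 2: epsilon-greedy over the learned Q-values."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

def train(episodes=200, steps=28, seed=0):
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = (0.5, ACTIONS[0])   # augmented state <phi(s^{t-1}), a^{t-1}>
        for _ in range(steps):
            action = choose_action(q, state, rng)
            accuracy, reward = toy_workers(action)        # stands in for lines 4-5
            next_state = (round(accuracy, 2), action)
            best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
            key = (state, action)                          # stands in for line 6
            q[key] = q.get(key, 0.0) + LR * (reward + GAMMA * best_next - q.get(key, 0.0))
            state = next_state
    return q

q = train()
```

In the paper, line 6 instead refits the online Gaussian-process regression of Eqn. (10); the loop structure and the ε-greedy action choice are the same.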
First, the approximation greatly simplifies the derivation of the update equation for the Q-function. Second, as shown in our empirical results later, this kind of approximation is robust across different worker models. Besides, following [6] we approximate the Q-function as Q^π(ŝ^{t+1}, a^{t+1}) ≈ E_π Q^π(ŝ^{t+1}, a^{t+1}) + ε^π, where ε^π also follows a Gaussian process.
Under the Gaussian process approximation, all the observed rewards and the corresponding Q-values up to the current step t form a system of equations, which can be written as r = HQ + N, where r, Q and N denote the collections of rewards, Q-values and residuals, respectively. Following the Gaussian process assumption for residuals, N ∼ N(0, σ²), where σ² = diag(σ², ..., σ²). The matrix H satisfies H(k, k) = 1 and H(k, k + 1) = −γ for k = 1, ..., t. Then, by using the online Gaussian process regression algorithm [5], we effectively learn the Q-function as

Q(ŝ, a) = k(ŝ, a)^T (K + σ²)^{−1} H^{−1} r    (10)

where k(ŝ, a) = [k((ŝ, a), (ŝ_1, a_1)), ..., k((ŝ, a), (ŝ_t, a_t))]^T and K = [k(ŝ_1, a_1), ..., k(ŝ_t, a_t)]. Here, we use k(·,·) to denote the Gaussian kernel. Finally, we employ the classic ε-greedy method to decide a^t based on the learned Q-function. To summarize, we provide a formal description of RIL in Algorithm 2. Note that, when updating K, H and r in line 6, we employ the sparse approximation proposed in [6] to discard some data so that the sizes of these matrices do not grow infinitely.

4 Theoretical Analysis on Incentive Compatibility

In this section, we prove the incentive compatibility of our Bayesian inference and reinforcement learning algorithms. 
Our main results are as follows:

Theorem 3 (One Step IC). At any time step t, when M ≫ 1, ∏_{i=1}^{N} (2P_{i,H})² ≥ M and a^t > max_i c_{i,H}/(P_{i,H} − 0.5), reporting truthfully and exerting high effort is the utility-maximizing strategy for any worker i at equilibrium (if all other workers follow this strategy).

Proof. In Appendix E, we prove that when a^t > c_{i,H}/(P_{i,H} − 0.5), if P̃_i^t ≈ P_i^t, any worker i's utility-maximizing strategy is to report truthfully and exert high effort. Since Theorem 1 provides the convergence guarantee, Theorem 3 follows.

Theorem 4 (Long Term IC). Suppose the conditions in Theorem 3 are satisfied and the learned Q-function approaches the real Q^π(ŝ, a). When the following inequality holds for i = 1, ..., N,

η M Σ_{x≠i} P_{x,H} · G_A > (F(1) − F(1 − ψ_i))/(1 − γ),   ψ_i = (τ_{−1}/τ_{+1} + τ_{+1}/τ_{−1}) ∏_{x≠i} √(4P_{x,H}(1 − P_{x,H}))    (11)

always reporting truthfully and exerting high effort is the utility-maximizing strategy for any worker i in the long term, if all other workers follow this strategy. Here, G_A = min_{a,b∈A, a≠b} |a − b| denotes the minimal gap between two available values of the scaling factor.

In order to induce RIL to change actions, worker i must let RIL learn a wrong Q-function. Thus, the main idea of the proof is to derive upper bounds on the effects of worker i's reports on the Q-function. Besides, Theorem 4 shows that, to design reinforcement learning algorithms that are robust against manipulation by strategic agents, we should leave a certain gap between actions. This observation may be of independent interest to reinforcement learning researchers.

Figure 2: Empirical analysis on Bayesian Inference (a) and RIL (b-c). To be more specific, (a) compares the inference bias (i.e. 
the difference between the inferred label accuracy and the real one) of our Bayesian inference algorithm with that of EM and variational inference, averaged over 100 runs. (b) draws the gap between the estimate of the data requester's cumulative utility and the real one, smoothed over 5 episodes. (c) shows the learning curve of our mechanism, smoothed over 5 episodes.

5 Empirical Experiments

In this section, we empirically investigate the competitiveness of our solution. To be more specific, we first show that our proposed Bayesian inference algorithm produces more accurate estimates of the aggregated label accuracy than the existing inference methods. Encouraged by the above results, we then move on to evaluate RIL on three popular worker models, each representing one classic way of modeling how human agents make decisions. Note that for this part, our experiments are conducted on real-world data. Via these experiments, we demonstrate that our RIL algorithm consistently manages to learn a good incentive policy under all three rationality models of the workers. Lastly, we show, as a bonus benefit of our mechanism, that leveraging Bayesian inference to fully exploit the information contained in the collected labels leads to more robust and lower-variance payments at each step.

5.1 Empirical Analysis on Bayesian Inference

The aggregated label accuracy estimated by our Bayesian inference algorithm serves as a major component of the state representation and reward function for RIL, and thus critically affects the performance of our mechanism. Due to this, we choose to first investigate the bias of our Bayesian inference algorithm. To be more specific, we compare our Bayesian inference algorithm with two popular inference algorithms in crowdsourcing, that is, the EM estimator [16] and the variational inference estimator [11]. 
We utilize the RTE dataset, where workers need to check whether a hypothesis sentence can be inferred from the provided sentence [20]. In order to simulate strategic behaviors of workers, we mix these data with random noise by replacing a part of the real-world labels with uniformly generated ones (low-quality labels).

Table 1: Performance comparison under three worker models. Data requester's cumulative utility, normalized over the number of tasks. Standard deviations reported in parentheses.

METHOD              RATIONAL       QR             MWU
FIXED OPTIMAL       27.584 (.253)  21.004 (.012)  11.723 (.514)
HEURISTIC OPTIMAL   27.643 (.174)  21.006 (.001)  12.304 (.515)
ADAPTIVE OPTIMAL    27.835 (.209)  21.314 (.011)  17.511 (.427)
RIL                 27.184 (.336)  21.016 (.018)  15.726 (.416)

As Figure 2a suggests, compared to EM and variational inference, our proposed Bayesian inference algorithm can significantly lower the bias of the estimates of the aggregated label accuracy, and its advantage lies mostly in the high-noise regime. This is because both variational inference and EM maintain the updates on the posterior distribution via the estimates of workers' PoBCs. When the noise is very high, inaccurate PoBCs inevitably cause the posterior distribution to be biased. 
In our Bayesian inference algorithm, we introduce a soft Dirichlet prior for the true labels, acting as a regularization term, to alleviate this bias.

In fact, we cannot use the estimates from either EM or variational inference as an alternative reward signal, because the biases of their estimates are much higher when the noise level is high; for instance, this bias can reach 0.45 while the label accuracy itself only ranges over [0.5, 1.0]. This set of experiments justifies our motivation to develop our own inference algorithm and reinforces our claim that our inference algorithm provides a foundation for the further development of learning algorithms for crowdsourcing.

5.2 Empirical Analysis on RIL

We move on to investigate whether RIL consistently learns a good policy, i.e., one that maximizes the data requester's cumulative utility R = Σ_t r_t. For all the experiments in this subsection, we set the environment parameters as follows: N = 10, P_H = 0.9, b = 0, c_H = 0.02; the set of scaling factors is A = {0.1, 1.0, 5.0, 10}; F(A) = A^10 and η = 0.001 in the utility function (Eqn. (3)); the number of time steps in an episode is set to 28. Meanwhile, for the adjustable parameters of our mechanism, we set the number of tasks at each step to M = 100 and the exploration rate of RIL to ε = 0.2. We report results averaged over 5 runs to reduce the effect of outliers. To demonstrate our algorithm's general applicability, we test it under three different worker models, each representing a popular way of modeling how human agents make decisions. We briefly describe them below; the detailed descriptions are deferred to Appendix H. (i) Rational workers always take utility-maximizing strategies. (ii) QR workers [13] follow strategies drawn from a pre-determined, utility-dependent distribution.
This model has been used to study agents with bounded rationality. (iii) MWU workers [10] update their strategies according to the celebrated multiplicative weights update algorithm. This model has been used to study adaptive learning agents.

Our first set of experiments is a continuation of the last subsection. To be more specific, we first focus on the estimation bias of the data requester's cumulative utility R. This value is used as the reward in RIL and is calculated from the estimates of the aggregated label accuracy. This set of experiments aims to investigate whether our RIL module successfully leverages the label accuracy estimates and picks up the right reward signal. As Figure 2b shows, the estimates deviate from the real values only by a very limited magnitude after a few episodes of learning, regardless of which worker model the experiments run with. These results further demonstrate that our RIL module observes reliable rewards.

The next set of experiments concerns how quickly RIL learns. As Figure 2c shows, under all three worker models, RIL manages to pick up and stick to a promising policy in fewer than 100 episodes. This observation also demonstrates the robustness of RIL under different environments.

Our last set of experiments in this subsection aims to evaluate the competitiveness of the policy learned by RIL. In Table 1, we use the policy learned after 500 episodes with the exploration rate turned off (i.e., ε = 0) and compare it with three benchmarks constructed by ourselves. To create the first one, Fixed Optimal, we try all 4 possible fixed values for the scaling factor and report the highest cumulative reward realized by any of them. To create the second one, Heuristic Optimal, we divide the value region of Ã_t into five intervals: [0, 0.6), [0.6, 0.7), [0.7, 0.8), [0.8, 0.9) and [0.9, 1.0]. For each interval, we select a fixed value for the scaling factor a_t.
We traverse all 4^5 = 1024 possible combinations to decide the optimal heuristic strategy. To create the third one, Adaptive Optimal, we change the scaling factor every 4 steps and report the highest cumulative reward via traversing all 4^7 = 16384 possible configurations.

Figure 3: Empirical analysis on our Bayesian inference algorithm, averaged over 1000 runs. (a) Average payment per task given the true label's distribution. (b) Average payment per task given the PoBCs of workers excluding i. (c) The standard deviation of the payment given worker i's PoBC.

This benchmark is infeasible to reproduce in real-world practice once the number of steps becomes large. Yet it is very close to the global optimum in the sequential setting. As Table 1 demonstrates, all three benchmarks plus RIL achieve similar performance when tested under rational and QR workers. This is because these two kinds of workers have a fixed pattern in responding to incentives, and thus the optimal policy is a fixed scaling factor throughout the whole episode. In contrast, MWU workers gradually learn utility-maximizing strategies, and the learning process is affected by the incentives. Under this worker environment, RIL manages to achieve an average utility score of 15.7, which is a significant improvement over Fixed Optimal and Heuristic Optimal (which achieve 11.7 and 12.3, respectively), considering that the unrealistic global optimum is only around 18.5. Up to this point, with three sets of experiments, we have demonstrated the competitiveness of RIL and its robustness under different worker environments. Note that, when constructing the benchmarks, we also conduct experiments on DG13, the state-of-the-art peer prediction mechanism for binary labels [2], and reach the same conclusion.
For example, when DG13 and MWU workers are tested under Fixed Optimal and Heuristic Optimal, the cumulative utilities are 11.537 (.397) and 11.908 (.210), respectively, which again shows a large gap from RIL.

5.3 Empirical Analysis on One-Step Payments

In this subsection, we compare the one-step payments provided by our mechanism with the payments calculated by DG13, the state-of-the-art peer prediction mechanism for binary labels [2]. We fix the scaling factor a_t = 1 and set M = 100, N = 10, P_H = 0.8, b = 0 and m_i^t = 90. To set up the experiments, we generate task j's true label L(j) following its distribution τ (to be specified) and worker i's label for task j based on i's PoBC P_i and L(j). In Figure 3a, we let all workers except i report truthfully and exert high efforts (i.e., P_{-i} = P_H), and increase τ_{+1} from 0.05 to 0.95. In Figure 3b, we let τ_{+1} = 0.5 and increase the other workers' PoBCs P_{-i} from 0.6 to 0.95. As both figures reveal, in our mechanism, the payment for worker i depends almost only on his/her own strategy. In contrast, in DG13, the payments are clearly affected by the distribution of true labels and the strategies of other workers. In other words, our Bayesian inference is more robust to different environments. Furthermore, in Figure 3c, we present the standard deviation of the payment to worker i. We let τ_{+1} = 0.5 and P_{-i} = P_H, and increase P_i from 0.6 to 0.95. As shown in the figure, our method achieves a noticeably smaller standard deviation than DG13. Note that, in Figure 3b, we implicitly assume that most workers will at least not adversarially report false labels, an assumption widely adopted in previous studies [11].
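The generative setup of these payment experiments can be sketched as follows; this is a minimal sketch with our own function names, drawing each true label from τ and flipping it independently per worker according to the PoBCs:

```python
import random

def simulate_labels(n_tasks, tau_pos, pobcs, seed=None):
    """Generate synthetic labels as in the one-step payment experiments:
    task j's true label L(j) is +1 with probability tau_pos, and worker i
    reports L(j) with probability pobcs[i] (his/her PoBC), flipping it
    otherwise. Names are ours, for illustration only.
    """
    rng = random.Random(seed)
    truth = [1 if rng.random() < tau_pos else -1 for _ in range(n_tasks)]
    reports = [[t if rng.random() < p else -t for t in truth] for p in pobcs]
    return truth, reports

# e.g., worker 0 at PoBC 0.6 while the other nine report with P_H = 0.8
truth, reports = simulate_labels(100, tau_pos=0.5,
                                 pobcs=[0.6] + [0.8] * 9, seed=1)
```

Sweeping tau_pos (τ_{+1}) or the last nine entries of pobcs (P_{-i}) while holding worker 0's PoBC fixed reproduces the two axes varied in Figures 3a and 3b.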
For workers' collusion attacks, we also provide some defense techniques in Appendix F.

6 Conclusion

In this paper, we build an inference-aided reinforcement mechanism, leveraging Bayesian inference and reinforcement learning techniques, to learn the optimal policy for incentivizing high-quality labels in crowdsourcing. Our mechanism is proved to be incentive compatible. Empirically, we show that our Bayesian inference algorithm helps improve the robustness and lower the variance of payments, which are favorable properties in practice. Meanwhile, our reinforcement incentive learning (RIL) algorithm ensures that our mechanism performs consistently well under different worker models.

Acknowledgments

This work started when Zehong Hu was at the Rolls-Royce@NTU Corporate Lab with support from the National Research Foundation (NRF) Singapore under the Corp Lab@University Scheme. Yitao is partially supported by NSF grants #IIS-1657613, #IIS-1633857 and DARPA XAI grant #N66001-17-2-4032. Yang Liu acknowledges support from NSF CCF #1718549. The authors also thank Anxiang Zeng from Alibaba Group for valuable discussions.

References

[1] Xi Chen, Qihang Lin, and Dengyong Zhou. Statistical decision making for optimal budget allocation in crowd labeling. Journal of Machine Learning Research, 16:1–46, 2015.

[2] Anirban Dasgupta and Arpita Ghosh.
Crowdsourced judgement elicitation with endogenous proficiency. In Proc. of WWW, 2013.

[3] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.

[4] Djellel Eddine Difallah, Michele Catasta, Gianluca Demartini, Panagiotis G Ipeirotis, and Philippe Cudré-Mauroux. The dynamics of micro-task crowdsourcing: The case of Amazon MTurk. In Proc. of WWW, 2015.

[5] Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In Proc. of ICML, 2005.

[6] Milica Gasic and Steve Young. Gaussian processes for POMDP-based dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1):28–40, 2014.

[7] Jeff Howe. The rise of crowdsourcing. Wired Magazine, 14(6), 06 2006.

[8] Radu Jurca, Boi Faltings, et al. Mechanisms for making crowds truthful. Journal of Artificial Intelligence Research, 34(1):209, 2009.

[9] Yitao Liang, Marlos C. Machado, Erik Talvitie, and Michael Bowling. State of the art control of Atari games using shallow reinforcement learning. In Proc. of AAMAS, 2016.

[10] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[11] Qiang Liu, Jian Peng, and Alexander T Ihler. Variational inference for crowdsourcing. In Proc. of NIPS, 2012.

[12] Yang Liu and Yiling Chen. Sequential peer prediction: Learning to elicit effort using posted prices. In Proc. of AAAI, pages 607–613, 2017.

[13] Richard D McKelvey and Thomas R Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38, 1995.

[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K.
Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015.

[15] Dražen Prelec. A Bayesian truth serum for subjective data. Science, 306(5695):462–466, 2004.

[16] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.

[17] Victor S Sheng, Foster Provost, and Panagiotis G Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proc. of SIGKDD, 2008.

[18] Edwin D Simpson, Matteo Venanzi, Steven Reece, Pushmeet Kohli, John Guiver, Stephen J Roberts, and Nicholas R Jennings. Language understanding in the wild: Combining crowdsourcing and machine learning. In Proc. of WWW, 2015.

[19] Aleksandrs Slivkins and Jennifer Wortman Vaughan. Online decision making in crowdsourcing markets: Theoretical challenges. ACM SIGecom Exchanges, 12(2):4–23, 2014.

[20] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y Ng. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proc. of EMNLP, 2008.

[21] Jens Witkowski and David C Parkes. Peer prediction without a common prior. In Proc. of ACM EC, 2012.

[22] Yuchen Zhang, Xi Chen, Denny Zhou, and Michael I Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Proc. of NIPS, 2014.

[23] Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reynold Cheng. Truth inference in crowdsourcing: Is the problem solved? Proc. of the VLDB Endowment, 10(5):541–552, 2017.

[24] Dengyong Zhou, Qiang Liu, John Platt, and Christopher Meek.
Aggregating ordinal labels from crowds by minimax conditional entropy. In Proc. of ICML, 2014.