{"title": "Scoring Workers in Crowdsourcing: How Many Control Questions are Enough?", "book": "Advances in Neural Information Processing Systems", "page_first": 1914, "page_last": 1922, "abstract": "We study the problem of estimating continuous quantities, such as prices, probabilities, and point spreads, using a crowdsourcing approach. A challenging aspect of combining the crowd's answers is that workers' reliabilities and biases are usually unknown and highly diverse. Control items with known answers can be used to evaluate workers' performance, and hence improve the combined results on the target items with unknown answers. This raises the problem of how many control items to use when the total number of items each workers can answer is limited: more control items evaluates the workers better, but leaves fewer resources for the target items that are of direct interest, and vice versa. We give theoretical results for this problem under different scenarios, and provide a simple rule of thumb for crowdsourcing practitioners. As a byproduct, we also provide theoretical analysis of the accuracy of different consensus methods.", "full_text": "Scoring Workers in Crowdsourcing: How Many\n\nControl Questions are Enough?\n\nQiang Liu\n\nDept. of Computer Science\nUniv. of California, Irvine\n\nqliu1@uci.edu\n\nMark Steyvers\n\nDept. of Cognitive Sciences\nUniv. of California, Irvine\n\nmark.steyvers@uci.edu\n\nAlexander Ihler\n\nDept. of Computer Science\nUniv. of California, Irvine\nihler@ics.uci.edu\n\nAbstract\n\nWe study the problem of estimating continuous quantities, such as prices, proba-\nbilities, and point spreads, using a crowdsourcing approach. A challenging aspect\nof combining the crowd\u2019s answers is that workers\u2019 reliabilities and biases are usu-\nally unknown and highly diverse. Control items with known answers can be used\nto evaluate workers\u2019 performance, and hence improve the combined results on the\ntarget items with unknown answers. 
This raises the problem of how many control items to use when the total number of items each worker can answer is limited: more control items evaluate the workers better, but leave fewer resources for the target items that are of direct interest, and vice versa. We give theoretical results for this problem under different scenarios, and provide a simple rule of thumb for crowdsourcing practitioners. As a byproduct, we also provide a theoretical analysis of the accuracy of different consensus methods.

1 Introduction

The recent rise of crowdsourcing has provided a fast and inexpensive way to collect human knowledge and intelligence, as illustrated by human intelligence marketplaces such as Amazon Mechanical Turk, games with a purpose like ESP and reCAPTCHA, and crowd-based forecasting for politics and sports. One of the philosophies behind these successes is the wisdom of crowds phenomenon: properly combining the answers of a group of untrained people can be better than the average performance of the individuals, and even as good as the experts in many application domains (e.g., Surowiecki, 2005, Sheng et al., 2008). Unfortunately, it is not always obvious how best to combine the crowd, because the (often anonymous) workers have unknown and diverse levels of expertise, and potentially systematic biases across the crowd. Naïve consensus methods which simply take uniform averages or the majority answer of the workers are known to perform poorly. This raises the problem of scoring the workers, that is, estimating their expertise, bias, and any other associated parameters, in order to combine their answers more effectively.

One direct way to address this problem is to score workers using their past performance on similar problems. However, this is not always practical, since historical records are hard to maintain for anonymous workers, and their past tasks may be very different from the current one. 
An alternative is the idea behind reCAPTCHA: "seed" some control items with known answers into the assigned tasks (without telling workers which are control items), score the workers using these control items, and weight their answers accordingly on the unknown target items. Similar ideas have been widely used in existing crowdsourcing systems. CrowdFlower, for example, provides interfaces and tools to allow requesters to explicitly specify and analyze a set of control items (sometimes called "gold data"). The reCAPTCHA example is a particularly simple case, where workers answer exactly one control and one target item. However, in general crowdsourcing, the workers may answer hundreds of questions, raising the question of how many control items should be used. There is a clear trade-off: having workers answer more control items gives better estimates of their performance and any potential systematic bias, but leaves fewer resources for the target items that are of direct interest. However, using few control items gives poor estimates of workers' performance and their biases, also leading to bad results. A deep understanding of the value of control items may therefore provide useful guidance for crowdsourcing practitioners.

On the other hand, a line of research has studied more advanced consensus methods that are able to simultaneously estimate the workers' performance and the items' answers without any ground truth on the items, by building joint statistical models of the workers and labels (e.g., Dawid and Skene, 1979, Whitehill et al., 2009, Karger et al., 2011, Liu et al., 2012, Zhou et al., 2012). The basic idea is to score the workers by their agreement with other workers, assuming that the majority of workers are correct. 
Perhaps surprisingly, the worker reliabilities estimated by these "unsupervised" consensus methods can be almost as good as those estimated when the true labels of all the items are known, and are much better than self-evaluated worker reliability (Romney et al., 1987, Lee et al., 2012). Control items can also be incorporated into these methods: but how much can we expect them to improve results, or does an "unsupervised" method suffice?

The goal of this paper is to study the value of control items, and to provide practical guidance on how many control items are enough under different scenarios. We give both theoretical and empirical results for this problem, and provide some rules of thumb that are easy to use in practice. We develop our theory on a class of Gaussian models for estimating continuous quantities, such as forecasting probabilities and point spreads in sports games, and show how it extends to more general models. As a byproduct, we also provide analytic results on the accuracy of different consensus algorithms. Important practical issues such as the impact of model misspecification, systematic biases, and heteroscedasticity are also highlighted on real datasets.

2 Background

Assume there is a set T of target items, associated with a set of labels µ_T := {µ_i : i ∈ T} whose true values µ*_T we want to estimate. In addition, we have a set C of control (or training) items whose true labels µ*_C := {µ*_i : i ∈ C} are known. We denote the set of workers by W; each worker j is associated with a parameter ν*_j that characterizes their expertise, bias, and any other relevant features. We denote the complete vector of worker parameters by ν := {ν*_j : j ∈ W}. Both µ and ν are assumed to be continuous variables in this paper. Denote by n_t the number of target items and by m the number of workers. Let ∂i be the set of workers assigned to item i, and ∂t_j (and ∂c_j) the set of target (and control) items labeled by worker j. The assignment relationship between the workers and the target items can be represented by a bipartite graph G_t = (T, W, E_t), where there is an edge (ij) ∈ E_t iff item i is assigned to worker j. Let r_i be the number of workers assigned to the i-th target item, and let ℓt_j (and ℓc_j) be the number of target (and control) items assigned to the j-th worker. Note that {r_i} and {ℓt_j} are the two degree sequences of the bipartite graph G_t.

Denote by x_ij the label we collect from worker j for item i. In general, we can assume that x_ij is a random variable drawn from a probabilistic distribution p(x_ij | µ*_i, ν*_j). The computational question is then to construct an estimator µ̂_T of the true labels µ*_T based on the crowdsourced labels {x_ij}, such that the expected mean square error (MSE) on the target items, E[‖µ̂_T − µ*_T‖²], is minimized.

Gaussian Model. We focus on a class of simple Gaussian models on the labels x_ij:

x_ij = µ*_i + b*_j + ξ_ij,    ξ_ij ∼ N(0, σ*²),    (1)

where µ*_i is the quantity of interest of item i, b*_j is the bias of worker j, and σ*² is the variance. For some quantities, like probabilities and prices, proper transforms should be applied before using such Gaussian models. Model (1) is equivalent to the two-way fixed effects model in statistics (e.g., Chamberlain, 1982). It captures the heterogeneous biases across workers that are commonly observed in practice, for example in workers' judgments of probabilities and prices, and which can have significant effects on the estimation accuracy. 
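As a concrete illustration (ours, not the paper's), model (1) takes only a few lines to simulate; the N(1, 1) draws for µ_i and b_j below match the simulated-data setup used later in Section 4, and the fully dense assignment (every worker labels every item) is a simplification for the sketch. It also shows why scoring workers matters: the naive uniform average inherits the average worker bias.

```python
import numpy as np

rng = np.random.default_rng(0)
nt, m, sigma = 100, 100, 1.0        # target items, workers, noise std (illustrative)
mu = rng.normal(1, 1, size=nt)      # true item quantities mu*_i
b = rng.normal(1, 1, size=m)        # worker biases b*_j

# model (1): x_ij = mu*_i + b*_j + xi_ij; every worker labels every item here
x = mu[:, None] + b[None, :] + sigma * rng.normal(size=(nt, m))

# the naive consensus (plain averaging over workers) is off by the mean bias
naive = x.mean(axis=1)
print(np.mean(naive - mu))          # close to mean(b*_j), not 0
```

Without control items (or some other side information), this shared offset cannot be separated from the item values, which is exactly the unidentifiability issue discussed next.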
This model also has nice theoretical properties and will play an important role in our theoretical analysis. Note that the biases are not identifiable solely from the crowdsourced labels {x_ij}, making it necessary to add some control items or other information when decoding the answers.

An extension of model (1) is to introduce heteroscedasticity, allowing different workers to have different levels of Gaussian noise: that is, x_ij = µ*_i + b*_j + σ*_j ξ_ij, where ξ_ij ∼ N(0, 1) and σ*_j² is a variance parameter of worker j. We will refer to this extension as the bias-variance model, and to model (1) as the bias-only model. We will also consider another special case, x_ij = µ*_i + σ*_j ξ_ij, which assumes the workers all have zero bias but different variances (the variance-only model). Theoretical analysis of the bias-variance and variance-only models is significantly more difficult due to the presence of the variance parameters, but is still possible under asymptotic assumptions.

2.1 Consensus Algorithms With Partial Ground Truth

Control items with known true labels can be used to estimate workers' parameters, and hence improve the estimation accuracy. In this section, we introduce two types of consensus methods that incorporate the control items in different ways: one simply scores the workers based on their performance on the control items, while the other uses a joint maximum likelihood estimator that scores the workers based on their answers to both control and target items. We present both methods in terms of a general model p(x_ij | µ_i, ν_j) here; the updates for the Gaussian models can be easily derived, but are omitted for space.

Two-stage Estimator: the workers' parameters are first estimated using the control items, and are then used to predict the target items. 
That is,

Scoring workers:     ν̂_j = arg max_{ν_j} Σ_{i ∈ ∂c_j} log p(x_ij | µ*_i, ν_j),    for all j ∈ W,    (2)
Predicting target items:     µ̂_i = arg max_{µ_i} Σ_{j ∈ ∂i} log p(x_ij | µ_i, ν̂_j),    for all i ∈ T,    (3)

where we use the maximum likelihood estimator as a general procedure for estimating parameters.

Joint Estimator: we directly maximize the joint likelihood of the crowdsourced labels {x_ij} of both target and control items, with µ_C of the control items clamped to the true values µ*_C. That is,

[µ̂_T, ν̂] = arg max_{[µ_T, ν]} [ Σ_{i ∈ C} Σ_{j ∈ ∂i} log p(x_ij | µ*_i, ν_j) + Σ_{i ∈ T} Σ_{j ∈ ∂i} log p(x_ij | µ_i, ν_j) ],    (4)

which can be solved by block coordinate descent, alternately optimizing µ_T and ν. Compared to the two-stage estimator, the joint estimator estimates the workers' parameters based on both the control items and the target items, even though the latter's true labels are unknown. This is because the labels x_ij provide information on µ*_i through the model assumption p(x_ij | µ*_i, ν*_j). Therefore, the joint estimator may be much more efficient than the two-stage estimator when the model assumptions are satisfied, but may perform poorly if the model is misspecified.

3 How many control items are enough?

We now consider the central question: assuming each worker answers ℓ items (we refer to ℓ as the budget), including k control items and ℓ − k target items, what is the optimal choice of k to minimize the expected MSE? To be concrete, here we assume all the workers (items) are assigned the same number of randomly selected items (workers), and hence the assignment graph G_t is a random semi-regular bipartite graph, which can be generated by the configuration model (e.g., Karger et al., 2011). We assume r is the number of labels received by each target item, so that r = m(ℓ − k)/n_t.

Obviously, the optimal number of control items should depend on their usage in the subsequent consensus method. We will show that the two-stage and joint estimators exploit control items in fundamentally different ways, and yield very different optimal values of k. Roughly speaking, the optimal k should scale as O(√ℓ) when using a two-stage estimator, compared to O(ℓ/√n_t) when using joint estimators. We now discuss these two methods separately.

3.1 Optimal k for Two-stage Estimator

We first address the problem for the bias-only model, which has a particularly simple analytic solution. We then extend our results to more general models.

Theorem 3.1. (i). For the bias-only model with x_ij = µ*_i + b*_j + ξ_ij, where the ξ_ij are i.i.d. noise drawn from N(0, σ*²), the expected mean square error (MSE) of the two-stage estimator in (2)-(3) is

E[ Σ_{i ∈ T} (µ̂_i − µ*_i)² / n_t ] = (σ*²/r)(1 + 1/k).    (5)

(ii). Note that r = m(ℓ − k)/n_t, and the optimal k that minimizes the expected MSE in (5) is k* = ⌈√(ℓ + 5/4) − 3/2⌉ ≈ √ℓ, where ⌈z⌉ denotes the smallest integer no less than z.

Proof. The solution of the two-stage estimator has a simple linear form under the bias-only model,

b̂_j = (1/k) Σ_{i ∈ ∂c_j} (x_ij − µ*_i),    µ̂_i = (1/r) Σ_{j ∈ ∂i} (x_ij − b̂_j),    for all i ∈ T, j ∈ W.

Since the x_ij are Gaussian, the µ̂_i are also Gaussian. Calculating the mean and variance of µ̂_i, we have that E[µ̂_i] = µ*_i, and Var(µ̂_i) as in (5). The remaining steps are straightforward.

Remarks. (i). Eq. (5) shows that the MSE is inversely proportional to the number r of workers per target item, while the number k of control items per worker only refines the multiplicative constant. Therefore, the resources assigned to the control items are much less "useful" than those assigned directly to the target items, suggesting the optimal k should be much less than the budget ℓ.

(ii). On the other hand, if k is too small, the multiplicative constant becomes large, which also degrades the MSE. In the extreme, if k = 0 then the bias is unidentifiable, and the MSE grows to infinity. In addition, if the budget ℓ grows to infinity, the optimal k should also grow to infinity, otherwise the multiplicative constant is strictly larger than one, which is suboptimal. One can readily see that k = O(√ℓ) achieves the desired balance of trade-offs.

General Models. The bias-only model is simple enough to give closed-form solutions. It turns out that we can obtain similar results for more general models such as the bias-variance and the variance-only model, but only in the asymptotic regime.

To set up, assume {µ_i} and {ν_j} are drawn from prior distributions Q_µ and Q_ν, respectively. Assume log p(x_ij | µ_i, ν_j) is twice differentiable w.r.t. µ_i and ν_j for all x_ij. Define the Fisher information matrix H_µµ = −E_x[∇²_µµ log p(x | µ, ν)], and similarly for H_µν and H_νν. 
Note that H_µµ is a random variable dependent on µ and ν; denote by E_µν[H_µµ] its expectation w.r.t. Q_µ and Q_ν.

Theorem 3.2. (i). Assume the crowdsourced labels {x_ij} are drawn from p(x_ij | µ*_i, ν*_j), where {µ*_i} and {ν*_j} are drawn from the priors Q_µ and Q_ν, respectively. The asymptotic expected MSE of the two-stage estimator defined in (2)-(3), as both r and k grow to infinity, is

E[ Σ_{i ∈ T} (µ̂_i − µ*_i)² / n_t ] = (σ̃²/r)(1 + a/k),    (6)

where σ̃² = E_µν[tr(H⁻¹_µµ)], J_µµ = E_{x,ν}[∇²_µν log p(x | µ, ν) H⁻¹_νν ∇²_µν log p(x | µ, ν)ᵀ], and a = E_µν[tr(H⁻¹_µµ J_µµ H⁻¹_µµ)] / E_µν[tr(H⁻¹_µµ)].

(ii). Note that r = m(ℓ − k)/n_t, and the optimal k that minimizes the asymptotic MSE in (6) is k* = ⌈√(aℓ + a² + 1/4) − a − 1/2⌉ ≈ √(aℓ), where ⌈z⌉ denotes the smallest integer no less than z.

Proof. Similar to Theorem 3.1, except that the asymptotic normality of M-estimators (e.g., Van der Vaart, 2000) should be used.

Remarks. (i). The result in Theorem 3.2 is parallel to that in Theorem 3.1 for bias-only models, except that the contribution from the uncertainty in the workers' parameters is scaled by a model-dependent factor a, and correspondingly, the optimal k is scaled by √a. Calculation yields a = 2 for the variance-only model, and a = 3 for the bias-variance model, for any choice of priors Q_µ and Q_ν.

(ii). Letting k take continuous values, the optimal k to minimize (6) is k* = √(aℓ + a²) − a, which achieves a minimum MSE of (n_t/m) · σ̃²/(ℓ − 2k*). For comparison, the MSE would be (n_t/m) · σ̃²/(ℓ − k*) if the worker parameters were known exactly. So, the uncertainty in the workers' parameters creates an effective extra loss of k* labels for each target item. Note that this rule is universal, in that it remains true for any a (and hence any model).

3.2 Optimal k for Joint Estimator

The two-stage estimator is easy to analyze in that its accuracy is independent of the structure of the bipartite assignment graph beyond the degrees r and k. This is not true for the joint estimator, whose accuracy depends on the topological structure of the assignment graph in a non-trivial way. In this section we study the properties of the joint estimator, again starting with the simple bias-only model, then discussing its extension to more general cases.

We first introduce some matrix notation. Let A_t be the adjacency matrix of G_t. Let R_t := diag({r_i : i ∈ T}) be the diagonal matrix formed by the degree sequence of the target items, and similarly define L_t = diag({ℓt_j : j ∈ W}) and L_c = diag({ℓc_j : j ∈ W}).

Theorem 3.3. (i). For the bias-only model with x_ij = µ*_i + b*_j + ξ_ij, where the ξ_ij are i.i.d. noise drawn from N(0, σ*²), the expected MSE of the joint estimator defined in (4) is

E[ Σ_{i ∈ T} (µ̂_i − µ*_i)² / n_t ] = σ*² tr((R_t − A_t (L_t + L_c)⁻¹ A_tᵀ)⁻¹) / n_t.    (7)

If A_t is regular, with R_t = rI and L_t = (ℓ − k)I, this simplifies to

E[ Σ_{i ∈ T} (µ̂_i − µ*_i)² / n_t ] = (σ*²/r) tr((I − ((ℓ − k)/ℓ) W)⁻¹) / n_t,    where W = R_t⁻¹ A_t L_t⁻¹ A_tᵀ.    (8)

Proof. 
Assume B := I − R_t⁻¹ A_t (L_t + L_c)⁻¹ A_tᵀ is invertible. The solution of the joint estimator for the bias-only model is µ̂_T = µ*_T + B⁻¹ z_T, where z_i = (1/r_i) Σ_{j ∈ ∂i} (ξ_ij − ξ̄_j), ξ̄_j = (1/(ℓc_j + ℓt_j)) Σ_{i′ ∈ ∂c_j ∪ ∂t_j} ξ_i′j, and ξ_ij = x_ij − µ*_i − b*_j for all i ∈ T. We obtain (7) by calculating Var(µ̂_T).

Remarks. (ii). Equation (8) establishes an explicit connection between the MSE and the spectral structure of the bipartite graph G_t. Consider the eigenvalues 1 = λ_1 ≥ λ_2 ≥ ⋯ ≥ 0 of W := R_t⁻¹ A_t L_t⁻¹ A_tᵀ, where the second-largest eigenvalue λ_2 famously characterizes the connectivity of the graph G_t. Roughly speaking, G_t has better connectivity if λ_2 is small, and vice versa. Observe that

tr((I − ((ℓ − k)/ℓ) W)⁻¹) = Σ_{i=1}^{n_t} (1 − ((ℓ − k)/ℓ) λ_i)⁻¹ ≤ ℓ/k + (n_t − 1)/(1 − ((ℓ − k)/ℓ) λ_2).    (9)

Therefore, the joint estimator performs better when λ_2 is small, i.e., when the graph is strongly connected. Intuitively, better connectivity "couples" the items and workers more tightly together, making it easier not to make mistakes during inference.

Besides hoping for small error, one may also want the assignment graph to be sparse, i.e., to use fewer labels. Graphs that are both sparse and strongly connected are known as expander graphs, and have been found universally important in areas like robust computer networks, error-correcting codes, and communication networks; see Hoory et al. (2006) for a review. 
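The spectral form (8) and the bound (9) are easy to verify numerically. Below is a small sketch (our own construction, not from the paper): it builds a random d-regular bipartite assignment graph as a sum of d random permutation matrices (one way a configuration-model-style graph can arise, possibly with multi-edges), then checks that (7) and (8) agree and that (9) holds.

```python
import numpy as np

rng = np.random.default_rng(2)
nt = m = 60                      # targets = workers (as in Theorem 3.4)
d, k = 10, 3                     # target labels per worker, controls per worker
ell = d + k                      # total budget per worker

# random d-regular bipartite graph: sum of d random permutation matrices
A = sum(np.eye(nt)[rng.permutation(nt)] for _ in range(d))

R = d * np.eye(nt)               # target-item degrees R_t
Lt, Lc = d * np.eye(m), k * np.eye(m)

# MSE / sigma^2 via (7) and via the spectral form (8); they agree exactly
mse7 = np.trace(np.linalg.inv(R - A @ np.linalg.inv(Lt + Lc) @ A.T)) / nt
W = np.linalg.inv(R) @ A @ np.linalg.inv(Lt) @ A.T
mse8 = np.trace(np.linalg.inv(np.eye(nt) - (d / ell) * W)) / (d * nt)
print(abs(mse7 - mse8))          # numerically zero

# the bound (9) in terms of the second-largest eigenvalue of W
lam = np.sort(np.linalg.eigvalsh(W))[::-1]   # lam[0] = 1
lhs = np.trace(np.linalg.inv(np.eye(nt) - (d / ell) * W))
rhs = ell / k + (nt - 1) / (1 - (d / ell) * lam[1])
print(lhs <= rhs + 1e-9)
```

Shrinking lam[1] (e.g. by increasing d) tightens the bound, which is the "better connectivity, smaller MSE" effect described above.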
It is well known that large sparse random regular graphs are good expanders (e.g., Friedman et al., 1989), and hence a near-optimal allocation strategy for crowdsourcing (Karger et al., 2011). On such graphs, we can also estimate the optimal k in a simple form.

Theorem 3.4. Assume A_t is a random regular bipartite graph, and n_t = m. We have that

E[ Σ_{i ∈ T} (µ̂_i − µ*_i)² / n_t ] = (σ*²/(ℓ − k)) [ (1 + O(1/√(n_t − 1))) + ℓ/(n_t k) ],    (10)

with probability one as n_t → ∞. If in addition ℓ → ∞, the optimal k that minimizes (10) is k* = ⌈√(ℓ²/n_t + ℓ²/n_t² + 1/4) − ℓ/n_t − 1/2⌉ ≈ ℓ/√n_t.

Proof. Use (9) and the bound in Puder (2012) for λ_2 of large random regular bipartite graphs.

Remarks. (i). Perhaps surprisingly, the optimal k of the joint estimator scales linearly w.r.t. the budget ℓ, in contrast to the square-root rule of two-stage estimators. However, since usually ℓ ≤ n_t, we have ℓ/√n_t ≤ √ℓ, that is, the joint estimator requires fewer control items than the two-stage estimator.

(ii). In addition, the optimal k for the joint estimator also decreases as the total number n_t of target items increases. Because n_t is usually quite large in practice, the number of control items is usually very small. In particular, as n_t → ∞, we have k* = 1, that is, there is no need for control items beyond fixing the unidentifiability issue of the biases.

General Models. The joint estimator for general models is more involved to analyze, but it is still possible to give a rough estimate by analyzing the Fisher information matrix of the likelihood. For notation, let H_µµ = R_t ⊗ E_µν(H_µµ) and H_νν = (L_t + L_c) ⊗ E_µν(H_νν), where ⊗ is the Kronecker product, and let H_µν = [H_µiνj]_ij be a block matrix, where the block H_µiνj for (ij) ∈ E_t is a random copy of −∇²_µν log p(x | µ, ν) with random x, µ and ν, and H_µiνj = 0 for (ij) ∉ E_t. Assuming the joint maximum likelihood estimator in (4) is asymptotically consistent (in terms of large ℓ and r), we can estimate its asymptotic MSE by the inverse of the Fisher information matrix,

E[ Σ_{i ∈ T} (µ̂_i − µ*_i)² / n_t ] ≈ E[ tr((H_µµ − H_µν H_νν⁻¹ H_µνᵀ)⁻¹) ] / n_t,

where the expectation on the right side is w.r.t. the randomness of H_µν. This parallels (7) in Theorem 3.3, except that the adjacency matrices are replaced by the corresponding Hessian matrices. Unfortunately, it is more challenging to give a simple estimate of the optimal k as in Theorem 3.4, even when A_t is a random bipartite graph, because the spectral properties of the random matrix are complicated by the blockwise structure, and may depend on the prior distribution Q(ν). However, experimentally the optimal k follows the trend ℓ√(a/n_t), where the constant a depends on both the model assumption and the choice of Q(ν), and can be numerically estimated by simulation.

4 Experiments

We show that our theoretical predictions match closely the results on simulated data and on two real datasets for estimating prices and point spreads. The experiments also highlight important practical issues such as the impact of model misspecification, biases, and heteroskedasticity.

Datasets and Setup. The simulated data are generated by the Gaussian models defined in Section 2, where µ_i and b_j are i.i.d. drawn from N(1, 1), and σ_j from a χ²-distribution with 4 degrees of freedom for the heteroskedastic versions. 
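As a sanity check on Theorem 3.1, the simulated setup just described can be run through the closed-form two-stage estimator in a few lines. This is our own sketch with a simplified assignment: every worker answers the same k control items and all n_t target items, so r = m and the empirical MSE should match formula (5).

```python
import numpy as np

rng = np.random.default_rng(1)
nt, m, k, sigma = 50, 30, 5, 1.0     # targets, workers, controls per worker, noise
mu = rng.normal(1, 1, nt)            # unknown target quantities mu*_i
mu_c = rng.normal(1, 1, k)           # known answers mu*_i of the control items
b = rng.normal(1, 1, m)              # worker biases b*_j

mses = []
for _ in range(2000):
    xc = mu_c[:, None] + b + sigma * rng.normal(size=(k, m))   # control labels
    xt = mu[:, None] + b + sigma * rng.normal(size=(nt, m))    # target labels
    b_hat = (xc - mu_c[:, None]).mean(axis=0)  # score workers on the controls
    mu_hat = (xt - b_hat).mean(axis=1)         # bias-corrected average
    mses.append(np.mean((mu_hat - mu) ** 2))

r = m  # every worker answers every target item in this sketch
print(np.mean(mses), sigma**2 / r * (1 + 1 / k))  # empirical MSE vs formula (5)
```

Sweeping k in this loop reproduces the trade-off shown in Figure 1(a): the MSE blows up as k → 0 and again as k approaches the budget.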
The price dataset consists of 80 household items collected from stores like Amazon and Costco, whose prices are estimated by 155 undergraduate students at UC Irvine. A log transform is applied to the prices before using the Gaussian models. The National Football League (NFL) forecasting data was collected by Massey et al. (2011), where 386 participants were asked to predict the point differences of 245 NFL games. We use the point spreads determined by professional bookmakers as the truth values in our experiments.

For all the experiments, we first construct the sets of target and control items by randomly partitioning the items, and then randomly assign each worker k control items and ℓ − k target items, for varying values of ℓ and k. The MSE is estimated by averaging over 500 random trials. The optimal k is estimated by minimizing the averaged MSE over 300 randomly subsampled trials, and then averaging over 20 random subsamples.

Optimal Number of Control Items. See Figure 1 for the results of the bias-only model when the data are simulated from the correct model. Figure 1(a) shows the empirical MSE of the two-stage estimator as the number k of control items varies. A clear trade-off appears: the MSE is large both when k is too small to estimate the workers' parameters accurately, and when k is too large to leave a sufficient number of labels for the target items. The MSE of the joint estimator in Figure 1(b) follows a similar trend, but the gain from using control items is less significant (the left parts of the curves are flatter). This is because the joint estimator leverages the labels on the target items (whose true values are unknown), and relies less on the control items. In particular, as the number n_t of target items increases, the optimal value of k for the joint estimator decreases at a rate of 1/√n_t (see Figure 1(d)), but that of the two-stage estimator stays the same. Overall, the empirical optimal k of the two-stage and joint estimators aligns closely with our theoretical predictions (Figure 1(c)-(d)).

We show in Figure 2(a) the results of the bias-variance model when data are simulated from the correct model. The optimal k of the two-stage estimator aligns closely with the asymptotic result √(aℓ) with a = 3 in Theorem 3.2, while that of the joint estimator scales like the line ℓ√(a/n_t) with a ≈ 3, matching our hypothesis in Section 3.2.

(a) Two-stage Estimator    (b) Joint Estimator    (c) Optimal k vs. ℓ    (d) Optimal k vs. n_t

Figure 1: Results of the bias-only model on data simulated from the same model. (a)-(b) The MSE of the two-stage and joint estimators with varying ℓ and k and fixed n_t = 100. The stars and circles denote the empirically and theoretically optimal k, respectively. (c) The optimal k with varying ℓ, but fixed n_t = 100. (d) The optimal k with varying n_t, but fixed ℓ = 50. We set m = n_t here.

Model misspecification. Real datasets are not expected to match the model assumptions perfectly. It is important, but difficult, to understand how the theory should be modified to compensate for the violation of assumptions. We provide some insights on this by constructing model misspecification artificially. Figure 2(b)-(c) shows the results when the data are simulated from a bias-variance model with non-zero biases, but we use the variance-only model (with zero bias) in the consensus algorithm. We see in Figure 2(b) that the optimal k of the two-stage estimator still aligns closely with our theoretical prediction, but that of the joint estimator is much larger than one would expect (almost half of the budget ℓ). 
In addition, the MSE of the joint estimator in this case is significantly worse than that of the two-stage estimator (see Figure 2(c)), which is not expected if the model assumption holds. Therefore, the joint estimator seems to be more sensitive to model misspecification than the two-stage estimator, suggesting that caution should be taken when it is applied in practice.

Real Datasets. Figure 3 shows the results of the bias-only model on the two real datasets; our prediction of the optimal k matches the empirical results surprisingly well on the NFL dataset (Figure 3(d)-(f)), while our theoretically optimal values of k on the price dataset tend to be smaller than the actual values (Figure 3(a)-(c)), perhaps caused by some unknown model misspecification. However, our bias in the estimated k does not cause a significant increase in MSE, because the scale in Figure 3(a)-(b) is relatively small compared to that in Figure 4(a).

Interestingly, the two real datasets have opposite properties in terms of the importance of bias and heteroskedasticity (see Figure 4): in the price dataset, all the workers tend to underestimate the prices of the products, i.e., the b_j are negative for all workers, and the bias-only model performs much better than the zero-bias variance-only model. In contrast, the participants in the NFL dataset exhibit no systematic bias but seem to have different individual variances, and the variance-only model works better than the bias-only model. In both cases, the full bias-variance model works best if the budget ℓ is large, but is not necessarily best if the budget is small and over-fitting is an issue.

5 Conclusion

The problem of how many control questions to use is unlikely to yield a definitive answer, since real data are always likely to be more complicated than any model. 
However, our results highlight several issues and provide insights and rules of thumb that can help crowdsourcing practitioners make their own decisions. In particular, we show that the optimal number of control items should be O(√ℓ) for the two-stage estimator and O(ℓ/√nt) for the joint estimator. Because the number nt of target items is usually large in practice, it is reasonable to recommend using a minimal number of control items, just enough to fix potential unidentifiability issues, assuming the model assumptions hold well. However, the joint estimator may require significantly more control items if model misspecification exists; in this case one might do better to switch to the more robust two-stage estimator, or to search for better models. The control items can also be used for model selection, an issue which deserves further discussion in the future.

Acknowledgements. Work supported in part by NSF IIS-1065618 and IIS-1254071 and a Microsoft Research Fellowship. Thanks to Tobias Johnson for discussion on random matrix theory.

Figure 2: (a) Results of the bias-variance model on data simulated from the same model. (b)-(c) Results when the data are simulated from the bias-variance model with non-zero biases, but we use the variance-only model (with zero bias) in the consensus algorithm.
With this model misspecification, the joint estimator requires significantly more control items than one would expect (almost half of the budget ℓ), and performs worse than the two-stage estimator.

Figure 3: Results on the real datasets when using the bias-only model; panels (a)-(c) show the price dataset and (d)-(f) the NFL dataset. (a)-(b) and (d)-(e) The MSE when using the two-stage and joint estimators, respectively. (c) and (f) The empirically and theoretically optimal k as the budget ℓ varies. Here we fix nt = 50 for the price dataset and nt = 200 for the NFL dataset.

Figure 4: Comparison of different models and consensus methods on the two real datasets: (a) price dataset, (b) NFL dataset. (a)-(b) The MSE when selecting the best possible k as the budget ℓ varies. The workers in the price dataset have a systematic bias, and the bias-only model works better than the variance-only model, while the workers in the NFL dataset have no bias but different individual variances, and the variance-only model is better than the bias-only model.
In both datasets, the full bias-variance model works best if the budget ℓ is large, but is not necessarily best if the budget is small and over-fitting is an issue.

References

James Surowiecki. The wisdom of crowds. Anchor, 2005.

Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proc. SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pages 614–622. ACM, 2008.

A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.

Jacob Whitehill, Paul Ruvolo, Tingfan Wu, Jacob Bergsma, and Javier Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise.
In Advances in Neural Information Processing Systems (NIPS), pages 2035–2043, 2009.

D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems (NIPS), pages 1953–1961, 2011.

Qiang Liu, Jian Peng, and Alexander Ihler. Variational inference for crowdsourcing. In Advances in Neural Information Processing Systems (NIPS), pages 701–709, 2012.

Dengyong Zhou, John Platt, Sumit Basu, and Yi Mao. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems (NIPS), pages 2204–2212, 2012.

A. Kimball Romney, William H. Batchelder, and Susan C. Weller. Recent applications of cultural consensus theory. American Behavioral Scientist, 31(2):163–177, 1987.

Michael D. Lee, Mark Steyvers, Mindy de Young, and Brent Miller. Inferring expertise in knowledge and prediction ranking tasks. Topics in Cognitive Science, 4(1):151–163, 2012.

Gary Chamberlain. Multivariate regression models for panel data. Journal of Econometrics, 18(1):5–46, 1982.

Aad W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439–561, 2006.

Joel Friedman, Jeff Kahn, and Endre Szemerédi. On the second eigenvalue of random regular graphs. In Proc. ACM Symp. on Theory of Computing, pages 587–598. ACM, 1989.

Doron Puder. Expansion of random graphs: New proofs, new results. arXiv preprint arXiv:1212.5216, 2012.

Cade Massey, Joseph P. Simmons, and David A. Armor. Hope over experience: Desirability and the persistence of optimism.
Psychological Science, 22(2):274–281, 2011.