{"title": "Spectral Methods meet EM: A Provably Optimal Algorithm for Crowdsourcing", "book": "Advances in Neural Information Processing Systems", "page_first": 1260, "page_last": 1268, "abstract": "The Dawid-Skene estimator has been widely used for inferring the true labels from the noisy labels provided by non-expert crowdsourcing workers. However, since the estimator maximizes a non-convex log-likelihood function, it is hard to theoretically justify its performance. In this paper, we propose a two-stage efficient algorithm for multi-class crowd labeling problems. The first stage uses the spectral method to obtain an initial estimate of parameters. Then the second stage refines the estimation by optimizing the objective function of the Dawid-Skene estimator via the EM algorithm. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor. We conduct extensive experiments on synthetic and real datasets. Experimental results demonstrate that the proposed algorithm is comparable to the most accurate empirical approach, while outperforming several other recently proposed methods.", "full_text": "Spectral Methods Meet EM: A Provably Optimal\n\nAlgorithm for Crowdsourcing\n\nYuchen Zhang\u2020\n\nXi Chen(cid:93)\n\nDengyong Zhou\u2217 Michael I. Jordan\u2020\n\n\u2020University of California, Berkeley, Berkeley, CA 94720\n\n{yuczhang,jordan}@berkeley.edu\n(cid:93)New York University, New York, NY 10012\n\nxichen@nyu.edu\n\n\u2217Microsoft Research, 1 Microsoft Way, Redmond, WA 98052\n\ndengyong.zhou@microsoft.com\n\nAbstract\n\nThe Dawid-Skene estimator has been widely used for inferring the true labels\nfrom the noisy labels provided by non-expert crowdsourcing workers. However,\nsince the estimator maximizes a non-convex log-likelihood function, it is hard to\ntheoretically justify its performance. In this paper, we propose a two-stage ef\ufb01-\ncient algorithm for multi-class crowd labeling problems. 
The first stage uses the spectral method to obtain an initial estimate of parameters. Then the second stage refines the estimation by optimizing the objective function of the Dawid-Skene estimator via the EM algorithm. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor. We conduct extensive experiments on synthetic and real datasets. Experimental results demonstrate that the proposed algorithm is comparable to the most accurate empirical approach, while outperforming several other recently proposed methods.

1 Introduction

With the advent of online crowdsourcing services such as Amazon Mechanical Turk, crowdsourcing has become an appealing way to collect labels for large-scale data. Although this approach has virtues in terms of scalability and immediate availability, labels collected from the crowd can be of low quality since crowdsourcing workers are often non-experts and can be unreliable. As a remedy, most crowdsourcing services resort to labeling redundancy, collecting multiple labels from different workers for each item. Such a strategy raises a fundamental problem in crowdsourcing: how to infer true labels from noisy but redundant worker labels?

For labeling tasks with k different categories, Dawid and Skene [8] propose a maximum likelihood approach based on the Expectation-Maximization (EM) algorithm. They assume that each worker is associated with a k × k confusion matrix, where the (l, c)-th entry represents the probability that a randomly chosen item in class l is labeled as class c by the worker. The true labels and worker confusion matrices are jointly estimated by maximizing the likelihood of the observed worker labels, where the unobserved true labels are treated as latent variables. Although this EM-based approach has had empirical success [21, 20, 19, 26, 6, 25], there is as yet no theoretical guarantee for its performance.
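To make the generative model concrete, here is a small simulation sketch of the Dawid-Skene setup described above, together with the majority-vote baseline it is usually compared against. All sizes, the class prior, and the per-worker accuracy values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 30, 200, 3                     # workers, items, classes (toy sizes)
w = np.array([0.5, 0.3, 0.2])            # class prior: P[y_j = l] = w_l

# Confusion matrices: C[i, l, c] = P(worker i reports c | true class l).
# (The paper stores the distribution for true class l as a matrix column;
# here we store it as a row, purely for sampling convenience.)
C = np.empty((m, k, k))
for i in range(m):
    acc = rng.uniform(0.5, 0.9)          # assumed per-worker accuracy
    C[i] = np.full((k, k), (1 - acc) / (k - 1))
    np.fill_diagonal(C[i], acc)

y = rng.choice(k, size=n, p=w)           # latent true labels
Z = np.array([[rng.choice(k, p=C[i, y[j]]) for j in range(n)]
              for i in range(m)])        # observed labels, shape (m, n)

# Majority-vote baseline: pick the most frequent label per item.
counts = np.apply_along_axis(np.bincount, 0, Z, minlength=k)  # (k, n)
y_mv = counts.argmax(axis=0)
print("majority-vote accuracy:", (y_mv == y).mean())
```

With many reasonably accurate workers, majority voting is already strong; the regime the paper targets is the harder one where worker quality is heterogeneous and confusion matrices must be estimated.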
A recent theoretical study [10] shows that the global optimal solutions of the Dawid-Skene estimator can achieve minimax rates of convergence in a simplified scenario, where the labeling task is binary and each worker has a single parameter to represent her labeling accuracy (referred to as a "one-coin model" in what follows). However, since the likelihood function is non-convex, this guarantee is not operational because the EM algorithm may get trapped in a local optimum. Several alternative approaches have been developed that aim to circumvent the theoretical deficiencies of the EM algorithm, still in the context of the one-coin model [14, 15, 11, 7]. Unfortunately, they either fail to achieve the optimal rates or depend on restrictive assumptions which are hard to justify in practice.

We propose a computationally efficient and provably optimal algorithm to simultaneously estimate true labels and worker confusion matrices for multi-class labeling problems. Our approach is a two-stage procedure, in which we first compute an initial estimate of the worker confusion matrices using the spectral method, and then in the second stage we turn to the EM algorithm. Under some mild conditions, we show that this two-stage procedure achieves minimax rates of convergence up to a logarithmic factor, even after only one iteration of EM. In particular, given any δ ∈ (0, 1), we provide bounds on the number of workers and the number of items so that our method can correctly estimate labels for all items with probability at least 1 − δ. We also establish a lower bound to demonstrate the optimality of this approach. Further, we provide both upper and lower bounds for estimating the confusion matrix of each worker and show that our algorithm achieves the optimal accuracy.

This work not only provides an optimal algorithm for crowdsourcing but also sheds light on understanding the general method of moments.
Empirical studies show that when the spectral method is used as an initialization for the EM algorithm, it outperforms EM with random initialization [18, 5]. This work provides a concrete way to theoretically justify such observations. It is also known that, starting from a root-n consistent estimator obtained by the spectral method, one Newton-Raphson step leads to an asymptotically optimal estimator [17]. However, obtaining a root-n consistent estimator and performing a Newton-Raphson step can be computationally demanding. In contrast, our initialization doesn't need to be root-n consistent, so a small portion of the data suffices for initialization. Moreover, performing one iteration of EM is computationally more attractive and numerically more robust than a Newton-Raphson step, especially for high-dimensional problems.

2 Related Work

Many methods have been proposed to address the problem of estimating true labels in crowdsourcing [23, 20, 22, 11, 19, 26, 7, 15, 14, 25]. The methods in [20, 11, 15, 19, 14, 7] are based on the generative model proposed by Dawid and Skene [8]. In particular, Ghosh et al. [11] propose a method based on Singular Value Decomposition (SVD) which addresses binary labeling problems under the one-coin model. The analysis in [11] assumes that the labeling matrix is full, that is, that each worker labels all items. To relax this assumption, Dalvi et al. [7] propose another SVD-based algorithm which explicitly considers the sparsity of the labeling matrix in both algorithm design and theoretical analysis. Karger et al. propose an iterative algorithm for binary labeling problems under the one-coin model [15] and extend it to multi-class labeling tasks by converting a k-class problem into k − 1 binary problems [14]. This line of work assumes that tasks are assigned to workers according to a random regular graph, thus imposing specific constraints on the number of workers and the number of items.
In Section 5, we compare our theoretical results with those of the existing approaches [11, 7, 15, 14]. The methods in [20, 19, 6] incorporate Bayesian inference into the Dawid-Skene estimator by assuming a prior over confusion matrices. Zhou et al. [26, 25] propose a minimax entropy principle for crowdsourcing which leads to an exponential family model parameterized with worker ability and item difficulty. When all items have zero difficulty, the exponential family model reduces to the generative model suggested by Dawid and Skene [8].

Our method for initializing the EM algorithm in crowdsourcing is inspired by recent work using spectral methods to estimate latent variable models [3, 1, 4, 2, 5, 27, 12, 13]. The basic idea in this line of work is to compute third-order empirical moments from the data and then to estimate parameters by computing a certain orthogonal decomposition of a tensor derived from the moments. Given the special symmetric structure of the moments, the tensor factorization can be computed efficiently using the robust tensor power method [3]. A problem with this approach is that the estimation error can have a poor dependence on the condition number of the second-order moment matrix, and thus empirically it sometimes performs worse than EM with multiple random initializations. Our method, by contrast, requires only a rough initialization from the method of moments; we show that the estimation error does not depend on the condition number (see Theorem 2 (b)).

3 Problem Setup

Throughout this paper, [a] denotes the integer set {1, 2, ..., a} and σ_b(A) denotes the b-th largest singular value of the matrix A. Suppose that there are m workers, n items and k classes. The true label y_j of item j ∈ [n] is assumed to be sampled from a probability distribution P[y_j = l] = w_l, where {w_l : l ∈ [k]} are positive values satisfying Σ_{l=1}^k w_l = 1. Denote by a vector z_ij ∈ R^k the label that worker i assigns to item j. When the assigned label is c, we write z_ij = e_c, where e_c represents the c-th canonical basis vector in R^k, in which the c-th entry is 1 and all other entries are 0. A worker may not label every item. Let π_i indicate the probability that worker i labels a randomly chosen item. If item j is not labeled by worker i, we write z_ij = 0.

Algorithm 1: Estimating confusion matrices
Input: integer k, observed labels z_ij ∈ R^k for i ∈ [m] and j ∈ [n].
Output: confusion matrix estimates Ĉ_i ∈ R^{k×k} for i ∈ [m].
(1) Partition the workers into three disjoint and non-empty groups G_1, G_2 and G_3. Compute the group aggregated labels Z_gj by Eq. (1).
(2) For (a, b, c) ∈ {(2, 3, 1), (3, 1, 2), (1, 2, 3)}, compute the second- and third-order moments M̂_2 ∈ R^{k×k}, M̂_3 ∈ R^{k×k×k} by Eqs. (2a)-(2d), then compute Ĉ⋄_c ∈ R^{k×k} and Ŵ ∈ R^{k×k} by tensor decomposition:
  (a) Compute the whitening matrix Q̂ ∈ R^{k×k} (such that Q̂^T M̂_2 Q̂ = I) using SVD.
  (b) Compute the eigenvalue-eigenvector pairs {(α̂_h, v̂_h)}_{h=1}^k of the whitened tensor M̂_3(Q̂, Q̂, Q̂) by using the robust tensor power method [3]. Then compute ŵ_h = α̂_h^{-2} and μ̂⋄_h = (Q̂^T)^{-1}(α̂_h v̂_h).
  (c) For l = 1, ..., k, set the l-th column of Ĉ⋄_c to the μ̂⋄_h whose l-th coordinate has the greatest component, then set the l-th diagonal entry of Ŵ to ŵ_h.
(3) Compute Ĉ_i by Eq. (3).
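To illustrate this label encoding and the group averaging of step (1), here is a minimal sketch; the helper names are ours, and missing labels are coded as −1 in a raw (m × n) label matrix before being converted to the one-hot vectors z_ij.

```python
import numpy as np

def one_hot_labels(raw, k):
    """raw[i, j] in {0, ..., k-1} is worker i's label for item j,
    or -1 if the item was not labeled.  Returns z of shape (m, n, k)
    with z[i, j] = e_c for label c, or the zero vector if unlabeled."""
    m, n = raw.shape
    z = np.zeros((m, n, k))
    ii, jj = np.nonzero(raw >= 0)
    z[ii, jj, raw[ii, jj]] = 1.0
    return z

def group_aggregate(z, groups):
    """Z[g, j] = (1/|G_g|) * sum over i in G_g of z_ij  (cf. Eq. (1))."""
    return np.stack([z[list(G)].mean(axis=0) for G in groups])

# Toy example: 3 workers, 2 items, k = 2; worker 1 skipped item 1.
raw = np.array([[0, 1],
                [1, -1],
                [0, 0]])
z = one_hot_labels(raw, k=2)
Z = group_aggregate(z, groups=[[0, 1], [2]])
```

With this encoding, an unlabeled pair contributes nothing to any sum, which is exactly the role of the convention z_ij = 0 above.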
Our goal is to estimate the true labels {y_j : j ∈ [n]} from the observed labels {z_ij : i ∈ [m], j ∈ [n]}. In order to obtain an estimator, we need to make assumptions on the process of generating observed labels. Following the work of Dawid and Skene [8], we assume that the probability that worker i labels an item in class l as class c is independent of any particular chosen item, that is, it is a constant over j ∈ [n]. Let us denote the constant probability by μ_ilc. Let μ_il = [μ_il1 μ_il2 ··· μ_ilk]^T. The matrix C_i = [μ_i1 μ_i2 ... μ_ik] ∈ R^{k×k} is called the confusion matrix of worker i. Besides estimating the true labels, we also want to estimate the confusion matrix for each worker.

4 Our Algorithm

In this section, we present an algorithm to estimate confusion matrices and true labels. Our algorithm consists of two stages. In the first stage, we compute an initial estimate of the confusion matrices via the method of moments. In the second stage, we perform the standard EM algorithm, taking the result of Stage 1 as an initialization.

4.1 Stage 1: Estimating Confusion Matrices

Partitioning the workers into three disjoint and non-empty groups G_1, G_2 and G_3, the outline of this stage is the following: we use the spectral method to estimate the averaged confusion matrices for the three groups, then utilize this intermediate estimate to obtain the confusion matrix of each individual worker. In particular, for g ∈ {1, 2, 3} and j ∈ [n], we calculate the averaged labeling within each group by

    Z_gj := (1/|G_g|) Σ_{i∈G_g} z_ij.    (1)

Denoting the aggregated confusion matrix columns by μ⋄_gl := E(Z_gj | y_j = l) = (1/|G_g|) Σ_{i∈G_g} π_i μ_il, our first step is to estimate C⋄_g := [μ⋄_g1, μ⋄_g2, ..., μ⋄_gk] and the distribution of true labels W := diag(w_1, w_2, ..., w_k). The following proposition shows that we can solve for C⋄_g and W from the moments of {Z_gj}.

Proposition 1 (Anandkumar et al. [3]). Assume that the vectors {μ⋄_g1, μ⋄_g2, ..., μ⋄_gk} are linearly independent for each g ∈ {1, 2, 3}. Let (a, b, c) be a permutation of {1, 2, 3}. Define

    Z'_aj := E[Z_cj ⊗ Z_bj] (E[Z_aj ⊗ Z_bj])^{-1} Z_aj,
    Z'_bj := E[Z_cj ⊗ Z_aj] (E[Z_bj ⊗ Z_aj])^{-1} Z_bj,
    M_2 := E[Z'_aj ⊗ Z'_bj]  and  M_3 := E[Z'_aj ⊗ Z'_bj ⊗ Z_cj];

then we have M_2 = Σ_{l=1}^k w_l μ⋄_cl ⊗ μ⋄_cl and M_3 = Σ_{l=1}^k w_l μ⋄_cl ⊗ μ⋄_cl ⊗ μ⋄_cl.

Since we only have finite samples, the expectations in Proposition 1 have to be approximated by empirical moments. In particular, they are computed by averaging over the indices j = 1, 2, ..., n. For each permutation (a, b, c) ∈ {(2, 3, 1), (3, 1, 2), (1, 2, 3)}, we compute

    Ẑ'_aj := ((1/n) Σ_{j=1}^n Z_cj ⊗ Z_bj) ((1/n) Σ_{j=1}^n Z_aj ⊗ Z_bj)^{-1} Z_aj,    (2a)
    Ẑ'_bj := ((1/n) Σ_{j=1}^n Z_cj ⊗ Z_aj) ((1/n) Σ_{j=1}^n Z_bj ⊗ Z_aj)^{-1} Z_bj,    (2b)
    M̂_2 := (1/n) Σ_{j=1}^n Ẑ'_aj ⊗ Ẑ'_bj,    (2c)
    M̂_3 := (1/n) Σ_{j=1}^n Ẑ'_aj ⊗ Ẑ'_bj ⊗ Z_cj.    (2d)

The statement of Proposition 1 suggests that we can recover the columns of C⋄_c and the diagonal entries of W by operating on the moments M̂_2 and M̂_3. This is implemented by the tensor factorization method in Algorithm 1. In particular, the tensor factorization algorithm returns a set of pairs {(μ̂⋄_h, ŵ_h) : h = 1, ..., k}, where each (μ̂⋄_h, ŵ_h) estimates a particular column of C⋄_c (for some μ⋄_cl) and a particular diagonal entry of W (for some w_l). It is important to note that the tensor factorization algorithm doesn't provide a one-to-one correspondence between the recovered columns and the true columns of C⋄_c. Thus, μ̂⋄_1, ..., μ̂⋄_k represents an arbitrary permutation of the true columns.

To discover the index correspondence, we take each μ̂⋄_h and examine its greatest component. We assume that within each group, the probability of assigning a correct label is always greater than the probability of assigning any specific incorrect label. This assumption will be made precise in the next section. As a consequence, if μ̂⋄_h corresponds to the l-th column of C⋄_c, then its l-th coordinate is expected to be greater than the other coordinates. Thus, we set the l-th column of Ĉ⋄_c to the vector μ̂⋄_h whose l-th coordinate has the greatest component (if there are multiple such vectors, we randomly select one of them; if there is no such vector, we randomly select a μ̂⋄_h). Then, we set the l-th diagonal entry of Ŵ to the scalar ŵ_h associated with μ̂⋄_h. Note that by iterating over (a, b, c) ∈ {(2, 3, 1), (3, 1, 2), (1, 2, 3)}, we obtain Ĉ⋄_c for c = 1, 2, 3, respectively.
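The moment computations (2a)-(2d) and the whitening-plus-decomposition step can be sketched as follows. For brevity, the robust tensor power method of [3] is replaced here by a plain power iteration with deflation, which recovers the same decomposition when the moments are exact; the function names are ours.

```python
import numpy as np

def empirical_moments(Za, Zb, Zc):
    """Eqs. (2a)-(2d) for one permutation (a, b, c); each Z* is an
    (n, k) array whose j-th row is the group-averaged label Z_gj."""
    n = Za.shape[0]
    S = lambda X, Y: X.T @ Y / n                           # (1/n) Σ_j X_j ⊗ Y_j
    Za_p = Za @ (S(Zc, Zb) @ np.linalg.inv(S(Za, Zb))).T   # Eq. (2a)
    Zb_p = Zb @ (S(Zc, Za) @ np.linalg.inv(S(Zb, Za))).T   # Eq. (2b)
    M2 = S(Za_p, Zb_p)                                     # Eq. (2c)
    M3 = np.einsum('ja,jb,jc->abc', Za_p, Zb_p, Zc) / n    # Eq. (2d)
    return M2, M3

def tensor_decompose(M2, M3, k, iters=200, seed=0):
    """Whiten with M2, then extract the k eigenpairs of the whitened
    tensor by power iteration with deflation (a simplified stand-in
    for the robust tensor power method).  Returns (w_hat, mu_hat),
    with the columns of mu_hat in arbitrary order."""
    M2 = (M2 + M2.T) / 2
    s, U = np.linalg.eigh(M2)                  # symmetric PSD: eigh ~ SVD
    s, U = s[::-1][:k], U[:, ::-1][:, :k]
    Q = U / np.sqrt(s)                         # Q^T M2 Q = I
    T = np.einsum('abc,ai,bj,ck->ijk', M3, Q, Q, Q)
    rng = np.random.default_rng(seed)
    w_hat, mu_hat = np.zeros(k), np.zeros((Q.shape[0], k))
    for h in range(k):
        v = rng.normal(size=k); v /= np.linalg.norm(v)
        for _ in range(iters):
            v = np.einsum('ijk,j,k->i', T, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)     # eigenvalue α_h
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate
        w_hat[h] = lam ** -2                            # ŵ_h = α_h^{-2}
        mu_hat[:, h] = np.linalg.inv(Q.T) @ (lam * v)   # μ̂ = (Q^T)^{-1}(α_h v_h)
    return w_hat, mu_hat
```

On exact population moments M_2 = Σ_l w_l μ_l ⊗ μ_l and M_3 = Σ_l w_l μ_l ⊗ μ_l ⊗ μ_l with linearly independent columns, this recovers {(w_l, μ_l)} up to the column permutation discussed above.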
There will be three copies of Ŵ estimating the same matrix W; we average them for the best accuracy.

In the second step, we estimate each individual confusion matrix C_i. The following proposition shows that we can recover C_i from the moments of {z_ij}. See [24] for the proof.

Proposition 2. For any g ∈ {1, 2, 3} and any i ∈ G_g, let a ∈ {1, 2, 3}\{g} be one of the remaining group indices. Then

    π_i C_i W (C⋄_a)^T = E[z_ij Z_aj^T].

Proposition 2 suggests a plug-in estimator for C_i. We compute Ĉ_i using the empirical approximation of E[z_ij Z_aj^T] and using the matrices Ĉ⋄_a, Ŵ obtained in the first step. Concretely, we calculate

    Ĉ_i := normalize{ ((1/n) Σ_{j=1}^n z_ij Z_aj^T) (Ŵ (Ĉ⋄_a)^T)^{-1} },    (3)

where the normalization operator rescales the matrix columns, making sure that each column sums to one. The overall procedure for Stage 1 is summarized in Algorithm 1.

4.2 Stage 2: EM Algorithm

The second stage is devoted to refining the initial estimate provided by Stage 1. The joint likelihood of the true labels y and the observed labels z, as a function of the confusion matrices μ, can be written as

    L(μ; y, z) := Π_{j=1}^n Π_{i=1}^m Π_{c=1}^k (μ_{i y_j c})^{I(z_ij = e_c)}.

By assuming a uniform prior over y, we maximize the marginal log-likelihood function ℓ(μ) := log(Σ_{y∈[k]^n} L(μ; y, z)). We refine the initial estimate of Stage 1 by maximizing this objective function, which is implemented by the Expectation-Maximization (EM) algorithm. The EM algorithm takes the values {μ̂_ilc} provided as output by Stage 1 as initialization, then executes the following E-step and M-step for at least one round.

E-step: Calculate the expected value of the log-likelihood function with respect to the conditional distribution of y given z under the current estimate of μ:

    Q(μ) := E_{y|z,μ̂}[log(L(μ; y, z))] = Σ_{j=1}^n Σ_{l=1}^k q̂_jl log( Π_{i=1}^m Π_{c=1}^k (μ_ilc)^{I(z_ij = e_c)} ),
    where  q̂_jl ← exp(Σ_{i=1}^m Σ_{c=1}^k I(z_ij = e_c) log(μ̂_ilc)) / Σ_{l'=1}^k exp(Σ_{i=1}^m Σ_{c=1}^k I(z_ij = e_c) log(μ̂_il'c))  for j ∈ [n], l ∈ [k].    (4)

M-step: Find the estimate μ̂ that maximizes the function Q(μ):

    μ̂_ilc ← Σ_{j=1}^n q̂_jl I(z_ij = e_c) / (Σ_{c'=1}^k Σ_{j=1}^n q̂_jl I(z_ij = e_c'))  for i ∈ [m], l ∈ [k], c ∈ [k].    (5)

In practice, we alternately execute the updates (4) and (5), for one iteration or until convergence. Each update increases the objective function ℓ(μ). Since ℓ(μ) is not concave, the EM update is not guaranteed to converge to the global maximum. It may converge to distinct local stationary points for different initializations. Nevertheless, as we prove in the next section, the EM algorithm is guaranteed to output statistically optimal estimates of the true labels and worker confusion matrices if it is initialized by Algorithm 1.

5 Convergence Analysis

To state our main theoretical results, we first need to introduce some notation and assumptions. Let

    w_min := min{w_l}_{l=1}^k  and  π_min := min{π_i}_{i=1}^m

be the smallest portion of true labels and the most extreme sparsity level of workers.
Our first assumption assumes that both w_min and π_min are strictly positive; that is, every class and every worker contributes to the dataset.

Our second assumption assumes that the confusion matrices for each of the three groups, namely C⋄_1, C⋄_2 and C⋄_3, are nonsingular. As a consequence, if we define matrices S_ab and tensors T_abc for any a, b, c ∈ {1, 2, 3} as

    S_ab := Σ_{l=1}^k w_l μ⋄_al ⊗ μ⋄_bl = C⋄_a W (C⋄_b)^T  and  T_abc := Σ_{l=1}^k w_l μ⋄_al ⊗ μ⋄_bl ⊗ μ⋄_cl,

then there will be a positive scalar σ_L such that σ_k(S_ab) ≥ σ_L > 0.

Our third assumption assumes that within each group, the average probability of assigning a correct label is always higher than the average probability of assigning any incorrect label. To make this statement rigorous, we define a quantity

    κ := min_{g∈{1,2,3}} min_{l∈[k]} min_{c∈[k]\{l}} {μ⋄_gll − μ⋄_glc}

indicating the smallest gap between diagonal entries and non-diagonal entries in the same confusion matrix column. The assumption requires κ to be strictly positive. Note that this assumption is group-based, and thus does not assume the accuracy of any individual worker.

Finally, we introduce a quantity that measures the average ability of workers in identifying distinct labels. For two discrete distributions P and Q, let D_KL(P, Q) := Σ_i P(i) log(P(i)/Q(i)) represent the KL-divergence between P and Q. Since each column of the confusion matrix represents a discrete distribution, we can define the following quantity:

    D := min_{l≠l'} (1/m) Σ_{i=1}^m π_i D_KL(μ_il, μ_il').    (6)

The quantity D lower bounds the averaged KL-divergence between two columns. If D is strictly positive, it means that every pair of labels can be distinguished by at least one subset of workers. As the last assumption, we assume that D is strictly positive.

The following two theorems characterize the performance of our algorithm. We split the convergence analysis into two parts. Theorem 1 characterizes the performance of Algorithm 1, providing sufficient conditions for achieving an arbitrarily accurate initialization. We provide the proof of Theorem 1 in the long version of this paper [24].

Theorem 1. For any scalar δ > 0 and any scalar ε satisfying ε ≤ min{π_min w_min σ_L/(36κk), 2}, if the number of items n satisfies

    n = Ω( k^5 log((k + m)/δ) / (ε^2 π_min^2 w_min^2 σ_L^13) ),

then the confusion matrices returned by Algorithm 1 are bounded as

    ‖Ĉ_i − C_i‖_∞ ≤ ε  for all i ∈ [m],

with probability at least 1 − δ. Here, ‖·‖_∞ denotes the element-wise ℓ∞-norm of a matrix.

Theorem 2 characterizes the error rate in Stage 2. It states that when a sufficiently accurate initialization is taken, the updates (4) and (5) refine the estimates μ̂ and ŷ to the optimal accuracy. See the long version of this paper [24] for the proof.

Theorem 2. Assume that there is a positive scalar ρ such that μ_ilc ≥ ρ for all (i, l, c) ∈ [m] × [k]^2. For any scalar δ > 0, if the confusion matrices Ĉ_i are initialized in a manner such that

    ‖Ĉ_i − C_i‖_∞ ≤ α := min{ρ/2, ρD/16}  for all i ∈ [m],

and the number of workers m and the number of items n satisfy

    m = Ω( log(mk/δ)/D )  and  n = Ω( (log(1/ρ) log(kn/δ) + log(mn)) / (π_min w_min α^2) ),    (7)

then, for μ̂ and q̂ obtained by iterating (4) and (5) (for at least one round), with probability at least 1 − δ:

(a) Letting ŷ_j = arg max_{l∈[k]} q̂_jl, we have that ŷ_j = y_j holds for all j ∈ [n].
(b) ‖μ̂_il − μ_il‖_2^2 ≤ 48 log(2mk/δ) / (π_i w_l n) holds for all (i, l) ∈ [m] × [k].

In Theorem 2, the assumption that all confusion matrix entries are lower bounded by ρ > 0 is somewhat restrictive. For datasets violating this assumption, we enforce positive confusion matrix entries by adding random noise: given any observed label z_ij, we replace it by a random label in {1, ..., k} with probability kρ. In this modified model, every entry of the confusion matrix is lower bounded by ρ, so that Theorem 2 holds.
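For concreteness, the E-step (4) and M-step (5) analyzed in Theorem 2 can be written in a few lines. This is a sketch assuming one-hot label arrays as in the problem setup; the eps guard against log 0 and the array names are our own additions.

```python
import numpy as np

def em_step(z, mu, eps=1e-12):
    """One EM iteration for the Dawid-Skene model under a uniform prior.
    z:  (m, n, k) one-hot worker labels (an all-zero row means missing).
    mu: (m, k, k) current confusion estimates, mu[i, l, c] = mu_ilc.
    Returns (q, mu_new): q[j, l] is the posterior of y_j = l (Eq. (4)),
    and mu_new maximizes Q (Eq. (5))."""
    # E-step: log q_jl  ∝  Σ_i Σ_c I(z_ij = e_c) log mu_ilc
    log_mu = np.log(mu + eps)                    # (m, k, k)
    s = np.einsum('ijc,ilc->jl', z, log_mu)      # (n, k)
    s -= s.max(axis=1, keepdims=True)            # numerical stability
    q = np.exp(s)
    q /= q.sum(axis=1, keepdims=True)
    # M-step: mu_ilc  ∝  Σ_j q_jl I(z_ij = e_c), normalized over c
    counts = np.einsum('jl,ijc->ilc', q, z)      # (m, k, k)
    mu_new = counts / (counts.sum(axis=2, keepdims=True) + eps)
    return q, mu_new
```

Iterating this map once from the Stage 1 initialization is exactly the "one round" regime covered by the theorem; iterating to convergence is what is done in the experiments below.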
The random noise makes the constant D smaller than its original value, but the change is minor for small ρ.

Dataset name  # classes  # items  # workers  # worker labels
Bird          2          108      39         4,212
RTE           2          800      164        8,000
TREC          2          19,033   762        88,385
Dog           4          807      52         7,354
Web           5          2,665    177        15,567

Table 1: Summary of datasets used in the real data experiment.

To see the consequence of the convergence analysis, we take the error rate ε in Theorem 1 equal to the constant α defined in Theorem 2, and combine the statements of the two theorems. This shows that if we choose the number of workers m and the number of items n such that

    m = Ω̃( 1/D )  and  n = Ω̃( k^5 / (π_min^2 w_min^2 σ_L^13 min{ρ^2, (ρD)^2}) ),    (8)

that is, if both m and n are lower bounded by a problem-specific constant and logarithmic terms, then with high probability the predictor ŷ will be perfectly accurate, and the estimator μ̂ will be bounded as ‖μ̂_il − μ_il‖_2^2 ≤ Õ(1/(π_i w_l n)). To show the optimality of this convergence rate, we present the following minimax lower bounds. Again, see [24] for the proof.

Theorem 3. There are universal constants c_1 > 0 and c_2 > 0 such that:

(a) For any {μ_ilc}, {π_i} and any number of items n, if the number of workers m ≤ 1/(4D), then

    inf_ŷ sup_{v∈[k]^n} E[ Σ_{j=1}^n I(ŷ_j ≠ y_j) | {μ_ilc}, {π_i}, y = v ] ≥ c_1 n.

(b) For any {w_l}, {π_i}, any worker-item pair (m, n) and any pair of indices (i, l) ∈ [m] × [k], we have

    inf_μ̂ sup_{μ∈R^{m×k×k}} E[ ‖μ̂_il − μ_il‖_2^2 | {w_l}, {π_i} ] ≥ c_2 min{1, 1/(π_i w_l n)}.

In part (a) of Theorem 3, we see that the number of workers should be at least 1/(4D); otherwise any predictor will make many mistakes. This lower bound matches our sufficient condition on the number of workers m (see Eq. (8)). In part (b), we see that the best possible estimate for μ_il has Ω(1/(π_i w_l n)) mean-squared error. This verifies the optimality of our estimator μ̂_il. It is worth noting that the constraint on the number of items n (see Eq. (8)) might be improvable. In real datasets we usually have n ≫ m, so that optimality in m is more important than in n.

It is worth contrasting our convergence rate with those of existing algorithms. Ghosh et al. [11] and Dalvi et al. [7] proposed consistent estimators for the binary one-coin model. To attain an error rate δ, their algorithms require m and n scaling with 1/δ^2, while our algorithm only requires m and n scaling with log(1/δ). Karger et al. [15, 14] proposed algorithms for both binary and multi-class problems. Their algorithm assumes that workers are assigned by a random regular graph. Moreover, their analysis assumes that the number of items goes to infinity, or that the number of workers is many times the number of items.
Our algorithm requires neither of these assumptions. We also compare our algorithm with the majority voting estimator, where the true label is simply estimated by a majority vote among workers. Gao and Zhou [10] showed that if there are many spammers and few experts, the majority voting estimator gives almost a random guess. In contrast, our algorithm only requires mD = Ω̃(1) to guarantee good performance. Since mD is the aggregated KL-divergence, a small number of experts is sufficient to ensure that it is large enough.

6 Experiments

In this section, we report the results of empirical studies comparing the algorithm we propose in Section 4 (referred to as Opt-D&S) with a variety of existing methods which are also based on the generative model of Dawid and Skene. Specifically, we compare to the Dawid-Skene estimator initialized by majority voting (referred to as MV-D&S), the pure majority voting estimator, the multi-class labeling algorithm proposed by Karger et al. [14] (referred to as KOS), the SVD-based algorithm proposed by Ghosh et al. [11] (referred to as Ghosh-SVD) and the "Eigenvalues of Ratio" algorithm proposed by Dalvi et al. [7] (referred to as EigenRatio). The evaluation is made on five real datasets.

Figure 1: Comparing MV-D&S and Opt-D&S with different thresholding parameter Δ on (a) RTE, (b) Dog and (c) Web. The label prediction error is plotted after the 1st EM update and after convergence.

        Opt-D&S  MV-D&S  Majority Voting  KOS    Ghosh-SVD  EigenRatio
Bird    10.09    11.11   24.07            11.11  27.78      27.78
RTE     7.12     7.12    10.31            39.75  49.13      9.00
TREC    29.80    30.02   34.86            51.96  42.99      43.96
Dog     16.89    16.66   19.58            31.72  –          –
Web     15.86    15.74   26.93            42.93  –          –

Table 2: Error rate (%) in predicting true labels on real data.

We compare the crowdsourcing algorithms on three binary tasks and two multi-class tasks.
Binary tasks include labeling bird species [22] (Bird dataset), recognizing textual entailment [21] (RTE dataset) and assessing the quality of documents in the TREC 2011 crowdsourcing track [16] (TREC dataset). Multi-class tasks include labeling the breed of dogs from ImageNet [9] (Dog dataset) and judging the relevance of web search results [26] (Web dataset). The statistics for the five datasets are summarized in Table 1. Since the Ghosh-SVD algorithm and the EigenRatio algorithm work only on binary tasks, they are evaluated only on the Bird, RTE and TREC datasets. For the MV-D&S and the Opt-D&S methods, we iterate their EM steps until convergence.

Since the entries of the confusion matrix are positive, we find it helpful to incorporate this prior knowledge into the initialization stage of the Opt-D&S algorithm. In particular, when estimating the confusion matrix entries by Eq. (3), we add an extra checking step before the normalization, examining whether the matrix components are greater than or equal to a small threshold Δ. Components that are smaller than Δ are reset to Δ. The default choice of the thresholding parameter is Δ = 10^{-6}. Later, we will compare the Opt-D&S algorithm with respect to different choices of Δ. It is important to note that this modification doesn't change our theoretical result, since the thresholding is not needed in the case that the initialization error is bounded by Theorem 1.

Table 2 summarizes the performance of each method. The MV-D&S and the Opt-D&S algorithms consistently outperform the other methods in predicting the true labels of items. The KOS algorithm, the Ghosh-SVD algorithm and the EigenRatio algorithm yield poorer performance, presumably due to the fact that they rely on idealized assumptions that are not met by the real data.
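The thresholding step used in the Opt-D&S initialization (resetting entries of the Eq. (3) estimate that fall below Δ, then normalizing the columns) amounts to the following sketch; the function name is ours.

```python
import numpy as np

DELTA = 1e-6  # default thresholding parameter reported in the experiments

def threshold_and_normalize(C, delta=DELTA):
    """Clamp entries of an estimated confusion matrix that are below
    delta up to delta, then rescale each column to sum to one (the
    extra checking step applied before the normalization in Eq. (3))."""
    C = np.maximum(C, delta)
    return C / C.sum(axis=0, keepdims=True)
```

Because the plug-in estimate in Eq. (3) can contain small or even negative entries in finite samples, this clamp keeps the EM initialization inside the probability simplex.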
In Figure 1, we compare the Opt-D&S algorithm under different thresholding parameters ∆ ∈ {10^-1, 10^-2, ..., 10^-6}. We plot results for three datasets (RTE, Dog, Web), where the performance of MV-D&S is equal to or slightly better than that of Opt-D&S. The plot shows that the performance of the Opt-D&S algorithm is stable after convergence. At the first EM iteration, however, the error rates are more sensitive to the choice of ∆, and a proper choice of ∆ makes Opt-D&S outperform MV-D&S. This suggests that a proper initialization combined with a single EM iteration is good enough for the purposes of prediction. In practice, the best choice of ∆ can be obtained by cross validation.

References

[1] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y.-K. Liu. A spectral algorithm for latent Dirichlet allocation. arXiv preprint arXiv:1204.6703, 2012.
[2] A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. A tensor spectral approach to learning mixed membership community models. In Annual Conference on Learning Theory, 2013.
[3] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. arXiv preprint arXiv:1210.7559, 2012.
[4] A. Anandkumar, D. Hsu, and S. M. Kakade.
A method of moments for mixture models and hidden Markov models. In Annual Conference on Learning Theory, 2012.
[5] A. T. Chaganty and P. Liang. Spectral experts for estimating mixtures of linear regressions. arXiv preprint arXiv:1306.3729, 2013.
[6] X. Chen, Q. Lin, and D. Zhou. Optimistic knowledge gradient policy for optimal budget allocation in crowdsourcing. In Proceedings of ICML, 2013.
[7] N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In Proceedings of the World Wide Web Conference, 2013.
[8] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society, Series C, pages 20–28, 1979.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE CVPR, 2009.
[10] C. Gao and D. Zhou. Minimax optimal convergence rates for estimating ground truth from crowdsourced labels. arXiv preprint arXiv:1310.5764, 2014.
[11] A. Ghosh, S. Kale, and P. McAfee. Who moderates the moderators? Crowdsourcing abuse detection in user-generated content. In Proceedings of the ACM Conference on Electronic Commerce, 2011.
[12] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
[13] P. Jain and S. Oh. Learning mixtures of discrete product distributions using spectral decompositions. arXiv preprint arXiv:1311.2972, 2013.
[14] D. R. Karger, S. Oh, and D. Shah. Efficient crowdsourcing for multi-class labeling. In ACM SIGMETRICS, 2013.
[15] D. R. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.
[16] M. Lease and G. Kazai. Overview of the TREC 2011 crowdsourcing track. In Proceedings of TREC 2011, 2011.
[17] E. Lehmann and G.
Casella. Theory of Point Estimation. Springer, 2nd edition, 2003.
[18] P. Liang. Partial information from spectral methods. NIPS Spectral Learning Workshop, 2013.
[19] Q. Liu, J. Peng, and A. T. Ihler. Variational inference for crowdsourcing. In NIPS, 2012.
[20] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.
[21] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP, 2008.
[22] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In NIPS, 2010.
[23] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. R. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS, 2009.
[24] Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. arXiv preprint arXiv:1406.3824, 2014.
[25] D. Zhou, Q. Liu, J. C. Platt, and C. Meek. Aggregating ordinal labels from crowds by minimax conditional entropy. In Proceedings of ICML, 2014.
[26] D. Zhou, J. C. Platt, S. Basu, and Y. Mao. Learning from the wisdom of crowds by minimax entropy. In NIPS, 2012.
[27] J. Zou, D. Hsu, D. Parkes, and R. Adams. Contrastive learning using spectral methods. In NIPS, 2013.