{"title": "Max-Margin Majority Voting for Learning from Crowds", "book": "Advances in Neural Information Processing Systems", "page_first": 1621, "page_last": 1629, "abstract": "Learning-from-crowds aims to design proper aggregation strategies to infer the unknown true labels from the noisy labels provided by ordinary web workers. This paper presents max-margin majority voting (M^3V) to improve the discriminative ability of majority voting and further presents a Bayesian generalization to incorporate the flexibility of generative methods on modeling noisy observations with worker confusion matrices. We formulate the joint learning as a regularized Bayesian inference problem, where the posterior regularization is derived by maximizing the margin between the aggregated score of a potential true label and that of any alternative label. Our Bayesian model naturally covers the Dawid-Skene estimator and M^3V. Empirical results demonstrate that our methods are competitive, often achieving better results than state-of-the-art estimators.", "full_text": "Max-Margin Majority Voting for Learning from Crowds\n\nTian Tian, Jun Zhu\n\nDepartment of Computer Science & Technology; Center for Bio-Inspired Computing Research\nTsinghua National Lab for Information Science & Technology\nState Key Lab of Intelligent Technology & Systems; Tsinghua University, Beijing 100084, China\ntiant13@mails.tsinghua.edu.cn; dcszj@tsinghua.edu.cn\n\nAbstract\n\nLearning-from-crowds aims to design proper aggregation strategies to infer the unknown true labels from the noisy labels provided by ordinary web workers. This paper presents max-margin majority voting (M3V) to improve the discriminative ability of majority voting and further presents a Bayesian generalization to incorporate the flexibility of generative methods on modeling noisy observations with worker confusion matrices. 
We formulate the joint learning as a regularized Bayesian inference problem, where the posterior regularization is derived by maximizing the margin between the aggregated score of a potential true label and that of any alternative label. Our Bayesian model naturally covers the Dawid-Skene estimator and M3V. Empirical results demonstrate that our methods are competitive, often achieving better results than state-of-the-art estimators.\n\n1 Introduction\n\nMany learning tasks require labeling large datasets. Though reliable, it is often too expensive and time-consuming to collect labels from domain experts or well-trained workers. Recently, online crowdsourcing platforms have dramatically decreased the labeling cost by dividing the workload into small parts and distributing micro-tasks to a crowd of ordinary web workers [17, 20]. However, the labeling accuracy of web workers can be lower than expected due to their varied backgrounds or lack of knowledge. To improve the accuracy, it is usually suggested to have every task labeled multiple times by different workers; the redundant labels then provide hints for resolving the true labels.\n\nMuch progress has been made in designing effective aggregation mechanisms to infer the true labels from noisy observations. From a modeling perspective, existing work includes both generative and discriminative approaches. A generative method builds a flexible probabilistic model for generating the noisy observations conditioned on the unknown true labels and some behavior assumptions, with examples including the Dawid-Skene (DS) estimator [5], the minimax entropy (Entropy) estimator [24, 25] (a maximum entropy estimator can be understood as a dual of the MLE of a probabilistic model [6]), and their variants. In contrast, a discriminative approach does not model the observations; it directly identifies the true labels via some aggregation rules. Examples include majority voting and weighted majority voting, which takes worker reliability into consideration [10, 11].\n\nIn this paper, we present a max-margin formulation of the most popular majority voting estimator to improve its discriminative ability, and further present a Bayesian generalization that conjoins the advantages of both generative and discriminative approaches. The max-margin majority voting (M3V) directly maximizes the margin between the aggregated score of a potential true label and that of any alternative label, and the Bayesian model consists of a flexible probabilistic model that generates the noisy observations by conditioning on the unknown true labels. We adopt the same approach as the classical Dawid-Skene estimator to build the probabilistic model by considering worker confusion matrices, though many other generative models are also possible. Then, we strongly couple the generative model and M3V by formulating a joint learning problem under the regularized Bayesian inference (RegBayes) [27] framework, where the posterior regularization [7] enforces a large margin between the potential true label and any alternative label. Naturally, our Bayesian model covers both the Dawid-Skene estimator and M3V as special cases by setting the regularization parameter to its extreme values (i.e., 0 or ∞). We investigate two choices for defining the max-margin posterior regularization: (1) an averaging model with a variational inference algorithm; and (2) a Gibbs model with a Gibbs sampler under a data augmentation formulation. The averaging version can be seen as an extension of the MLE learner of the Dawid-Skene model. 
Experiments on real datasets suggest that max-margin learning can significantly improve the accuracy of majority voting, and that our Bayesian estimators are competitive, often achieving better results than state-of-the-art estimators on true label estimation tasks.\n\n2 Preliminary\n\nWe consider the label aggregation problem with a dataset consisting of M items (e.g., pictures or paragraphs). Each item i has an unknown true label yi ∈ [D], where [D] := {1, . . . , D}. The task ti is to label item i. In crowdsourcing, we have N workers assigning labels to these items. Each worker may only label a part of the dataset. Let Ii ⊆ [N] denote the workers who have done task ti. We use xij to denote the label of ti provided by worker j, xi to denote the labels provided to task ti, and X to denote the collection of these worker labels, which is an incomplete matrix. The goal of learning-from-crowds is to estimate the true labels of the items from the noisy observations X.\n\n2.1 Majority Voting Estimator\n\nMajority voting (MV) is arguably the simplest method. It posits that for every task the true label is the one most commonly given. Thus, it selects the most frequent label for each task as its true label, by solving the problem:\n\nŷi = argmax_{d∈[D]} Σ_{j=1}^{N} I(xij = d), ∀i ∈ [M],   (1)\n\nwhere I(·) is an indicator function: it equals 1 whenever the predicate is true, and 0 otherwise. Previous work has extended this method to weighted majority voting (WMV) by putting different weights on workers to measure worker reliability [10, 11].\n\n2.2 Dawid-Skene Estimator\n\nThe method of Dawid and Skene [5] is a generative approach that considers worker confusability. It posits that the performance of a worker is consistent across different tasks, as measured by a confusion matrix whose diagonal entries denote the probability of assigning correct labels, while off-diagonal entries denote the probability of making specific mistakes that label items of one category as another. Formally, let φ_j be the confusion matrix of worker j. Then, φ_jkd denotes the probability that worker j assigns label d to an item whose true label is k. Under the basic assumption that workers finish each task independently, the likelihood of the observed labels can be expressed as\n\np(X|Φ, y) = Π_{i=1}^{M} Π_{j=1}^{N} Π_{k,d=1}^{D} φ_jkd^{n^i_jkd} = Π_{j=1}^{N} Π_{k,d=1}^{D} φ_jkd^{n_jkd},   (2)\n\nwhere n^i_jkd = I(xij = d, yi = k), and n_jkd = Σ_{i=1}^{M} n^i_jkd is the number of tasks with true label k but labeled as d by worker j.\n\nThe unknown labels and parameters can be estimated by maximum-likelihood estimation (MLE), {ŷ, Φ̂} = argmax_{y,Φ} log p(X|Φ, y), via an expectation-maximization (EM) algorithm that iteratively updates the true labels y and the parameters Φ. The learning procedure is often initialized by majority voting to avoid bad local optima. If we assume some structure on the confusion matrix, various variants of the DS estimator are obtained, including the homogeneous DS model [15] and the class-conditional DS model [11]. 
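For concreteness, the EM iteration described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function and variable names are hypothetical, it uses a soft E-step (rather than the hard argmax over y), assumes a uniform class prior, and smooths the M-step only with a small floor.

```python
import numpy as np

def dawid_skene_em(X, D, n_iters=50, eps=1e-9):
    """Minimal Dawid-Skene EM sketch on a dense label matrix.

    X: (M, N) int array; X[i, j] in {0..D-1}, or -1 if worker j skipped item i.
    Assumes every item received at least one label.
    Returns hard label estimates and worker confusion matrices.
    """
    M, N = X.shape
    # Initialize soft labels q[i, k] by (normalized) majority voting.
    q = np.zeros((M, D))
    for i in range(M):
        for j in range(N):
            if X[i, j] >= 0:
                q[i, X[i, j]] += 1.0
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # M-step: confusion matrices phi[j, k, d] from expected counts.
        phi = np.full((N, D, D), eps)
        for i in range(M):
            for j in range(N):
                if X[i, j] >= 0:
                    phi[j, :, X[i, j]] += q[i]
        phi /= phi.sum(axis=2, keepdims=True)

        # E-step: posterior over true labels under the current phi.
        logq = np.zeros((M, D))
        for i in range(M):
            for j in range(N):
                if X[i, j] >= 0:
                    logq[i] += np.log(phi[j, :, X[i, j]])
        logq -= logq.max(axis=1, keepdims=True)
        q = np.exp(logq)
        q /= q.sum(axis=1, keepdims=True)

    return q.argmax(axis=1), phi
```

Because the E-step is seeded by majority voting, the estimator typically stays anchored to the majority-voting labeling of the classes, which is why the initialization matters for avoiding bad local optima.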
We can also put a prior over worker confusion matrices and transform the inference into a standard inference problem in graphical models [12]. Recently, spectral methods have also been applied to better initialize the DS model [23].\n\n3 Max-Margin Majority Voting\n\nMajority voting is a discriminative model that directly finds the most likely label for each item. In this section, we present max-margin majority voting (M3V), a novel extension of (weighted) majority voting with a new notion of margin (named the crowdsourcing margin).\n\n3.1 Geometric Interpretation of the Crowdsourcing Margin\n\nLet g(xi, d) be an N-dimensional vector, with element j equaling I(j ∈ Ii, xij = d). Then, the estimation of the vanilla majority voting in Eq. (1) can be formulated as finding solutions {yi}_{i∈[M]} that satisfy the following constraints:\n\n1_N^⊤ g(xi, yi) − 1_N^⊤ g(xi, d) ≥ 0, ∀i, d,   (3)\n\nwhere 1_N is the N-dimensional all-one vector and 1_N^⊤ g(xi, k) is the aggregated score of the potential true label k for task ti. By using the all-one vector, the aggregated score has an intuitive interpretation: it denotes the number of workers who have labeled ti as class k.\n\nFigure 1: A geometric interpretation of the crowdsourcing margin.\n\nApparently, the all-one vector treats all workers equally, which may be unrealistic in practice due to the varied backgrounds of the workers. By simply choosing what the majority of workers agree on, the vanilla MV is prone to errors when many workers give low quality labels. One way to tackle this problem is to take worker reliability into consideration. Let η denote the worker weights. When these values are known, we can get the aggregated score η^⊤ g(xi, k) of a weighted majority voting (WMV), and estimate the true labels by the rule ŷi = argmax_{d∈[D]} η^⊤ g(xi, d). 
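As a tiny worked example of these aggregated scores (illustrative only; the votes and weights below are made up):

```python
# Vote vectors g(x_i, d) for one item with N = 3 workers and D = 2 classes.
# Worker labels for the item: workers 0 and 1 say class 0, worker 2 says class 1.
x_i = [0, 0, 1]
N, D = 3, 2

def g(d):
    # Entry j is 1 iff worker j labeled the item as class d.
    return [1 if x_i[j] == d else 0 for j in range(N)]

# Vanilla MV uses all-one weights, so the score just counts votes per class.
ones = [1.0, 1.0, 1.0]
mv_scores = [sum(w * v for w, v in zip(ones, g(d))) for d in range(D)]

# WMV with learned reliabilities: down-weighting workers 0 and 1
# can overturn the raw majority.
eta = [0.2, 0.2, 1.0]
wmv_scores = [sum(w * v for w, v in zip(eta, g(d))) for d in range(D)]
```

Here MV scores the classes 2 votes to 1, while the weighted scores reverse the decision, which is exactly the effect of letting reliable workers contribute more.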
Thus, reliable workers contribute more to the decisions.\n\nGeometrically, g(xi, d) is a point in the N-dimensional space for each task ti. The aggregated score 1_N^⊤ g(xi, d) measures the distance (up to a constant scaling) from this point to the hyperplane 1_N^⊤ x = 0. So the MV estimator actually finds, for each task, the point that has the largest distance to that hyperplane, and the decision boundary of majority voting is another hyperplane 1_N^⊤ x − b = 0 which separates the point g(xi, ŷi) from the other points g(xi, k), k ≠ ŷi. By introducing the worker weights η, we relax the constraint of the all-one vector to allow for more flexible decision boundaries η^⊤ x − b = 0. All the possible decision boundaries with the same orientation are equivalent. Inspired by the generalized notion of margin in multi-class SVM [4], we define the crowdsourcing margin as the minimal difference between the aggregated score of the potential true label and the aggregated scores of the other alternative labels. Then, one reasonable choice of the best hyperplane (i.e., η) is the one that yields the largest margin between the potential true label and the other alternatives.\n\nFig. 1 provides an illustration of the crowdsourcing margin for WMV with D = 3 and N = 2, where each axis represents the label of a worker. Assume that the two workers provide labels 3 and 1 to item i. Then, the vectors g(xi, y), y ∈ [3] are three points in the 2D plane. Given the worker weights η, the estimated label should be 1, since g(xi, 1) has the largest distance to line P0. Line P1 and line P2 are two boundaries that separate g(xi, 1) from the other points. The margin is the distance between them. In this case, g(xi, 1) and g(xi, 3) are the support vectors that decide the margin.\n\n3.2 Max-Margin Majority Voting Estimator\n\nLet ℓ be the minimum margin between the potential true label and all other alternatives. 
We define the max-margin majority voting (M3V) as solving the following constrained optimization problem to estimate the true labels y and weights η:\n\ninf_{η,y} (1/2)‖η‖₂²,  s.t.: η^⊤ gΔ_i(d) ≥ ℓΔ_i(d), ∀i ∈ [M], d ∈ [D],   (4)\n\nwhere gΔ_i(d) := g(xi, yi) − g(xi, d) and ℓΔ_i(d) = ℓ I(yi ≠ d). (The offset b is canceled out in the margin constraints.) In practice, the worker labels are often linearly inseparable by a single hyperplane. Therefore, we relax the hard constraints by introducing non-negative slack variables {ξi}_{i=1}^{M}, one for each task, and define the soft-margin max-margin majority voting as\n\ninf_{ξi≥0, η, y} (1/2)‖η‖₂² + c Σ_i ξi,  s.t.: η^⊤ gΔ_i(d) ≥ ℓΔ_i(d) − ξi, ∀i ∈ [M], d ∈ [D],   (5)\n\nwhere c is a positive regularization parameter and ℓ − ξi is the soft margin for task ti. The value of ξi reflects the difficulty of task ti: a small ξi suggests a large discriminant margin, indicating that the task is easy with a rare chance of mistakes, while a large ξi suggests that the task is hard with a higher chance of mistakes. Note that our max-margin majority voting is significantly different from unsupervised SVMs (or max-margin clustering) [21], which aim to assign cluster labels to data points by maximizing a different notion of margin, with balance constraints to avoid trivial solutions. Our M3V does not need such balance constraints.\n\nAlbeit not jointly convex, problem (5) can be solved by iteratively updating η and y to find a local optimum. For η, the subproblem is convex, and the solution can be derived as η = Σ_{i=1}^{M} Σ_{d=1}^{D} ω_i^d gΔ_i(d), where the parameters ω are obtained by solving the dual problem\n\nsup_{0≤ω_i^d≤c} −(1/2) η^⊤η + Σ_i Σ_d ω_i^d ℓΔ_i(d),   (6)\n\nwhich is exactly the QP dual problem in a standard SVM [4]. So it can be efficiently solved by well-developed SVM solvers like LIBSVM [2]. For updating y, we define (x)₊ := max(0, x); the update is then a weighted majority voting with a margin-gap constraint:\n\nŷi = argmax_{yi∈[D]} ( −c max_{d∈[D]} (ℓΔ_i(d) − η̂^⊤ gΔ_i(d))₊ ).   (7)\n\nOverall, the algorithm is a max-margin iterative weighted majority voting (MM-IWMV). Compared with iterative weighted majority voting (IWMV) [11], which tends to maximize the expected gap of the aggregated scores under the homogeneous DS model, our M3V directly maximizes the data-specified margin without further assumptions on a data model. Empirically, as we shall see, our M3V can have more powerful discriminative ability, with better accuracy than IWMV.\n\n4 Bayesian Max-Margin Estimator\n\nWith the intuitive and simple max-margin principle, we now present a more sophisticated Bayesian max-margin estimator, which conjoins the discriminative ability of M3V and the flexibility of the generative DS estimator. 
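As an aside, the alternating scheme of Section 3.2 can be sketched in code. The sketch below is an illustrative simplification under stated assumptions, not the authors' implementation: it replaces the SVM dual solve of problem (6) with plain subgradient descent on the equivalent regularized hinge objective, and all function and parameter names are hypothetical.

```python
import numpy as np

def g(x_i, d, N):
    """Vote vector: entry j is 1 iff worker j labeled this item as d."""
    return np.array([1.0 if x_i[j] == d else 0.0 for j in range(N)])

def m3v(X, D, ell=1.0, c=1.0, outer=10, eta_steps=200, lr=0.01):
    """Toy M3V-style alternating optimization on a dense (M, N) label matrix."""
    M, N = X.shape
    # Initialize y by majority voting, as in the paper.
    y = np.array([np.argmax([(X[i] == d).sum() for d in range(D)]) for i in range(M)])
    eta = np.ones(N)  # start from the all-one (vanilla MV) weights
    for _ in range(outer):
        # eta-step: subgradient descent on 0.5||eta||^2 + c * sum_i hinge_i,
        # a stand-in for solving the SVM dual of problem (6).
        for _ in range(eta_steps):
            grad = eta.copy()
            for i in range(M):
                margins = [ell * (y[i] != d)
                           - eta @ (g(X[i], y[i], N) - g(X[i], d, N))
                           for d in range(D)]
                d_star = int(np.argmax(margins))
                if margins[d_star] > 0:  # violated margin contributes a subgradient
                    grad -= c * (g(X[i], y[i], N) - g(X[i], d_star, N))
            eta -= lr * grad
        # y-step: weighted majority voting with the margin-gap penalty of rule (7).
        for i in range(M):
            def score(k):
                hinge = max(ell * (k != d)
                            - eta @ (g(X[i], k, N) - g(X[i], d, N))
                            for d in range(D))
                return -c * max(hinge, 0.0)
            y[i] = max(range(D), key=score)
    return y, eta
```

On a toy instance with one consistently adversarial worker, this sketch assigns that worker a lower (here, negative) weight than the reliable workers, which is the qualitative behavior the margin formulation is after.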
Though slightly more complicated in learning and inference, the Bayesian models retain the intuitive simplicity of M3V and the flexibility of DS, as explained below.\n\n4.1 Model Definition\n\nWe adopt the same DS model to generate observations conditioned on confusion matrices, with the full likelihood in Eq. (2). We further impose a prior p0(Φ, η) for Bayesian inference. Assuming that the true labels y are given, we aim to get the target posterior p(Φ, η|X, y), which can be obtained by solving an optimization problem:\n\ninf_{q(Φ,η)} L(q(Φ, η); y),   (8)\n\nwhere L(q; y) := KL(q ‖ p0(Φ, η)) − E_q[log p(X|Φ, y)] measures the Kullback-Leibler (KL) divergence between a desired post-data posterior q and the original Bayesian posterior, and p0(Φ, η) is the prior, often factorized as p0(Φ)p0(η). As we shall see, this Bayesian DS estimator often leads to better performance than the vanilla DS.\n\nThen, we explore the ideas of regularized Bayesian inference (RegBayes) [27] to incorporate the max-margin majority voting constraints as posterior regularization on problem (8), and define the Bayesian max-margin estimator (denoted by CrowdSVM) as solving:\n\ninf_{ξi≥0, q∈P, y} L(q(Φ, η); y) + c Σ_i ξi,  s.t.: E_q[η^⊤ gΔ_i(d)] ≥ ℓΔ_i(d) − ξi, ∀i ∈ [M], d ∈ [D],   (9)\n\nwhere P is the probabilistic simplex, and we take the expectation over q to define the margin constraints. Such posterior constraints will influence the estimates of y and Φ to get better aggregation, as we shall see. We use a Dirichlet prior on worker confusion matrices, φjk|α ∼ Dir(α), and a spherical Gaussian prior on η, η ∼ N(0, vI). 
By absorbing the slack variables, CrowdSVM solves the equivalent unconstrained problem:\n\ninf_{q∈P, y} L(q(Φ, η); y) + c · R_m(q(Φ, η); y),   (10)\n\nwhere R_m(q; y) = Σ_{i=1}^{M} max_{d∈[D]} (ℓΔ_i(d) − E_q[η^⊤ gΔ_i(d)])₊ is the posterior regularization.\n\nRemark 1. From the above definition, we can see that both the Bayesian DS estimator and the max-margin majority voting are special cases of CrowdSVM. Specifically, when c → 0, it is equivalent to the DS model. If we set v = v′/c for some positive parameter v′, then when c → ∞ CrowdSVM reduces to the max-margin majority voting.\n\n4.2 Variational Inference\n\nAlgorithm 1: The CrowdSVM algorithm\n1. Initialize y by majority voting.\nwhile Not converged do\n  2. For each worker j and category k: q(φjk) ← Dir(njk + α).\n  3. Solve the dual problem (11).\n  4. For each item i: ŷi ← argmax_{yi∈[D]} f(yi, xi; q).\nend\n\nSince it is intractable to directly solve problem (9) or (10), we introduce the structured mean-field assumption on the post-data posterior, q(Φ, η) = q(Φ)q(η), and solve the problem by alternating minimization as outlined in Alg. 1. The algorithm iteratively performs the following steps until a local optimum is reached:\n\nInfer q(Φ): Fixing the distribution q(η) and the true labels y, the problem in Eq. (9) turns into a standard Bayesian inference problem with the closed-form solution q*(Φ) ∝ p0(Φ)p(X|Φ, y). Since the prior is a Dirichlet distribution, the inferred distribution is also Dirichlet, q*(φjk) = Dir(njk + α), where njk is a D-dimensional vector with element d being njkd.\n\nInfer q(η) and solve for ω: Fixing the distribution q(Φ) and the true labels y, we optimize Eq. (9) over q(η), which is also convex. We can derive the optimal solution: q*(η) ∝ p0(η) exp(η^⊤ Σ_i Σ_d ω_i^d gΔ_i(d)), where ω = {ω_i^d} are Lagrange multipliers. With the normal prior p0(η) = N(0, vI), the posterior is a normal distribution q*(η) = N(µ, vI), whose mean is µ = v Σ_{i=1}^{M} Σ_{d=1}^{D} ω_i^d gΔ_i(d). The parameters ω are then obtained by solving the dual problem\n\nsup_{0≤ω_i^d≤c} −(1/2v) µ^⊤µ + Σ_i Σ_d ω_i^d ℓΔ_i(d),   (11)\n\nwhich is the same as problem (6) in max-margin majority voting.\n\nInfer y: Fixing the distributions of Φ and η at their optimum q*, we find y by solving problem (10). To make the prediction more efficient, we approximate the distribution q*(Φ) by a Dirac delta mass δ(Φ − Φ̂), where Φ̂ is the mean of q*(Φ). Then, since all tasks are independent, we can derive the discriminant function of yi as\n\nf(yi, xi; q*) = log p(xi|Φ̂, yi) − c max_{d∈[D]} (ℓΔ_i(d) − µ̂^⊤ gΔ_i(d))₊,   (12)\n\nwhere µ̂ is the mean of q*(η). Then we can make predictions by maximizing this function.\n\nApparently, the discriminant function (12) represents a strong coupling between the generative model and the discriminative margin constraints. Therefore, CrowdSVM jointly considers these two factors when estimating true labels. We also note that the estimation rule used here reduces to rule (7) of MM-IWMV by simply setting c = ∞.\n\n5 Gibbs CrowdSVM Estimator\n\nCrowdSVM adopts an averaging model to define the posterior constraints in problem (9). 
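Looking back at Alg. 1, step 2 is a simple counting update: with a Dir(α) prior, q*(φjk) = Dir(njk + α), whose mean is a smoothed empirical confusion matrix. A minimal sketch (hypothetical names; dense labels with −1 marking missing entries):

```python
import numpy as np

def update_confusion_posterior(X, y, D, alpha=1.0):
    """q*(phi_jk) = Dir(n_jk + alpha): return posterior-mean confusion matrices.

    X: (M, N) worker labels, -1 for missing; y: current true-label estimates.
    """
    M, N = X.shape
    counts = np.full((N, D, D), alpha)  # pseudo-counts from the Dirichlet prior
    for i in range(M):
        for j in range(N):
            if X[i, j] >= 0:
                counts[j, y[i], X[i, j]] += 1.0
    return counts / counts.sum(axis=2, keepdims=True)  # Dirichlet means
```

Each row of each worker's matrix is a proper distribution over the D observed labels, and the prior pseudo-counts keep rows well-defined even for label pairs a worker never produced.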
Here, we further provide an alternative strategy, which leads to a full Bayesian model with a Gibbs sampler. The resulting Gibbs-CrowdSVM does not need the mean-field assumption.\n\n5.1 Model Definition\n\nSuppose the target posterior q(Φ, η) is given; we perform the max-margin majority voting by drawing a random sample η. This leads to the crowdsourcing hinge loss\n\nR(η, y) = Σ_{i=1}^{M} max_{d∈[D]} (ℓΔ_i(d) − η^⊤ gΔ_i(d))₊,   (13)\n\nwhich is a function of η. Since η is random, we define the overall hinge loss as the expectation over q(η), that is, R′_m(q(Φ, η); y) = E_q[R(η, y)]. Due to the convexity of the max function, the expected loss is in fact an upper bound of the average loss, i.e., R′_m(q(Φ, η); y) ≥ R_m(q(Φ, η); y). Differing from CrowdSVM, we also treat the hidden true labels y as random variables with a uniform prior. Then we define Gibbs-CrowdSVM as solving the problem:\n\ninf_{q∈P} L(q(Φ, η, y)) + E_q[Σ_{i=1}^{M} 2c(ζ_{i s_i})₊],   (14)\n\nwhere ζ_{id} = ℓΔ_i(d) − η^⊤ gΔ_i(d), s_i = argmax_{d≠yi} ζ_{id}, and the factor 2 is introduced for simplicity.\n\nData Augmentation: In order to build an efficient Gibbs sampler for this problem, we derive the posterior distribution with data augmentation [3, 26] for the max-margin regularization term. We let ψ(yi|xi, η) = exp(−2c(ζ_{i s_i})₊) represent the regularizer. According to the equality ψ(yi|xi, η) = ∫_0^∞ ψ(yi, λi|xi, η) dλi, where ψ(yi, λi|xi, η) = (2πλi)^{−1/2} exp(−(λi + cζ_{i s_i})²/(2λi)) is an (unnormalized) joint distribution of yi and the augmented variable λi [14], the posterior of Gibbs-CrowdSVM can be expressed as the marginal of a higher dimensional distribution, i.e., q(Φ, η, y) = ∫ q(Φ, η, y, λ) dλ, where\n\nq(Φ, η, y, λ) ∝ p0(Φ, η, y) Π_{i=1}^{M} p(xi|Φ, yi) ψ(yi, λi|xi, η).   (15)\n\nPutting the last two terms together, we can view q(Φ, η, y, λ) as a standard Bayesian posterior, but with the unnormalized likelihood p̃(xi, λi|Φ, η, yi) ∝ p(xi|Φ, yi)ψ(yi, λi|xi, η), which jointly considers the noisy observations and the large-margin discrimination between the potential true labels and the alternatives.\n\n5.2 Posterior Inference\n\nWith the augmented representation, we can run Gibbs sampling to infer the posterior distribution q(Φ, η, y, λ), and thus q(Φ, η, y) by discarding λ. The conditional distributions for {Φ, η, λ, y} are derived in Appendix A. Note that when sampling λ from the inverse Gaussian distribution, a fast sampling algorithm [13] can be applied with O(1) time complexity. For the hidden variables y, we initially set them to the results of majority voting. 
After removing burn-in samples, we use their most frequent values as the final outputs.\n\n6 Experiments\n\nWe now present experimental results to demonstrate the strong discriminative ability of max-margin majority voting and the promise of our Bayesian models, by comparing with various strong competitors on multiple real datasets.\n\n6.1 Datasets and Setups\n\nWe use four real-world crowd labeling datasets, as summarized in Table 1. Web Search [24]: 177 workers are asked to rate a set of 2,665 query-URL pairs on a relevance rating scale from 1 to 5. In total 15,567 labels are collected, and each task is labeled by 6 workers on average. Age [8]: It consists of 10,020 labels of age estimations for 1,002 face images. Each image was labeled by 10 workers, and there are 165 workers involved in these tasks. The final estimations are discretized into 7 bins. Bluebirds [19]: It consists of 108 bluebird pictures. There are 2 breeds among all the images, and each image is labeled by all 39 workers, giving 4,214 labels in total. Flowers [18]: It contains 2,366 binary labels for a dataset with 200 flower pictures. Each worker is asked to answer whether the flower in the picture is a peach flower. 36 workers participate in these tasks.\n\nTable 1: Datasets Overview.\n\nDATASET | ITEMS | WORKERS | LABELS\nWEB SEARCH | 2,665 | 177 | 15,567\nAGE | 1,002 | 165 | 10,020\nBLUEBIRDS | 108 | 39 | 4,214\nFLOWERS | 200 | 36 | 2,366\n\nWe compare M3V, as well as its Bayesian extensions CrowdSVM and Gibbs-CrowdSVM, with various baselines, including majority voting (MV), iterative weighted majority voting (IWMV) [11], the Dawid-Skene (DS) estimator [5], and the minimax entropy (Entropy) estimator [25]. For the Entropy estimator, we use the implementation provided by the authors, and show the performance of both its multiclass version (Entropy (M)) and its ordinal version (Entropy (O)). All the estimators that require iterative updating are initialized by majority voting to avoid bad local minima. All experiments were conducted on a PC with an Intel Core i5 3.00GHz CPU and 12.00GB RAM.\n\n6.2 Model Selection\n\nDue to the special properties of crowdsourcing, we cannot simply split the training data into multiple folds to cross-validate the hyperparameters using accuracy as the selection criterion, which may bias toward over-optimistic models. Instead, we adopt the likelihood p(X|Φ̂, ŷ) as the criterion to select parameters, which is indirectly related to our evaluation criterion (i.e., accuracy). Specifically, we test multiple values of c and ℓ, and select the values that produce the model with the maximal likelihood on the given dataset. This method lets us select a model without any prior knowledge of the true labels. For the special case of M3V, we fix the learned true labels y after training the model with certain parameters, and learn the confusion matrices that optimize the full likelihood in Eq. (2). Note that the likelihood-based cross-validation strategy [25] is not suitable for CrowdSVM, because this strategy uses the marginal likelihood p(X|Φ) to select the model and ignores the label information of y, through which the effect of the constraints is passed for CrowdSVM. If we used this strategy for CrowdSVM, it would tend to optimize the generative component without considering the discriminant constraints, thus resulting in c → 0, which is a trivial solution for model selection.\n\n6.3 Experimental Results\n\nWe first test our estimators on the task of estimating true labels. For CrowdSVM, we set α = 1 and v = 1 in all experiments, since we find that the results are insensitive to them. For M3V, CrowdSVM and Gibbs-CrowdSVM, the regularization parameters (c, ℓ) are selected from c = 2^[−8 : 0] and ℓ ∈ {1, 3, 5} by the method in Sec. 6.2. 
As for Gibbs-CrowdSVM, we generate 50 samples in each run and discard the first 10 samples as burn-in steps, which is sufficiently long for the likelihood to converge. The reported error rate is the average over 5 runs.\n\nTable 2 presents the error rates of the various estimators. We group the comparisons into three parts:\n\nI. MV, IWMV and M3V are all purely discriminative estimators. We can see that our M3V produces consistently lower error rates on all four datasets compared with the vanilla MV and IWMV, which shows the effectiveness of the max-margin principle for crowdsourcing;\n\nII. This part analyzes the effects of the prior and the max-margin regularization on improving the DS model. We can see that DS+Prior is better than the vanilla DS model on the two larger datasets, thanks to the Dirichlet prior. Furthermore, CrowdSVM consistently improves on the performance of DS+Prior by considering the max-margin constraints, again demonstrating the effectiveness of max-margin learning;\n\nIII. This part compares our Gibbs-CrowdSVM estimator to the state-of-the-art minimax entropy estimators. We can see that Gibbs-CrowdSVM performs better than CrowdSVM on the Web Search, Age and Flowers datasets, while worse on the small Bluebirds dataset. It is also comparable to the minimax entropy estimators, sometimes better and with faster running speed, as shown in Fig. 2 and explained below. Note that we only test Entropy (O) on the two ordinal datasets, since this method is specifically designed for ordinal labels, while not always effective.\n\nFig. 2 summarizes the training time and error rates after each iteration for all estimators on the largest Web Search dataset. It shows that the discriminative methods (e.g., IWMV and M3V) run fast but converge to high error rates. 
Compared to the minimax entropy estimator, CrowdSVM is computationally more efficient and also converges to a lower error rate.\n\nTable 2: Error-rates (%) of different estimators on four datasets.\n\nMETHODS | WEB SEARCH | AGE | BLUEBIRDS | FLOWERS\nI: MV | 26.90 | 34.88 | 24.07 | 22.00\nI: IWMV | 15.04 | 34.53 | 27.78 | 19.00\nI: M3V | 12.74 | 33.33 | 20.37 | 13.50\nII: DS | 16.92 | 39.62 | 10.19 | 13.00\nII: DS+PRIOR | 13.26 | 34.53 | 10.19 | 13.50\nII: CROWDSVM | 9.42 | 33.33 | 10.19 | 13.50\nIII: ENTROPY (M) | 11.10 | 31.14 | 8.33 | 13.00\nIII: ENTROPY (O) | 10.40 | 37.32 | − | −\nIII: G-CROWDSVM | 7.99 ± 0.26 | 32.98 ± 0.36 | 10.37 ± 0.41 | 12.10 ± 1.07\n\nFigure 2: Error rates per iteration of various estimators on the web search dataset.\n\nGibbs-CrowdSVM runs slower than CrowdSVM since it needs to compute matrix inversions. The performance of the DS estimator seems mediocre: its estimation error rate is large and slowly increases when it runs longer. Perhaps this is partly because the DS estimator cannot make good use of the initial knowledge provided by majority voting.\n\nWe further investigate the effectiveness of the generative component and the discriminative component of CrowdSVM, again on the largest Web Search dataset. For the generative part, we compared CrowdSVM (c = 0.125, ℓ = 3) with DS and M3V (c = 0.125, ℓ = 3). Fig. 3(a) compares the negative log likelihoods (NLL) of these models, computed with Eq. (2). For M3V, we fix its estimated true labels and find the confusion matrices that optimize the likelihood. The results show that CrowdSVM achieves a lower NLL than DS; this suggests that by incorporating the M3V constraints, CrowdSVM finds a better solution of the true labels, as well as of the confusion matrices, than that found by the original EM algorithm. 
For the discriminative part, we use the mean of the worker weights, μ̂, to estimate the true labels as y_i = argmax_{d ∈ [D]} μ̂ᵀg(x_i, d), and show the error rates in Fig. 3(b). The weights learned by CrowdSVM are clearly better than those learned by the other MV estimators. Overall, these results suggest that CrowdSVM achieves a good balance between generative modeling and discriminative prediction.

Figure 3: NLLs and ERs when separately testing the generative and discriminative components: (a) NLLs; (b) error rates.

7 Conclusions and Future Work

We present a simple and intuitive max-margin majority voting estimator for learning-from-crowds, as well as its Bayesian extension that conjoins generative modeling and discriminative prediction. By formulating the problem as regularized Bayesian inference, our methods naturally cover the classical Dawid-Skene estimator. Empirical results demonstrate the effectiveness of our methods. Our model is flexible enough to fit specific, more complicated application scenarios [22]. One salient feature of Bayesian methods is that they support sequential updating. We can extend our Bayesian estimators to the online setting, where crowdsourcing labels are collected in a stream and more tasks are distributed; we have some preliminary results, as shown in Appendix B. It would also be interesting to investigate active learning further, such as selecting reliable workers to reduce costs [9].

Acknowledgments

The work was supported by the National Basic Research Program (973 Program) of China (Nos. 
20121088071, 20141080934).

References

[1] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[2] C. C. Chang and C. J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
[3] C. Chen, J. Zhu, and X. Zhang. Robust Bayesian max-margin clustering. In NIPS, 2014.
[4] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2002.
[5] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.
[6] M. Dudík, S. J. Phillips, and R. E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. JMLR, 8(6), 2007.
[7] K. Ganchev, J. Graça, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. JMLR, 11:2001–2049, 2010.
[8] H. Han, C. Otto, X. Liu, and A. Jain. Demographic estimation from face images: Human vs. machine performance. IEEE Trans. on PAMI, 2014.
[9] S. Jagabathula, L. Subramanian, and A. Venkataraman. Reputation-based worker filtering in crowdsourcing. In NIPS, 2014.
[10] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In NIPS, 2011.
[11] H. Li and B. Yu. Error rate bounds and iterative weighted majority voting for crowdsourcing. arXiv preprint arXiv:1411.4086, 2014.
[12] Q. Liu, J. Peng, and A. Ihler. 
Variational inference for crowdsourcing. In NIPS, 2012.
[13] J. R. Michael, W. R. Schucany, and R. W. Haas. Generating random variates using transformations with multiple roots. The American Statistician, 30(2):88–90, 1976.
[14] N. G. Polson and S. L. Scott. Data augmentation for support vector machines. Bayesian Analysis, 6(1):1–23, 2011.
[15] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. JMLR, 11:1297–1322, 2010.
[16] T. Shi and J. Zhu. Online Bayesian passive-aggressive learning. In ICML, 2014.
[17] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In EMNLP, 2008.
[18] T. Tian and J. Zhu. Uncovering the latent structures of crowd labeling. In PAKDD, 2015.
[19] P. Welinder, S. Branson, P. Perona, and S. J. Belongie. The multidimensional wisdom of crowds. In NIPS, 2010.
[20] J. Whitehill, T. F. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS, 2009.
[21] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In AAAI, 2005.
[22] O. F. Zaidan and C. Callison-Burch. Crowdsourcing translation: Professional quality from non-professionals. In ACL, 2011.
[23] Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In NIPS, 2014.
[24] D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In NIPS, 2012.
[25] D. Zhou, Q. Liu, J. Platt, and C. Meek. Aggregating ordinal labels from crowds by minimax conditional entropy. In ICML, 2014.
[26] J. Zhu, N. Chen, H. Perkins, and B. Zhang. Gibbs max-margin topic models with data augmentation. JMLR, 15:1073–1110, 2014.
[27] J. 
Zhu, N. Chen, and E. P. Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. JMLR, 15:1799–1847, 2014.