{"title": "Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 4439, "page_last": 4447, "abstract": "We consider a crowdsourcing model in which n workers are asked to rate the quality of n items previously generated by other workers. An unknown set of $\\alpha n$ workers generate reliable ratings, while the remaining workers may behave arbitrarily and possibly adversarially. The manager of the experiment can also manually evaluate the quality of a small number of items, and wishes to curate together almost all of the high-quality items with at most an $\\epsilon$ fraction of low-quality items. Perhaps surprisingly, we show that this is possible with an amount of work required of the manager, and each worker, that does not scale with n: the dataset can be curated with $\\tilde{O}(1/\\beta\\alpha^3\\epsilon^4)$ ratings per worker, and $\\tilde{O}(1/\\beta\\epsilon^2)$ ratings by the manager, where $\\beta$ is the fraction of high-quality items. Our results extend to the more general setting of peer prediction, including peer grading in online classrooms.", "full_text": "Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction\n\nJacob Steinhardt\nStanford University\n\nGregory Valiant\nStanford University\n\nMoses Charikar\nStanford University\n\nAbstract\n\nWe consider a crowdsourcing model in which n workers are asked to rate the quality of n items previously generated by other workers. An unknown set of αn workers generate reliable ratings, while the remaining workers may behave arbitrarily and possibly adversarially. The manager of the experiment can also manually evaluate the quality of a small number of items, and wishes to curate together almost all of the high-quality items with at most an ε fraction of low-quality items. 
Perhaps surprisingly, we show that this is possible with an amount of work required of the manager, and each worker, that does not scale with n: the dataset can be curated with Õ(1/βα³ε⁴) ratings per worker, and Õ(1/βε²) ratings by the manager, where β is the fraction of high-quality items. Our results extend to the more general setting of peer prediction, including peer grading in online classrooms.\n\n1 Introduction\n\nHow can we reliably obtain information from humans, given that the humans themselves are unreliable, and might even have incentives to mislead us? Versions of this question arise in crowdsourcing (Vuurens et al., 2011), collaborative knowledge generation (Priedhorsky et al., 2007), peer grading in online classrooms (Piech et al., 2013; Kulkarni et al., 2015), aggregation of customer reviews (Harmon, 2004), and the generation/curation of large datasets (Deng et al., 2009). A key challenge is to ensure high information quality despite the fact that many people interacting with the system may be unreliable or even adversarial. This is particularly relevant when raters have an incentive to collude and cheat, as in the setting of peer grading, as well as for reviews on sites like Amazon and Yelp, where artists and firms are incentivized to manufacture positive reviews for their own products and negative reviews for their rivals (Harmon, 2004; Mayzlin et al., 2012).\nOne approach to ensuring quality is to use gold sets — questions where the answer is known, which can be used to assess reliability on unknown questions. However, this is overly constraining — it does not make sense for open-ended tasks such as knowledge generation on Wikipedia, nor even for crowdsourcing tasks such as “translate this paragraph” or “draw an interesting picture” where there are different equally good answers. 
This approach may also fail in settings, such as peer grading in massive open online courses, where students might collude to inflate their grades.\nIn this work, we consider the challenge of using crowdsourced human ratings to accurately and efficiently evaluate a large dataset of content. In some settings, such as peer grading, the end goal is to obtain the accurate evaluation of each datum; in other settings, such as the curation of a large dataset, accurate evaluations could be leveraged to select a high-quality subset of a larger set of variable-quality (perhaps crowd-generated) data.\nThere are several confounding difficulties that arise in extracting accurate evaluations. First, many raters may be unreliable and give evaluations that are uncorrelated with the actual item quality; second, some reliable raters might be harsher or more lenient than others; third, some items may be harder to evaluate than others and so error rates could vary from item to item, even among reliable raters; finally, some raters may even collude or want to hack the system. \n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n
This raises the question: can\nwe obtain information from the reliable raters, without knowing who they are a priori?\nIn this work, we answer this question in the af\ufb01rmative, under surprisingly weak assumptions:\n\n\u2022 We do not assume that the majority of workers are reliable.\n\u2022 We do not assume that the unreliable workers conform to any statistical model; they could\nbehave fully adversarially, in collusion with each other and with full knowledge of how the\nreliable workers behave.\n\u2022 We do not assume that the reliable worker ratings match the true ratings, but only that they\nare \u201capproximately monotonic\u201d in the true ratings, in a sense that will be formalized later.\n\u2022 We do not assume that there is a \u201cgold set\u201d of items with known ratings presented to each\nuser (as an adversary could identify and exploit this). Instead, we rely on a small number of\nreliable judgments on randomly selected items, obtained after the workers submit their own\nratings; in practice, these could be obtained by rating those items oneself.\n\nFor concreteness, we describe a simple formalization of the crowdsourcing setting (our actual results\nhold in a more general setting). We imagine that we are the dataset curator, so that \u201cus\u201d and \u201courselves\u201d\nrefers in general to whoever is curating the data. There are n raters and m items to evaluate, which\nhave an unknown quality level in [0, 1]. At least \u03b1n workers are \u201creliable\u201d in that their judgments\nmatch our own in expectation, and they make independent errors. We assign each worker to evaluate\nat most k randomly selected items. In addition, we ourselves judge k0 items. Our goal is to recover\nthe \u03b2-quantile: the set T \u2217 of the \u03b2m highest-quality items. Our main result implies the following:\nTheorem 1. In the setting above, suppose n = m. 
Then there is k = O(1/βα³ε⁴) and k0 = Õ(1/βε²) such that, with probability 99%, we can identify βm items with average quality only ε worse than T∗.\n\nInterestingly, the amount of work that each worker (and we ourselves) has to do does not grow with n; it depends only on the fraction α of reliable workers and the desired accuracy ε. While the number of evaluations k for each worker is likely not optimal, we note that the amount of work k0 required of us is close to optimal: for α ≤ β, it is information-theoretically necessary for us to evaluate Ω(1/βε²) items, via a reduction to estimating noisy coin flips.\nWhy is it necessary to include some of our own ratings? If we did not, and α < 1/2, then an adversary could create a set of dishonest raters that were identical to the reliable raters except with the item indices permuted by a random permutation of {1, . . . , m}. In this case, there is no way to distinguish the honest from the dishonest raters except by breaking the symmetry with our own ratings.\nOur main result holds in a considerably more general setting where we require a weaker form of inter-rater agreement — for example, our results hold even if some of the reliable raters are harsher than others, as long as the expected ratings induce approximately the same ranking. The focus on quantiles rather than raw ratings is what enables this. 
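The permutation attack just described is easy to see in a toy simulation. The following sketch is ours, not the paper's (all helper names are hypothetical): honest raters report r∗ exactly, while adversaries report r∗ with the item indices permuted; any statistic that ignores item identity cannot separate the two groups, but a few of our own spot-checks can.

```python
import random

def build_ratings(r_star, n_honest, n_adv, seed=0):
    """Honest rows copy r_star; adversarial rows copy r_star with the item
    indices permuted by a single shared random permutation pi."""
    rng = random.Random(seed)
    m = len(r_star)
    pi = list(range(m))
    rng.shuffle(pi)
    honest = [list(r_star) for _ in range(n_honest)]
    adversarial = [[r_star[pi[j]] for j in range(m)] for _ in range(n_adv)]
    return honest, adversarial

r_star = [1, 1, 1, 0, 0, 0, 0, 0]  # true (binary) item qualities
honest, adversarial = build_ratings(r_star, n_honest=3, n_adv=3)

# Item-identity-free statistics are identical across the two groups:
assert sorted(honest[0]) == sorted(adversarial[0])

# Our own ratings break the symmetry: honest rows agree with r_star on
# every spot-checked item, while adversarial rows in general do not.
checked = [0, 1, 2]
assert all(honest[0][j] == r_star[j] for j in checked)
```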
Note that once we estimate the quantiles, we can approximately recover the ratings by evaluating a few items in each quantile.\n\nFigure 1: Illustration of our problem setting. We observe a small number of ratings from each rater (indicated in blue), which we represent as entries in a matrix Ã (unobserved ratings in red, treated as zero by our algorithm). There is also a true rating r∗ that we would assign to each item; by rating some items ourself, we observe some entries of r∗ (also in blue). Our goal is to recover the set T∗ representing the top β fraction of items under r∗. As an intermediate step, we approximately recover a matrix M∗ that indicates the top items for each individual rater.\n\nOur technical tools draw on semidefinite programming methods for matrix completion, which have been used to study graph clustering as well as community detection in the stochastic block model (Holland et al., 1983; Condon and Karp, 2001). Our setting corresponds to the sparse case of graphs with constant degree, which has recently seen great interest (Decelle et al., 2011; Mossel et al., 2012; 2013b;a; Massoulié, 2014; Guédon and Vershynin, 2014; Mossel et al., 2015; Chin et al., 2015; Abbe and Sandon, 2015a;b; Makarychev et al., 2015). Makarychev et al. 
(2015) in particular provide an\nalgorithm that is robust to adversarial perturbations, but only if the perturbation has size o(n); see\nalso Cai and Li (2015) for robustness results when the degree of the graph is logarithmic.\nSeveral authors have considered semirandom settings for graph clustering, which allow for some\ntypes of adversarial behavior (Feige and Krauthgamer, 2000; Feige and Kilian, 2001; Coja-Oghlan,\n2004; Krivelevich and Vilenchik, 2006; Coja-Oghlan, 2007; Makarychev et al., 2012; Chen et al.,\n2014; Gu\u00e9don and Vershynin, 2014; Moitra et al., 2015; Agarwal et al., 2015). In our setting, these\nsemirandom models are unsuitable because they rule out important types of strategic behavior, such\nas an adversary rating some items accurately to gain credibility. By allowing arbitrary behavior\nfrom the adversary, we face a key technical challenge: while previous analyses consider errors\nrelative to a ground truth clustering, in our setting the ground truth only exists for rows of the matrix\ncorresponding to reliable raters, while the remaining rows could behave arbitrarily even in the limit\nwhere all ratings are observed. This necessitates a more careful analysis, which helps to clarify what\nproperties of a clustering are truly necessary for identifying it.\n\n2 Algorithm and Intuition\n\nWe now describe our recovery algorithm. To \ufb01x notation, we assume that there are n raters and m\nitems, and that we observe a matrix \u02dcA \u2208 [0, 1]n\u00d7m: \u02dcAij = 0 if rater i does not rate item j, and\notherwise \u02dcAij is the assigned rating, which takes values in [0, 1]. In the settings we care about \u02dcA is\nvery sparse \u2014 each rater only rates a few items. 
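Concretely, the sparse observation matrix just described might be stored as follows. This is a minimal sketch of the representation (ours, not code from the paper), in which an entry of zero does double duty for "unrated", exactly as in the definition of Ã above.

```python
import random

def observe(a_star, k, seed=0):
    """Sample a sparse rating matrix: each rater rates k randomly chosen
    items (here noiselessly, for simplicity); all other entries are 0."""
    rng = random.Random(seed)
    n, m = len(a_star), len(a_star[0])
    A = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in rng.sample(range(m), k):
            A[i][j] = a_star[i][j]
    return A

a_star = [[(i + j) % 2 for j in range(12)] for i in range(4)]
A = observe(a_star, k=3)
# At most k entries per row are nonzero; two rows will typically share few
# or no rated items, which is why pairwise comparison of rows fails.
assert all(sum(v != 0 for v in row) <= 3 for row in A)
```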
Remember that our goal is to recover the β-quantile T∗ of the best items according to our own rating.\nOur algorithm is based on the following intuition: the reliable raters must (approximately) agree on the ranking of items, and so if we can cluster the rows of Ã appropriately, then the reliable raters should form a single very large cluster (of size αn). There can be at most 1/α disjoint clusters of this size, and so we can manually check the accuracy of each large cluster (by checking agreement with our own rating on a few randomly selected items) and then choose the best one.\nOne major challenge in using the clustering intuition is the sparsity of Ã: any two rows of Ã will almost certainly have no ratings in common, so we must exploit the global structure of Ã to discover clusters, rather than using pairwise comparisons of rows. The key is to view our problem as a form of noisy matrix completion — we imagine a matrix A∗ in which all the ratings have been filled in and all noise from individual ratings has been removed. 
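The cluster-checking step in the intuition above can be sketched as follows (our own illustration, with hypothetical names): given at most about 1/α candidate clusters, each proposing a top-item set, we spot-check a few items from each proposal against our own ratings and keep the winner.

```python
import random

def pick_best_cluster(proposals, spot_check, checks=5, seed=0):
    """proposals: candidate top-item sets, one per large cluster (at most
    about 1/alpha of them). spot_check(j) is our own rating of item j.
    Return the proposal scoring highest on a few manually rated items."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for top_set in proposals:
        sample = rng.sample(sorted(top_set), min(checks, len(top_set)))
        score = sum(spot_check(j) for j in sample)
        if score > best_score:
            best, best_score = top_set, score
    return best

r_star = [1.0] * 5 + [0.0] * 15           # items 0-4 are the good ones
honest_proposal = set(range(5))           # reliable raters' top set
colluders_proposal = set(range(15, 20))   # a colluding cluster's top set
chosen = pick_best_cluster([honest_proposal, colluders_proposal],
                           spot_check=lambda j: r_star[j])
assert chosen == honest_proposal
```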
We define a matrix M∗ that indicates the top βm items in each row of A∗: M∗_ij = 1 if item j has one of the top βm ratings from rater i, and M∗_ij = 0 otherwise (this differs from the actual definition of M∗ given in Section 4, but is the same in spirit). If we could recover M∗, we would be close to obtaining the clustering we wanted.\n\nAlgorithm 1 Algorithm for recovering β-quantile matrix M̃ using (unreliable) ratings Ã.\n1: Parameters: reliable fraction α, quantile β, tolerance ε, number of raters n, number of items m\n2: Input: noisy rating matrix Ã\n3: Let M̃ be the solution of the optimization problem (1):\n\nmaximize ⟨Ã, M⟩, subject to 0 ≤ M_ij ≤ 1 ∀i, j; Σ_j M_ij ≤ βm ∀i; ‖M‖∗ ≤ (2/αε)√(αβnm),  (1)\n\nwhere ‖·‖∗ denotes nuclear norm.\n4: Output M̃.\n\nAlgorithm 2 Algorithm for recovering an accurate β-quantile T from the β-quantile matrix M̃.\n1: Parameters: tolerance ε, reliable fraction α\n2: Input: matrix M̃ of approximate β-quantiles, noisy ratings r̃\n3: Select 2 log(2/δ)/α indices i ∈ [n] at random.\n4: Let i∗ be the index among these for which ⟨M̃_i, r̃⟩ is largest, and let T0 ← M̃_i∗. ▷ T0 ∈ [0, 1]^m\n5: do T ← RandomizedRound(T0) while ⟨T − T0, r̃⟩ < −(ε/4)βk0 ▷ T ∈ {0, 1}^m\n6: return T\n\nThe key observation that allows us to approximate M∗ given only the noisy, incomplete Ã is that M∗ has low-rank structure: since all of the reliable raters agree with each other, their rows in M∗ are all identical, and so there is an (αn) × m submatrix of M∗ with rank 1. 
This inspires the low-rank matrix completion algorithm for recovering M̃ given in Algorithm 1. Each row of M is constrained to have sum at most βm, and M as a whole is constrained to have nuclear norm ‖M‖∗ at most (2/αε)√(αβnm). Recall that the nuclear norm is the sum of the singular values of M; in the same way that the ℓ1-norm is a convex surrogate for the ℓ0-norm, the nuclear norm acts as a convex surrogate for the rank of M (i.e., the number of non-zero singular values). The optimization problem (1) therefore chooses a set of βm items in each row to maximize the corresponding values in Ã, while constraining the item sets to have low rank (where low rank is relaxed to low nuclear norm to obtain a convex problem). This low-rank constraint acts as a strong regularizer that quenches the noise in Ã.\nOnce we have recovered M̃ using Algorithm 1, it remains to recover a specific set T that approximates the β-quantile according to our ratings. Algorithm 2 provides a recipe for doing so: first, rate k0 items at random, obtaining the vector r̃: r̃_j = 0 if we did not rate item j, and otherwise r̃_j is the (possibly noisy) rating that we assign to item j. Next, score each row M̃_i based on the noisy ratings, computing Σ_j M̃_ij r̃_j, and let T0 be the highest-scoring M̃_i among O(log(2/δ)/α) randomly selected i. Finally, randomly round the vector T0 ∈ [0, 1]^m to a discrete vector T ∈ {0, 1}^m, and treat T as the indicator function of a set approximating the β-quantile (see Section 5 for details of the rounding algorithm).\nIn summary, given a noisy rating matrix Ã, we will first run Algorithm 1 to recover a β-quantile matrix M̃ for each rater, and then run Algorithm 2 to recover our personal β-quantile using M̃.\nPossible attacks by adversaries. 
In our algorithm, the adversaries can influence M̃_i for reliable raters i via the nuclear norm constraint (note that the other constraints are separable across rows). This makes sense because the nuclear norm is what causes us to pool global structure across raters (and thus potentially pool bad information). In order to limit this influence, the constraint on the nuclear norm is weaker than is typical by a factor of 2/ε; it is not clear to us whether this is actually necessary or due to a loose analysis.\nThe constraint Σ_j M_ij ≤ βm is not typical in the literature. For instance, Chen et al. (2014) place no constraint on the sum of each row in M (they instead normalize M̃ to lie in [−1, 1]^{n×m}, which recovers the items with positive rating rather than the β-quantile). Our row normalization constraint prevents an attack in which a spammer rates a random subset of items as high as possible and rates the remaining items as low as possible. If the actual set of high-quality items has density much smaller than 50%, then the spammer gains undue influence relative to honest raters that only rate e.g. 10% of the items highly. Normalizing M to have a fixed row sum prevents this; see Section B for details.\n\n3 Assumptions and Approach\n\nWe now state our assumptions more formally, state the general form of our results, and outline the key ingredients of the proof. In our setting, we can query a rater i ∈ [n] and item j ∈ [m] to obtain a rating Ã_ij ∈ [0, 1]. Let r∗ ∈ [0, 1]^m denote the vector of true ratings of the items. We can also query an item j (by rating it ourself) to obtain a noisy rating r̃_j such that E[r̃_j] = r∗_j.\nLet C ⊆ [n] be the set of reliable raters, where |C| ≥ αn. Our main assumption is that the reliable raters make independent errors:\nAssumption 1 (Independence). 
When we query a pair (i, j) with i ∈ C, we obtain an output Ã_ij whose value is independent of all of the other queries so far. Similarly, when we query an item j, we obtain an output r̃_j that is independent of all of the other queries so far.\n\nAlgorithm 3 Algorithm for obtaining (unreliable) ratings matrix Ã and noisy ratings r̃, r̃′.\n1: Input: number of raters n, number of items m, and number of ratings k and k0.\n2: Initially assign each rater to each item independently with probability k/m.\n3: For each rater with more than 2k items, arbitrarily unassign items until there are 2k remaining.\n4: For each item with more than 2k raters, arbitrarily unassign raters until there are 2k remaining.\n5: Have the raters submit ratings of their assigned items, and let Ã denote the resulting matrix of ratings, with missing entries filled in with zeros.\n6: Generate r̃ by rating items with probability k0/m (fill in missing entries with zeros).\n7: Output Ã, r̃.\n\nNote that Assumption 1 allows the unreliable ratings to depend on all previous ratings and also allows arbitrary collusion among the unreliable raters. In our algorithm, we will generate our own ratings after querying everyone else, which ensures that at least r̃ is independent of the adversaries.\nWe need a way to formalize the idea that the reliable raters agree with us. To this end, for i ∈ C let A∗_ij = E[Ã_ij] be the expected rating that rater i assigns to item j. We want A∗ to be roughly increasing in r∗:\nDefinition 1 (Monotonic raters). We say that the reliable raters are (L, ε)-monotonic if\n\nr∗_j − r∗_j′ ≤ L · (A∗_ij − A∗_ij′) + ε  (2)\n\nwhenever r∗_j ≥ r∗_j′, for all i ∈ C and all j, j′ ∈ [m].  (3)\n\nThe (L, ε)-monotonicity property says that if we think that one item is substantially better than another item, the reliable raters should think so as well. As an example, suppose that our own ratings are binary (r∗_j ∈ {0, 1}) and that each rating Ã_ij matches r∗_j with probability 3/5. Then A∗_ij = 2/5 + (1/5)r∗_j, and hence the ratings are (5, 0)-monotonic. In general, the monotonicity property is fairly mild — if the reliable ratings are not (L, ε)-monotonic, it is not clear that they should even be called reliable!\nAlgorithm for collecting ratings. Under the model given in Assumption 1, our algorithm for collecting ratings is given in Algorithm 3. Given integers k and k0, Algorithm 3 assigns each rater at most 2k ratings, and assigns us k0 ratings in expectation. The output is a noisy rating matrix Ã ∈ [0, 1]^{n×m} as well as a noisy rating vector r̃ ∈ [0, 1]^m. Our main result states that we can use Ã and r̃ to estimate the β-quantile T∗; throughout we will assume that m is at least n.\nTheorem 2. Let m ≥ n. Suppose that Assumption 1 holds, that the reliable raters are (L, ε0)-monotonic, and that we run Algorithm 3 to obtain noisy ratings. Then there is k = O(log³(2/δ)/(βα³ε⁴) · m/n) and k0 = O(log(2/αβεδ)/(βε²)) such that, with probability 1 − δ, Algorithms 1 and 2 output a set T with\n\n(1/βm) (Σ_{j∈T∗} r∗_j − Σ_{j∈T} r∗_j) ≤ (2L + 1) · ε + 2ε0.\n\nNote that the amount of work for the raters scales as m/n. Some dependence on m/n is necessary, since we need to make sure that every item gets rated at least once.\nThe proof of Theorem 2 can be split into two parts: analyzing Algorithm 1 (Section 4), and analyzing Algorithm 2 (Section 5). 
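The assignment step of Algorithm 3 is simple enough to simulate directly; the sketch below is ours, not the paper's, and makes the cap of 2k ratings per rater and per item explicit.

```python
import random

def assign_ratings(n, m, k, seed=0):
    """Sketch of Algorithm 3's assignment step: assign each rater to each
    item independently with probability k/m, then unassign arbitrarily so
    that no rater has more than 2k items and no item more than 2k raters."""
    rng = random.Random(seed)
    assigned = {(i, j) for i in range(n) for j in range(m)
                if rng.random() < k / m}
    for i in range(n):  # cap each rater at 2k items
        items = [j for j in range(m) if (i, j) in assigned]
        for j in items[2 * k:]:
            assigned.discard((i, j))
    for j in range(m):  # cap each item at 2k raters
        raters = [i for i in range(n) if (i, j) in assigned]
        for i in raters[2 * k:]:
            assigned.discard((i, j))
    return assigned

pairs = assign_ratings(n=50, m=50, k=4)
assert max(sum(1 for (i, j) in pairs if i == r) for r in range(50)) <= 8
assert max(sum(1 for (i, j) in pairs if j == c) for c in range(50)) <= 8
```

Note that the second pass can only remove pairs, so both caps hold simultaneously at the end.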
At a high level, analyzing Algorithm 1 involves showing that the nuclear norm constraint in (1) imparts sufficient noise robustness while not allowing the adversary too much influence over the reliable rows of M̃. Analyzing Algorithm 2 is far more straightforward, and requires only standard concentration inequalities and a standard randomized rounding idea (though the latter is perhaps not well-known, so we will explain it briefly in Section 5).\n\n4 Recovering M̃ (Algorithm 1)\n\nThe goal of this section is to show that solving the optimization problem (1) recovers a matrix M̃ that approximates the β-quantile of r∗ in the following sense:\n
(For a matrix X we let XC denote the rows indexed by C and XC the remaining rows.)\nIn a bit more detail, if we let M\u2217 denote the \u201cideal\u201d value of \u02dcM, and B denote a denoised version\nof \u02dcA, we \ufb01rst show (Lemma 1) that (cid:104)B, \u02dcM \u2212 M\u2217(cid:105) \u2265 \u2212\u0001(cid:48) for some \u0001(cid:48) determined below. This is\nestablished via the matrix concentration inequalities in Le et al. (2015). Lemma 1 would already\nsuf\ufb01ce for standard approaches (e.g., Gu\u00e9don and Vershynin, 2014), but in our case we must grapple\nwith the issue that the rows of B could be arbitrary outside of C, and hence closeness according to\nB may not imply actual closeness between \u02dcM and M\u2217. Our main technical contribution, Lemma 2,\nshows that (cid:104)BC, \u02dcMC \u2212 M\u2217\nC(cid:105) \u2265 (cid:104)B, \u02dcM \u2212 M\u2217(cid:105)\u2212 \u0001(cid:48); that is, closeness according to B implies closeness\naccording to BC. We can then restrict attention to the reliable raters, and obtain Proposition 1.\nPart 1: noise-robustness. Let B be the matrix satisfying BC = k\non C. The scaling k\nIdeally, we would like to have MC = RC, i.e., M matches T \u2217 on all the rows of C. In light of this,\nwe will let M\u2217 be the solution to the following \u201ccorrected\u201d program, which we don\u2019t have access to\n(since it involves knowledge of A\u2217 and C), but which will be useful for analysis purposes:\n\nm is chosen so that E[ \u02dcAC] \u2248 BC. Also de\ufb01ne R \u2208 Rn\u00d7m by Rij = T \u2217\nj .\n\nC, BC = \u02dcAC, which denoises \u02dcA\n\nm A\u2217\n\n(6)\n\n(7)\n\nmaximize (cid:104)B, M(cid:105),\nsubject to MC = RC,\n\n(cid:80)\njMij \u2264 \u03b2m \u2200i,\nij = T \u2217\n\n0 \u2264 Mij \u2264 1 \u2200i, j,\n(cid:107)M(cid:107)\u2217 \u2264 2\n\u03b1\u0001\n\n(cid:112)\u03b1\u03b2nm\n\nImportantly, (6) enforces M\u2217\n\nLemma 1. Let m \u2265 n. Suppose that Assumption 1 holds. 
Then there is a k = O(log³(2/δ)/(βα³ε⁴) · m/n) such that the solution M̃ to (1) performs nearly as well as M∗ under B; specifically, with probability 1 − δ,\n\n⟨B, M̃⟩ ≥ ⟨B, M∗⟩ − εαβkn.\n\nNote that M̃ is not necessarily feasible for (6), because of the constraint M_C = R_C; Lemma 1 merely asserts that M̃ approximates M∗ in objective value. The proof of Lemma 1, given in Section A.3, primarily involves establishing a uniform deviation result; if we let P denote the feasible set for (1), then we wish to show that |⟨Ã − B, M⟩| ≤ (1/2)εαβkn for all M ∈ P. This would imply that the objectives of (1) and (6) are essentially identical, and so optimizing one also optimizes the other. Using the inequality |⟨Ã − B, M⟩| ≤ ‖Ã − B‖_op ‖M‖∗, where ‖·‖_op denotes operator norm, it suffices to establish a matrix concentration inequality bounding ‖Ã − B‖_op. This bound follows from the general matrix concentration result of Le et al. (2015), stated in Section A.1.\nPart 2: bounding the influence of adversaries. We next show that the nuclear norm constraint does not give the adversaries too much influence over the de-noised program (6); this is the most novel aspect of our argument.\n\nFigure 2: Illustration of our Lagrangian duality argument, and of the role of Z. The blue region represents the nuclear norm constraint and the gray region the remaining constraints. Where the blue region slopes downwards, a decrease in M_C can be offset by an increase in M_C̄ when measuring ⟨B, M⟩. By linearizing the nuclear norm constraint, the vector B − Z accounts for this offset, and the red region represents the constraint ⟨B_C − Z_C, M∗_C − M_C⟩ ≤ ε, which will contain M̃.\n\nSuppose that the constraint on ‖M‖∗ were not present in (6). Then the adversaries would have no influence on M∗_C, because all the remaining constraints in (6) are separable across rows. How can we quantify the effect of this nuclear norm constraint? We exploit Lagrangian duality, which allows us to replace constraints with appropriate modifications to the objective function.\nTo gain some intuition, consider Figure 2. The key is that the Lagrange multiplier Z_C̄ can bound the amount that ⟨B, M⟩ can increase due to changing M outside of C. If we formalize this and analyze Z in detail, we obtain the following result:\nLemma 2. Let m ≥ n. Then there is a k = O(log³(2/δ)/(αβε²) · m/n) such that, with probability at least 1 − δ, there exists a matrix Z with rank(Z) = 1, ‖Z‖_F ≤ εk√(αβn/m), and\n\n⟨B_C − Z_C, M∗_C − M_C⟩ ≤ ⟨B, M∗ − M⟩ for all M ∈ P.  (8)\n\nBy localizing ⟨B, M∗ − M⟩ to C via (8), Lemma 2 bounds the effect that the adversaries can have on M̃_C. It is therefore the key technical tool powering our results, and is proved in Section A.4. Proposition 1 is proved from Lemmas 1 and 2 in Section A.5.\n\n5 Recovering T (Algorithm 2)\n\nIn this section we show that if M̃ satisfies the conclusion of Proposition 1, then Algorithm 2 recovers a set T that approximates T∗ well. 
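The randomized rounding subroutine invoked by Algorithm 2 (given below as Algorithm 4) has a short direct implementation. The sketch below is ours: assuming an input T0 ∈ [0, 1]^m with ‖T0‖₁ ≤ βm, it takes partial sums of T0, draws u uniformly from [0, 1), and sets T_j = 1 whenever some point u + z lands in [s_{j−1}, s_j); this gives E[T] = T0 with at most βm ones, as asserted by Lemma 4.

```python
import random

def randomized_round(t0, beta_m, u=None):
    """Round t0 in [0,1]^m (with sum(t0) <= beta_m) to a binary vector T
    with E[T] = t0 and at most beta_m ones: form partial sums s_j of t0,
    draw u ~ Uniform[0,1), and set T_j = 1 whenever some point u + z
    (z = 0, 1, ..., beta_m - 1) falls in [s_{j-1}, s_j)."""
    if u is None:
        u = random.random()
    s = [0.0]                       # s_0 = 0, s_j = t0[0] + ... + t0[j-1]
    for x in t0:
        s.append(s[-1] + x)
    T = [0] * len(t0)
    for z in range(beta_m):
        point = u + z
        for j in range(len(t0)):
            if s[j] <= point < s[j + 1]:
                T[j] = 1
                break                # at most one interval contains the point
    return T

t0 = [0.5, 0.25, 0.25, 0.75, 0.25]  # sums to 2 = beta_m
T = randomized_round(t0, beta_m=2, u=0.1)
assert sum(T) <= 2 and all(v in (0, 1) for v in T)

# E[T] = t0: averaging over a fine grid of u recovers t0 approximately.
grid = [randomized_round(t0, 2, u=(g + 0.5) / 1000) for g in range(1000)]
avg = [sum(col) / 1000 for col in zip(*grid)]
assert all(abs(a - b) < 0.01 for a, b in zip(avg, t0))
```

Each interval [s_{j−1}, s_j) has length t0_j ≤ 1, and the points u + z are spaced exactly 1 apart, so the interval is hit with probability exactly t0_j; this is why the averaging check above succeeds.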
We represent the sets T and T∗ as {0, 1}-vectors, and use the notation ⟨T, r⟩ to denote Σ_{j∈[m]} T_j r_j. Formally, we show the following:\nProposition 2. Suppose Assumption 1 holds. For some k0 = O(log(2/αβεδ)/(βε²)), with probability 1 − δ, Algorithm 2 outputs a set T satisfying\n\n⟨T∗ − T, r∗⟩ ≤ (2/|C|) Σ_{i∈C} ⟨T∗ − M̃_i, r∗⟩ + εβm.  (9)\n\nTo establish Proposition 2, first note that with probability 1 − δ/2, at least one of the 2 log(2/δ)/α randomly selected i from Algorithm 2 will have cost ⟨T∗ − M̃_i, r∗⟩ within twice the average cost across i ∈ C. This is because with probability α, a randomly selected i will lie in C, and with probability 1/2, an i ∈ C will have cost at most twice the average cost (by Markov's inequality).\nThe remainder of the proof hinges on two results. First, we establish a concentration bound showing that Σ_j M̃_ij r̃_j is close to (k0/m) Σ_j M̃_ij r∗_j for any fixed i, and hence (by a union bound) also for the 2 log(2/δ)/α randomly selected i. This yields the following lemma, which is a straightforward application of Bernstein's inequality (see Section A.6 for a proof):\nLemma 3. Let i∗ be the row selected in Algorithm 2. Suppose that r̃ satisfies Assumption 1. 
For some $k_0 = O\big(\frac{\log(2/\alpha\delta)}{\beta\epsilon^2}\big)$, with probability $1 - \delta$, we have

$\langle T^* - \tilde{M}_{i^*}, r^* \rangle \le \frac{2}{|C|} \Big( \sum_{i \in C} \langle T^* - \tilde{M}_i, r^* \rangle \Big) + \frac{\epsilon}{4} \beta m.$   (10)

Having recovered a good row $T_0 = \tilde{M}_{i^*}$, we need to turn $T_0$ into a binary vector so that Algorithm 2 can output a set; we do so via randomized rounding, obtaining a vector $T \in \{0,1\}^m$ such that $\mathbb{E}[T] = T_0$ (where the randomness is with respect to the choices made by the algorithm). Our rounding procedure is given in Algorithm 4; the following lemma, proved in A.7, asserts its correctness:

Lemma 4. The output $T$ of Algorithm 4 satisfies $\mathbb{E}[T] = T_0$ and $\|T\|_0 \le \beta m$.

Algorithm 4 Randomized rounding algorithm.
1: procedure RANDOMIZEDROUND($T_0$)   ▷ $T_0 \in [0,1]^m$ satisfies $\|T_0\|_1 \le \beta m$
2:   Let $s$ be the vector of partial sums of $T_0$   ▷ i.e., $s_j = (T_0)_1 + \cdots + (T_0)_j$
3:   Sample $u \sim \mathrm{Uniform}([0,1])$.
4:   $T \leftarrow [0, \ldots, 0] \in \mathbb{R}^m$
5:   for $z = 0$ to $\beta m - 1$ do
6:     Find $j$ such that $u + z \in [s_{j-1}, s_j)$, and set $T_j = 1$.   ▷ if no such $j$ exists, skip this step
7:   end for
8:   return $T$
9: end procedure

The remainder of the proof involves lower-bounding the probability that $T$ is accepted in each stage of the while loop in Algorithm 2. We refer the reader to Section A.8 for details.

6 Open Directions and Related Work

Future Directions. On the theoretical side, perhaps the most immediate open question is whether it is possible to improve the dependence of $k$ (the amount of work required per worker) on the parameters $\alpha$, $\beta$, and $\epsilon$. It is tempting to hope that when $m = n$ a tight result would have $k = \tilde{O}\big(\frac{1}{\alpha\beta\epsilon^2}\big)$, in loose analogy to recent results for the stochastic block model (Abbe and Sandon, 2015b;a; Banks and Moore, 2016). For stochastic block models, there is conjectured to be a gap between computational and information-theoretic thresholds, and it would be interesting to see if a similar phenomenon holds here (the scaling for $k$ given above is based on the conjectured computational threshold).

A second open question concerns the scaling in $n$: if $n \gg m$, can we get by with much less work per rater? Finally, it would be interesting to consider adaptivity: if the choice of queries is based on previous worker ratings, can we reduce the amount of work?

Related work. Our setting is closely related to the problem of peer prediction (Miller et al., 2005), in which we wish to obtain truthful information from a population of raters by exploiting inter-rater agreement. While several mechanisms have been proposed for these tasks, they typically assume that rater accuracy is observable online (Resnick and Sami, 2007), that the dishonest raters are rational agents maximizing a payoff function (Dasgupta and Ghosh, 2013; Kamble et al., 2015; Shnayder et al., 2016), that the raters follow a simple statistical model (Karger et al., 2014; Zhang et al., 2014; Zhou et al., 2015), or some combination of the above (Shah and Zhou, 2015; Shah et al., 2015). Ghosh et al. (2011) allow $o(n)$ adversaries to behave arbitrarily but require the rest to be stochastic. The work closest to ours is Christiano (2014; 2016), which studies online collaborative prediction in the presence of adversaries; roughly, when raters interact with an item they predict its quality and afterwards observe the actual quality; the goal is to minimize the number of incorrect predictions among the honest raters.
This differs from our setting in that (i) the raters are trying to learn the item qualities as part of the task, and (ii) there is no requirement to induce a final global estimate of the high-quality items, which is necessary for estimating quantiles. It seems possible, however, that there are theoretical ties between this setting and ours, which would be interesting to explore.

Acknowledgments. JS was supported by a Fannie & John Hertz Foundation Fellowship, an NSF Graduate Research Fellowship, and a Future of Life Institute grant. GV was supported by NSF CAREER award CCF-1351108, a Sloan Foundation Research Fellowship, and a research grant from the Okawa Foundation. MC was supported by NSF grants CCF-1565581, CCF-1617577, CCF-1302518 and a Simons Investigator Award.

References

E. Abbe and C. Sandon. Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms. arXiv, 2015a.
E. Abbe and C. Sandon. Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic BP, and the information-computation gap. arXiv, 2015b.
N. Agarwal, A. S. Bandeira, K. Koiliaris, and A. Kolla. Multisection in the stochastic block model using semidefinite programming. arXiv, 2015.
J. Banks and C. Moore. Information-theoretic thresholds for community detection in sparse networks. arXiv, 2016.
T. T. Cai and X. Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. The Annals of Statistics, 43(3):1027–1059, 2015.
Y. Chen, S. Sanghavi, and H. Xu. Improved graph clustering. IEEE Transactions on Information Theory, 2014.
P. Chin, A. Rao, and V. Vu. Stochastic block model and community detection in the sparse graphs: A spectral algorithm with optimal rate of recovery. In Conference on Learning Theory (COLT), 2015.
P. Christiano. Provably manipulation-resistant reputation systems. arXiv, 2014.
P. Christiano. Robust collaborative online learning. arXiv, 2016.
A. Coja-Oghlan. Coloring semirandom graphs optimally. Automata, Languages and Programming, 2004.
A. Coja-Oghlan. Solving NP-hard semirandom graph problems in polynomial expected time. Journal of Algorithms, 62(1):19–46, 2007.
A. Condon and R. M. Karp. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, pages 116–140, 2001.
A. Dasgupta and A. Ghosh. Crowdsourced judgement elicitation with endogenous proficiency. In WWW, 2013.
A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6), 2011.
J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
U. Feige and J. Kilian. Heuristics for semirandom graph problems. Journal of Computer and System Sciences, 63(4):639–671, 2001.
U. Feige and R. Krauthgamer. Finding and certifying a large hidden clique in a semirandom graph. Random Structures and Algorithms, 16(2):195–208, 2000.
A. Ghosh, S. Kale, and P. McAfee. Who moderates the moderators?: crowdsourcing abuse detection in user-generated content. In 12th ACM Conference on Electronic Commerce, pages 167–176, 2011.
O. Guédon and R. Vershynin. Community detection in sparse networks via Grothendieck's inequality. arXiv, 2014.
A. Harmon. Amazon glitch unmasks war of reviewers. New York Times, 2004.
P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: Some first steps. Social Networks, 1983.
V. Kamble, N. Shah, D. Marn, A. Parekh, and K. Ramachandran. Truth serums for massively crowdsourced evaluation tasks. arXiv, 2015.
D. R. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.
M. Krivelevich and D. Vilenchik. Semirandom models as benchmarks for coloring algorithms. In Meeting on Analytic Algorithmics and Combinatorics, pages 211–221, 2006.
C. Kulkarni, P. W. Koh, H. Huy, D. Chia, K. Papadopoulos, J. Cheng, D. Koller, and S. R. Klemmer. Peer and self assessment in massive online classes. Design Thinking Research, pages 131–168, 2015.
C. M. Le, E. Levina, and R. Vershynin. Concentration and regularization of random graphs. arXiv, 2015.
K. Makarychev, Y. Makarychev, and A. Vijayaraghavan. Approximation algorithms for semi-random partitioning problems. In Symposium on Theory of Computing (STOC), pages 367–384, 2012.
K. Makarychev, Y. Makarychev, and A. Vijayaraghavan. Learning communities in the presence of errors. arXiv, 2015.
L. Massoulié. Community detection thresholds and the weak Ramanujan property. In STOC, 2014.
D. Mayzlin, Y. Dover, and J. A. Chevalier. Promotional reviews: An empirical investigation of online review manipulation. Technical report, National Bureau of Economic Research, 2012.
N. Miller, P. Resnick, and R. Zeckhauser. Eliciting informative feedback: The peer-prediction method. Management Science, 51(9):1359–1373, 2005.
A. Moitra, W. Perry, and A. S. Wein. How robust are reconstruction thresholds for community detection? arXiv, 2015.
E. Mossel, J. Neeman, and A. Sly. Stochastic block models and reconstruction. arXiv, 2012.
E. Mossel, J. Neeman, and A. Sly. Belief propagation, robust reconstruction, and optimal recovery of block models. arXiv, 2013a.
E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. arXiv, 2013b.
E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for the planted bisection model. In STOC, 2015.
C. Piech, J. Huang, Z. Chen, C. Do, A. Ng, and D. Koller. Tuned models of peer assessment in MOOCs. arXiv, 2013.
R. Priedhorsky, J. Chen, S. T. K. Lam, K. Panciera, L. Terveen, and J. Riedl. Creating, destroying, and restoring value in Wikipedia. In International ACM Conference on Supporting Group Work, pages 259–268, 2007.
P. Resnick and R. Sami. The influence limiter: provably manipulation-resistant recommender systems. In ACM Conference on Recommender Systems, pages 25–32, 2007.
N. Shah, D. Zhou, and Y. Peres. Approval voting and incentives in crowdsourcing. In ICML, 2015.
N. B. Shah and D. Zhou. Double or nothing: Multiplicative incentive mechanisms for crowdsourcing. In Advances in Neural Information Processing Systems (NIPS), 2015.
V. Shnayder, R. Frongillo, A. Agarwal, and D. C. Parkes. Strong truthfulness in multi-task peer prediction, 2016.
J. Vuurens, A. P. de Vries, and C. Eickhoff. How much spam can you take? An analysis of crowdsourcing results to increase accuracy. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval, 2011.
Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. arXiv, 2014.
D. Zhou, Q. Liu, J. C. Platt, C. Meek, and N. B. Shah. Regularized minimax conditional entropy for crowdsourcing. arXiv, 2015.