{"title": "Crowdsourced Clustering: Querying Edges vs Triangles", "book": "Advances in Neural Information Processing Systems", "page_first": 1316, "page_last": 1324, "abstract": "We consider the task of clustering items using answers from non-expert crowd workers. In such cases, the workers are often not able to label the items directly, however, it is reasonable to assume that they can compare items and judge whether they are similar or not. An important question is what queries to make, and we compare two types: random edge queries, where a pair of items is revealed, and random triangles, where a triple is. Since it is far too expensive to query all possible edges and/or triangles, we need to work with partial observations subject to a fixed query budget constraint. When a generative model for the data is available (and we consider a few of these) we determine the cost of a query by its entropy; when such models do not exist we use the average response time per query of the workers as a surrogate for the cost. In addition to theoretical justification, through several simulations and experiments on two real data sets on Amazon Mechanical Turk, we empirically demonstrate that, for a fixed budget, triangle queries uniformly outperform edge queries. Even though, in contrast to edge queries, triangle queries reveal dependent edges, they provide more reliable edges and, for a fixed budget, many more of them. 
We also provide a sufficient condition on the number of observations, edge densities inside and outside the clusters and the minimum cluster size required for the exact recovery of the true adjacency matrix via triangle queries using a convex optimization-based clustering algorithm.", "full_text": "Crowdsourced Clustering: Querying Edges vs\n\nTriangles\n\nRamya Korlakai Vinayak\n\nDepartment of Electrical Engineering\n\nCaltech, Pasadena\n\nramya@caltech.edu\n\nBabak Hassibi\n\nhassibi@systems.caltech.edu\n\nDepartment of Electrical Engineering\n\nCaltech, Pasadena\n\nAbstract\n\nWe consider the task of clustering items using answers from non-expert crowd\nworkers. In such cases, the workers are often not able to label the items directly,\nhowever, it is reasonable to assume that they can compare items and judge whether\nthey are similar or not. An important question is what queries to make, and we\ncompare two types: random edge queries, where a pair of items is revealed, and\nrandom triangles, where a triple is. Since it is far too expensive to query all possible\nedges and/or triangles, we need to work with partial observations subject to a \ufb01xed\nquery budget constraint. When a generative model for the data is available (and we\nconsider a few of these) we determine the cost of a query by its entropy; when such\nmodels do not exist we use the average response time per query of the workers\nas a surrogate for the cost. In addition to theoretical justi\ufb01cation, through several\nsimulations and experiments on two real data sets on Amazon Mechanical Turk,\nwe empirically demonstrate that, for a \ufb01xed budget, triangle queries uniformly\noutperform edge queries. Even though, in contrast to edge queries, triangle queries\nreveal dependent edges, they provide more reliable edges and, for a \ufb01xed budget,\nmany more of them. 
We also provide a sufficient condition on the number of observations, edge densities inside and outside the clusters and the minimum cluster size required for the exact recovery of the true adjacency matrix via triangle queries using a convex optimization-based clustering algorithm.\n\n1 Introduction\nCollecting data from non-expert workers on crowdsourcing platforms such as Amazon Mechanical Turk, Zooniverse, Planet Hunters, etc. for various applications has recently become quite popular. Applications range from creating labeled datasets for training and testing supervised machine learning algorithms [1, 2, 3, 4, 5, 6] to making scientific discoveries [7, 8]. Since the workers on crowdsourcing platforms are often non-experts, the answers obtained will invariably be noisy. Therefore the problem of designing queries and inferring quality data from such non-expert crowd workers is of great importance.\nAs an example, consider the task of collecting labels of images, e.g., of birds or dogs of different kinds and breeds. To label the image of a bird or dog, a worker should either have some expertise regarding the bird species and dog breeds, or should be trained on how to label each of them. Since hiring experts or training non-experts is expensive, we focus on collecting labels of images through image comparison followed by clustering. Instead of asking a worker to label an image of a bird, we can show her two images of birds and ask: \u201cDo these two birds belong to the same species?\u201d (Figure 1(a)). Answering this comparison question is much easier than the labeling task and does not require expertise or training. 
Though different workers might use different criteria for comparison, e.g., color of feathers, shape, size, etc., the hope is that, averaged over the crowd workers, we will be able to reasonably resolve the clusters (and label each).\nConsider a graph of n images that needs to be clustered, where each pairwise comparison is an \u2018edge query\u2019. Since the number of edges grows as O(n^2), it is too expensive to query all edges. Instead, we want to query a subset of the edges, based on our total query budget, and cluster the resulting partially observed graph. Of course, since the workers are non-experts, their answers will be noisy, and this should be taken into account when designing the queries. For example, it is not clear what the best strategy is for choosing the subset of edges to be queried.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Example of (a) an edge query (\u201cDo these two birds belong to the same species?\u201d) and (b) a triangle query (\u201cWhich of these birds belong to the same species?\u201d).\n\n1.1 Our Contribution\nIn this work we compare two ways of partially observing the graph: random edge queries, where a pair of items is revealed for comparison, and random triangle queries, where a triplet is revealed. We give intuitive generative models for the data obtained for both types of queries. Based on these models we determine the cost of a query to be its entropy (the information obtained from the response to the query). On real data sets where such a generative model may not be known, we use the average response time per query as a surrogate for the cost of the query. To fairly compare the use of edge vs. triangle queries we fix the total budget, defined as the (aforementioned) cost per query times the total number of queries. 
Empirical evidence, based on extensive simulations, as well as two real data sets (images of birds and dogs, respectively), strongly suggests that, for a fixed query budget, querying for triangles significantly outperforms querying for edges. Even though, in contrast to edge queries that give information on independent edges, triangle queries give information on dependent edges, i.e., edges that share vertices, we (theoretically and empirically) argue that triangle queries are superior because (1) they allow for far more edges to be revealed, given a fixed query budget, and (2) due to the self-correcting nature of triangle queries, they result in much more reliable edges.\nFurthermore, for a specific convex optimization-based clustering algorithm, we also provide a theoretical guarantee for the exact recovery of the true adjacency matrix via random triangle queries, which gives a sufficient condition on the number of queries, edge densities inside and outside the clusters and the minimum cluster size. In particular, we show that the lower bound of \u03a9(\u221an) on the cluster size still holds even though the edges revealed via triangle queries are not independent.\n1.2 Problem Setup\nConsider n items with K disjoint classes/clusters plus outliers (items that do not belong to any cluster). Consider a graph with these n items as nodes. In the true underlying graph G\u2217, all the items in the same cluster are connected to each other, and items that are not in the same cluster are not connected to each other. We do not have access to G\u2217. Instead we have a crowdsourced query mechanism that can be used to observe a noisy and partial snapshot Gobs of this graph. Our goal is to find the cluster assignments from Gobs. 
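As a concrete illustration of this setup, the two query mechanisms and the partially filled adjacency matrix can be sketched in a few lines of Python. This is a minimal sketch; the function names and the 0/1 answer encoding are ours, not from the paper.

```python
import itertools
import random

def sample_edge_queries(n, E, seed=0):
    """Sample E distinct pairs uniformly at random from the C(n,2) possible edges."""
    rng = random.Random(seed)
    return rng.sample(list(itertools.combinations(range(n), 2)), E)

def sample_triangle_queries(n, T, seed=0):
    """Sample T distinct triples uniformly at random from the C(n,3) possible triangles."""
    rng = random.Random(seed)
    return rng.sample(list(itertools.combinations(range(n), 3)), T)

def fill_adjacency(n, answers):
    """answers: {(i, j): 0 or 1} elicited from the crowd; unobserved entries stay 0."""
    A = [[0] * n for _ in range(n)]
    for (i, j), a in answers.items():
        A[i][j] = A[j][i] = a
    return A
```

Clustering is then performed on the symmetric, partially observed matrix returned by `fill_adjacency`.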
We consider the following two querying methods:\n\nRandom Edge Query: We sample E edges uniformly at random from the (n choose 2) possible edges. Figure 1(a) shows an example of an edge query. For each edge observation, there are two possible configurations: (1) both items are similar, denoted by ll; (2) the items are not similar, denoted by lm.\n\nRandom Triangle Query: We sample T triangles uniformly at random from the (n choose 3) possible triangles. Figure 1(b) shows an example of a triangle query. For each triangle observation, there are five possible configurations (Figure 2): (1) all items are similar, denoted by lll; (2) items 1 and 2 are similar, denoted by llm; (3) items 1 and 3 are similar, denoted by lml; (4) items 2 and 3 are similar, denoted by mll; (5) none are similar, denoted by lmj.\n\nFigure 2: Configurations for a triangle query that are (a) observed and (b) not allowed.\n\nPr(y|x) | x = lll | x = llm | x = lmj\ny = lll | p^3 + 3p^2(1\u2212p) | pq^2 | q^3\ny = llm | p(1\u2212p)^2 | p(1\u2212q)^2 + (1\u2212p)q^2 + 2pq(1\u2212q) | q(1\u2212q)^2\ny = lml | p(1\u2212p)^2 | (1\u2212p)q(1\u2212q) | q(1\u2212q)^2\ny = mll | p(1\u2212p)^2 | (1\u2212p)q(1\u2212q) | q(1\u2212q)^2\ny = lmj | (1\u2212p)^3 | (1\u2212p)(1\u2212q)^2 | (1\u2212q)^3 + 3q^2(1\u2212q)\n\nTable 1: Query confusion matrix for the triangle block model for the homogeneous case.\n\n1.3 Related Works\n[9, 10, 11, 12, 13, 14] and references therein focus on the problem of inferring true labels from crowdsourced multiclass labeling. The common setup in these problems is as follows: a set of items is shown to workers and labels are elicited from them. Since the workers give noisy answers, each item is labeled by multiple workers. 
Algorithms based on Expectation-Maximization [14] for maximum likelihood estimation and minimax entropy based optimization [12] have been studied for inferring the underlying true labels. In our setup we do not ask the workers to label the items. Instead we use comparisons between items to find the clusters of items that are similar to each other.\n[15] considers the problem of inferring the complete clustering on n images from a large set of clusterings on smaller subsets via crowdsourcing. Each HIT (Human Intelligence Task) is designed such that all of them share a subset of images to ensure overlap. Each HIT has M images and all (M choose 2) comparisons are made. Each HIT is then assigned to multiple workers to get reliable answers. These clusterings are then combined using an algorithm based on variational Bayesian inference. In our work we consider a different setup, where either pairs or triples of images are compared by the crowd to obtain a partial graph on the images, which can then be clustered.\n[16] considers a convex approach to graph clustering with partially observed adjacency matrices, and provides an example of clustering images by crowdsourcing pairwise comparisons. However, it does not consider other types of querying such as triangle queries. In this work, we extend the analysis in [16] and show that a similar performance guarantee holds for clustering via triangle queries.\nAnother interesting line of work is learning embeddings and kernels through triplet comparison tasks in [17, 18, 19, 20, 21, 22] and references therein. The \u2018triplet comparison\u2019 task in these works is of the type: \u2018Is a closer to b or to c?\u2019, with two possible answers, to judge the relative distances between the items. 
On the other hand, a triangle query in our work has five possible answers (Figure 1(b)) that give a clustering (discrete partitioning) of the three items.\n2 Models\nThe probability of observing a particular configuration y is given by: Pr(y) = \u2211_{x \u2208 X} Pr(y|x)Pr(x), where x is the true configuration and X is the set of true configurations. Let Y be the set of all observed configurations. Each query has a |Y| \u00d7 |X| confusion matrix [Pr(y|x)] associated with it. Note that the columns of this confusion matrix sum to 1, i.e., \u2211_{y \u2208 Y} Pr(y|x) = 1.\n2.1 Random Edge Observation Models\nFor the random edge query case, there are two observation configurations, Y = {ll, lm}, where lm denotes \u2018no edge\u2019 and ll denotes \u2018edge\u2019.\nOne-coin Edge Model: Assume all queries are equally hard. Let \u03b6 be the probability of answering a question wrong. Then Pr(ll|ll) = Pr(lm|lm) = 1 \u2212 \u03b6, Pr(lm|ll) = Pr(ll|lm) = \u03b6. This model is inspired by the one-coin Dawid-Skene model [23], which is used in inference for item label elicitation tasks. This is a very simple model and does not capture the difficulty of a query depending on which clusters the items in the query belong to. In order to incorporate these differences we consider the popular Stochastic Block Model (SBM) [24, 25], which is one of the most widely used models for graph clustering.\nStochastic Block Model (SBM): Consider a graph on n nodes with K disjoint clusters and outliers. Any two nodes i and j are connected (independently of other edges) with probability p if they belong to the same cluster and with probability q otherwise. That is, Pr(ll|ll) = p, Pr(lm|ll) = 1 \u2212 p, Pr(ll|lm) = q and Pr(lm|lm) = 1 \u2212 q. 
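A sketch of drawing a single edge under the SBM (the cluster-label encoding and function name are ours, for illustration only):

```python
import random

def sbm_edge(cluster_of_i, cluster_of_j, p, q, rng):
    """Draw one SBM edge: present with probability p if the two items share a
    cluster, with probability q otherwise. Returns 1 (edge) or 0 (no edge)."""
    prob = p if cluster_of_i == cluster_of_j else q
    return 1 if rng.random() < prob else 0

# empirical check of the two edge densities
rng = random.Random(0)
same = sum(sbm_edge(0, 0, 0.8, 0.2, rng) for _ in range(10000)) / 10000
diff = sum(sbm_edge(0, 1, 0.8, 0.2, rng) for _ in range(10000)) / 10000
```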
We assume that the density of edges inside the clusters is higher than that between the clusters, that is, p > q.\n2.2 Random Triangle Observation Models\nFor the triangle query model, there are five possible observation configurations (Figure 2), Y = {lll, llm, lml, mll, lmj}.\nOne-coin Triangle Model: Let each question be answered correctly with probability 1 \u2212 \u03b6, and, when wrongly answered, let all the other configurations be equally confusing. So, Pr(lll|lll) = 1 \u2212 \u03b6 and Pr(llm|lll) = Pr(lml|lll) = Pr(mll|lll) = Pr(lmj|lll) = \u03b6/4, and so on. This model, as in the case of the one-coin model for the edge query, does not capture the differences in difficulty for different clusters. In order to include the differences in confusion between different clusters, we consider the following observation models for a triangle query.\n\nPr(y|x) | x = lll | x = llm | x = lmj\ny = lll | p^3/z_lll | pq^2/z_llm | q^3/z_lmj\ny = llm | p(1\u2212p)^2/z_lll | p(1\u2212q)^2/z_llm | q(1\u2212q)^2/z_lmj\ny = lml | p(1\u2212p)^2/z_lll | (1\u2212p)q(1\u2212q)/z_llm | q(1\u2212q)^2/z_lmj\ny = mll | p(1\u2212p)^2/z_lll | (1\u2212p)q(1\u2212q)/z_llm | q(1\u2212q)^2/z_lmj\ny = lmj | (1\u2212p)^3/z_lll | (1\u2212p)(1\u2212q)^2/z_llm | (1\u2212q)^3/z_lmj\n\nTable 2: Query confusion matrix for the conditional block model for the homogeneous case.\n\nFor the 3 items in a triangle query, the edges are first generated from the SBM. This can give rise to 8 configurations, out of which 5 are allowed as an answer to a triangle query while the remaining 3 are not (Figure 2). The two models differ in how they handle the configurations that are not allowed, and are described below:\nTriangle Block Model (TBM): In this model we assume that a triangle query helps in correctly resolving the configurations that are not allowed. So, when the triangle generated from the SBM takes one of the 3 non-allowed configurations, it is mapped to the true configuration. This gives a
This gives a\n5 \u00d7 5 query confusion matrix which is given in Table 1. Note that the columns for lml and mll can\nbe \ufb01lled in a similar manner to that of llm.\nConditional Block Model (CBM): In this model when a non-allowed con\ufb01guration is encountered,\nit is redrawn again. This is equivalent to conditioning on the allowed con\ufb01gurations. De\ufb01ne the\nnormalizing factors, zlll := 3p3 \u2212 3p2 + 1, zllm := 3pq2 \u2212 2pq \u2212 q2 + 1, zllm := 3q3 \u2212 3q2 + 1 .\nThe 5 \u00d7 5 query confusion matrix which is given in Table 2.\nRemark: Note that the SBM (and hence the derived models) can be made more general by considering\ndifferent edge probabilities Pii for cluster i and Pij = Pji between clusters i (cid:54)= j.\nSome intuitive properties of the triangle query models described in this section are:\n1. If p > q, then the diagonal term will dominate any other term in a row. That is Pr(lll|lll) >\n\nPr(lll|(cid:63) (cid:54)= lll), Pr(llm|llm) > Pr(llm|(cid:63) (cid:54)= llm) and so on.\nPr(lll|lll) > Pr(llm|lll) = Pr(lml|lll) = Pr(mll|lll) > Pr(lmj|lll) etc.\n\n2. If p > 1/2 > q, then the diagonal term will dominate the other terms in the column, i.e,\n\n3. When there is a symmetry between the items, the observation probability should be the same. That\nis, if the true con\ufb01guration is llm, then observing lml and mll should be equally likely as item1\nand item2 belong to the same cluster and so on. This property will hold good in the general case\nas well except for when the true con\ufb01guration is lmj. In this case, the probability of observing\nllm, lml and mll can be different as it depends on the clusters to which items 1, 2 and 3 belong.\n\n2.3 Adjacency Matrix: Edge Densities and Edge Errors\nThe adjacency matrix, A = AT of a graph can be partially \ufb01lled by querying a subset of edges.\nSince we query edges randomly, most of the edges are seen only once. 
Some edges might get queried multiple times, in which case we randomly pick one of the answers. Similarly, we can partially fill the adjacency matrix from triangle queries. We fill the unobserved entries of the adjacency matrix with zeros. We can then perform clustering on A to obtain a partition of the items. The true underlying graph G\u2217 has perfect clusters (disjoint cliques), so the performance of clustering on A depends on how noisy A is. This in turn depends on the probability of error for each revealed edge in A, i.e., the probability that a true edge was registered as a no-edge and vice versa. The hope is that triangle queries help workers resolve the edges better, and hence yield fewer errors among the revealed edges than edge queries do.\nIf we make E edge queries, then the probability of observing an edge is r = E/(n choose 2). If we make T triangle queries, the probability of observing an edge is rT = 3T/(n choose 2). Let rp (rT pT) and rq (rT qT) be the edge probabilities inside the clusters and between the clusters, respectively, in A when it is partially filled via edge (triangle) queries. For simplicity, consider a graph with K clusters of size m each (n = Km). The probability that a randomly chosen edge in A filled via edge queries is in error can be computed as: p_err^edge := (1 \u2212 rp)(m \u2212 1)/(n \u2212 1) + rq(n \u2212 m)/(n \u2212 1). Similarly, we can write p_err^\u2206. Under reasonable conditions on the parameters involved, p_err^\u2206 < p_err^edge.\n\nFigure 3: Fraction of entries in error in the matrix recovered via Program 4.1.\n\nFor example, in the case of the one-coin model, for an edge query, rp = r(1 \u2212 \u03b6) and rq = r\u03b6. For a triangle query, rT pT = rT(1 \u2212 3\u03b6/4) and rT qT = rT \u03b6/2. If rT < 2r, we have rT qT < rq and rT pT > rp, and hence p_err^\u2206 < p_err^edge. For the TBM, when p > 1/2 > q, with r < rT < r/(1 \u2212 q), we get rT pT > rp and rT qT < rq, and hence p_err^\u2206 < p_err^edge. For the CBM, when p > 1/2 > q, under reasonable assumptions on r, rT qT < rq, but depending on the values of r and rT, rT pT can drop below rp. If the decrease in edge probability between the clusters is large enough to overcome the fall in edge density inside the clusters, then p_err^\u2206 < p_err^edge.\nIn summary, when A is filled by triangle queries, the edge density between the clusters decreases and the overall number of edge errors decreases (we observe this in real data as well, see Table 3). Both of these are desirable for clustering algorithms, such as spectral clustering, that try to approximate the minimum cut to find the clusters.\n3 Value of a Query\nTo make a meaningful comparison between edge queries and triangle queries, we need to fix a budget. Suppose we have a budget to make E edge queries. To find the number of triangle queries that can be made with the same budget, we need to define the value (cost) of a triangle query. Although a triangle query has 3 edges, they are not independent, and hence its relative cost is less than that of making 3 random edge queries. Thus we need a fair way to compare the value of a triangle query to that of an edge query.\nLet s \u2208 [0, 1]^|Y|, \u2211_{y \u2208 Y} s_y = 1, be the probability mass function (pmf) of the observation in a query, with s_y := Pr(y) = \u2211_{x \u2208 X} Pr(y|x)Pr(x). We define the value of a query as the information obtained from the observation, which is measured by its entropy: H(s) = \u2212\u2211_{y \u2208 Y} s_y log(s_y). Ideally, the cost of a query should be proportional to the amount of information it provides. 
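To make the entropy-based cost concrete, here is a small sketch. The uniform prior over true configurations is our illustrative assumption; in the paper, Pr(x) comes from the generative model.

```python
import math

def entropy(pmf):
    # H(s) = -sum_y s_y * log(s_y), in nats; outcomes with s_y = 0 contribute 0
    return -sum(s * math.log(s) for s in pmf if s > 0)

def observation_pmf(confusion, prior):
    # confusion[y][x] = Pr(y|x); prior[x] = Pr(x); returns s_y = sum_x Pr(y|x) Pr(x)
    return [sum(row[x] * prior[x] for x in range(len(prior))) for row in confusion]

def triangle_budget(E, H_edge, H_triangle):
    # number of triangle queries affordable with the budget of E edge queries
    return int(E * H_edge / H_triangle)

# one-coin edge model with error probability zeta = 0.3 and a uniform prior
zeta = 0.3
edge_confusion = [[1 - zeta, zeta],
                  [zeta, 1 - zeta]]
H_edge = entropy(observation_pmf(edge_confusion, [0.5, 0.5]))
```

With a symmetric confusion matrix and a uniform prior the observation pmf is uniform, so H_edge equals log 2; the same routine applied to a 5-outcome triangle model yields the H used to convert budgets.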
So, if E is the number of edge queries, then the number of triangle queries we can make with the same budget is: TB = E \u00d7 HE/H\u2206.\nWe should remark that determining the above cost requires knowledge of the generative model of the graph, which may not be available for empirical data sets. In such situations, a very reasonable cost is the relative time it takes a worker to respond to a triangle query, compared to an edge query. (In this manner, a fixed budget means a fixed amount of time for the queries to be completed.) A good rule of thumb, which is widely supported by empirical data, is a relative cost of 1.5, ostensibly because in triangle queries workers need to study three images rather than two, and so it takes them 50% longer to respond. The end result is that, for a fixed budget, triangle queries reveal twice as many edges.\n4 Guaranteed Recovery of the True Adjacency Matrix\nIn this section we provide a sufficient condition for the full recovery of the adjacency matrix corresponding to the underlying true G\u2217 from a partially observed noisy A filled via random triangle queries. We consider the following convex program from [16]:\n\nminimize_{L,S} \u2016L\u2016_\u22c6 + \u03bb\u2016S\u2016_1    (4.1)\ns.t. 1 \u2265 L_{i,j} \u2265 S_{i,j} \u2265 0 for all i, j \u2208 {1, 2, . . . , n}, L_{i,j} = S_{i,j} whenever A_{i,j} = 0, \u2211_{i,j=1}^{n} L_{i,j} \u2265 |R|,\n\nwhere \u2016.\u2016_\u22c6 is the nuclear norm (sum of the singular values of the matrix), \u2016.\u2016_1 is the l1-norm (sum of absolute values of the entries of the matrix), and \u03bb \u2265 0 is the regularization parameter. 
L is the low-rank matrix corresponding to the true cluster structure, S is the sparse error matrix that accounts only for the missing edges inside the clusters, and |R| is the size of the cluster region.\nWhen A is filled using a subset of random edge queries, under the SBM with parameters {n, n_min, K, p, q}, [16] provides the following sufficient condition for the guaranteed recovery of the true G\u2217:\n\nn_min r(p \u2212 q) \u2265 1/\u03bb \u2265 2\u221an \u221a(rq(1 \u2212 rq)) + 2\u221an_max \u221a(rp(1 \u2212 rp) + rq(1 \u2212 rq)),    (4.2)\n\nwhere n_min and n_max are the sizes of the smallest and the largest clusters, respectively. We extend the analysis in [16] to the case when A is filled via a subset of random triangle queries, and obtain the following sufficient condition:\nTheorem 1 If the following condition holds:\n\nn_min rT(pT \u2212 qT) \u2265 1/\u03bb \u2265 3(2\u221an \u221a((rT qT/3)(1 \u2212 rT qT/3)) + 2\u221an_max \u221a((rT pT/3)(1 \u2212 rT pT/3) + (rT qT/3)(1 \u2212 rT qT/3))),\n\nthen Program 4.1 succeeds in recovering the true G\u2217 with high probability.\nWhen A is filled using random edge queries, the entries are independent of each other (since the edges are independent in the SBM). When we use triangle queries to fill A, this no longer holds, as the 3 edges filled from a triangle query are not independent. 
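The key device in the proof sketch that follows, splitting the triangle-filled matrix into three parts whose entries are independent, can be illustrated in code. The dictionary layout and function name below are ours, for illustration only:

```python
import random

def split_triangle_answers(triangle_answers, seed=0):
    """triangle_answers: {(i, j, k): {(i, j): a1, (i, k): a2, (j, k): a3}} with
    0/1 edge answers. Returns three edge dictionaries (A1, A2, A3); each queried
    triangle donates exactly one of its edges to each part, so no part contains
    two edges coming from the same triangle query."""
    rng = random.Random(seed)
    parts = [{}, {}, {}]
    for edges in triangle_answers.values():
        items = list(edges.items())
        rng.shuffle(items)  # random one-edge-per-part allocation
        for part, (edge, answer) in zip(parts, items):
            # if an edge was already revealed by an earlier triangle, keep one answer
            part.setdefault(edge, answer)
    return parts
```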
Due to limited space, we present only the key idea of our proof. The analysis in [16] relies on the independence of the entries of A in order to use Bernstein-type concentration results for sums of independent random variables, and the bound on the spectral norm of a random matrix with independent entries. We make the following observation: split A, filled via random triangle queries, into three parts, A = A1 + A2 + A3. For each triangle query, allocate one edge to each part randomly. If an edge gets queried as part of multiple triangle queries, keep one of them randomly. Each Ai now contains independent entries. The edge density in Ai is rT pT/3 inside the clusters and rT qT/3 outside. This allows us to use the results on concentration of sums of independent random variables and the O(\u221an) bound on the spectral norm of random matrices, with a penalty due to the triangle inequality for the spectral norm.\nIt can be seen that, when the number of revealed edges is the same (rT = r) and the probability of correctly identifying edges is the same (pT = p and 1 \u2212 qT = 1 \u2212 q), the recovery condition of Theorem 1 is worse than that of (4.2). (This is expected, since triangle queries yield dependent edges.) However, this is overcompensated by the fact that triangle queries result in more reliable edges (pT \u2212 qT > p \u2212 q) and also reveal more edges (rT > r, since the relative cost is less than 3).\nTo illustrate this, consider a graph on n = 600 nodes with K = 3 clusters of equal size m = 200. We generate the adjacency matrices from the different models in Section 2, varying p from 0.65 to 0.9. For the one-coin models, 1 \u2212 \u03b6 = p. For the rest of the models, q = 0.25. We run the convex program (4.1) with \u03bb = 1/\u221an. 
Figure 3 shows the fraction of the entries in the recovered matrix that are wrong compared to the true adjacency matrix, for r = 0.2 and 0.3 (averaged over 5 runs; TE = \u2308E/3\u2309 and TB = E \u00d7 HE/H\u2206). We note that the error drops significantly when A is filled via triangle queries rather than via edge queries.\n5 Performance of Spectral Clustering: Simulated Experiments\nWe generate adjacency matrices from the edge query and the triangle query models (Section 2) and run the spectral clustering algorithm [26] on them. We compare the output clustering with the ground truth via variation of information (VI) [27], which is defined for two clusterings (partitions) of a dataset and has an information-theoretic justification. Smaller values of VI indicate a closer match, and a VI of 0 means that the clusterings are identical. We compare the performance of the spectral clustering algorithms on the partial adjacency matrices obtained from querying: (1) E = \u2308r(n choose 2)\u2309 random edges, (2) TB = E \u00d7 HE/H\u2206 random triangles, which has the same budget as querying E edges, and (3) TE = \u2308E/3\u2309 < TB random triangles, which yields the same number of edges as the adjacency matrix obtained by querying E edges.\nVarying Edge Density Inside the Clusters: Consider a graph on n = 450 nodes with K = 3 clusters of equal size m = 150. We vary the edge density inside the clusters, p, from 0.55 to 0.9. For the one-coin models, 1 \u2212 \u03b6 = p, and q = 0.25 for the rest. Figure 4 shows the performance of spectral clustering for r = 0.15 and r = 0.3 (averaged over 5 runs).\n\nFigure 4: VI for Spectral Clustering output for varying edge density inside the clusters.\n\nVarying Cluster Sizes: Let N = 1200. Consider a graph with K clusters of equal sizes m = \u230aN/K\u230b and n = Km. We vary K from 2 to 12, which varies the cluster sizes from 600 (large clusters) to 100 (small clusters; note that \u221a1200 \u2248 35). We set p = 0.7. For the one-coin models 1 \u2212 \u03b6 = p, and q = 0.25 for the rest. Figure 5 shows the performance of spectral clustering for r = 0.2 and 0.3. The performance is significantly better with triangle queries compared to that with edge queries.\n\nFigure 5: VI for Spectral Clustering output for varying number of clusters (K).\n\n6 Experiments on Real Data\nWe use Amazon Mechanical Turk as the crowdsourcing platform. For edge queries, each HIT (Human Intelligence Task) has 30 queries of random pairs; a sample is shown in Figure 1(a). For triangle queries, each HIT has 20 queries, with each query having 3 random images; a sample is shown in Figure 1(b). Each HIT is answered by a unique worker. Note that we do not provide any examples of different classes or any training to do the task. We fill A as described in Section 2.3 and run k-means, Spectral Clustering, and Program 4.1 followed by Spectral Clustering on it. Since we do not know the model parameters, and hence have no access to the entropy information, we use the average time taken as the \u201ccost\u201d or value of a query. For E edge comparisons, the equivalent number of triangle comparisons is T = E \u00d7 tE/t\u2206, where tE and t\u2206 are the average times taken to answer an edge query and a triangle query, respectively. We consider two datasets:\n1. Dogs3 dataset has images of the following 3 breeds of dogs from the Stanford Dogs Dataset [28]: Norfolk Terrier (172), Toy Poodle (150) and Bouvier des Flandres (151), giving a total of 473 dog images. On average, a worker took tE = 8.4s to answer an edge query and t\u2206 = 11.7s to answer a triangle query.\n\n2. 
Birds5 dataset has 5 bird species from the CUB-200-2011 dataset [29]: Laysan Albatross (60), Least Tern (60), Arctic Tern (58), Cardinal (57) and Green Jay (57). We also add 50 random species as outliers, giving us a total of 342 bird images. On average, workers took tE = 8.3s to answer an edge query and t\u2206 = 12.1s to answer a triangle query.\nDetails of the data obtained from the edge query and triangle query experiments are summarized in Table 3. Note that the error in the revealed edges drops significantly for triangle queries.\n\nQuery (E: Edge, T: \u2206) | # Workers | # Unique Edges | % of Edges Seen | % of Edge Errors\nDogs3, Edge Query | 300 | E' = 8630 | 7.73% | 25.2%\nDogs3, \u2206 Query | 150 | 3T'_E = 8644 | 7.74% | 19.66%\nDogs3, \u2206 Query | 320 | 3T' = 17,626 | 15.79% | 20%\nBirds5, Edge Query | 300 | E' = 8319 | 14.27% | 14.82%\nBirds5, \u2206 Query | 155 | 3T'_E = 8600 | 14.74% | 10.96%\nBirds5, \u2206 Query | 285 | 3T' = 14,773 | 25.34% | 11.4%\n\nTable 3: Summary of the data collected in the real experiments.\n\nFor the Dogs3 dataset, the empirical edge densities inside and between the clusters for A obtained from the edge queries (\u02c6PE) and the triangle queries (\u02c6PT) are:\n\n\u02c6PE = [0.7577 0.1866 0.2043; 0.1866 0.6117 0.2487; 0.2043 0.2487 0.7391], \u02c6PT = [0.7139 0.1138 0.1253; 0.1138 0.6231 0.1760; 0.1253 0.1760 0.7576].\n
Query (E: Edge, T: ∆)   k-means                  Spectral Clustering     Convex Program
E′ = 8630               0.8374 ± 0.0121 (K=2)    0.6972 ± 0 (K=3)        0.5176 ± 0 (K=3)
3T′_E = 8644            0.6675 ± 0.0246 (K=3)    0.5690 ± 0 (K=3)        0.4605 ± 0 (K=3)
3T′ = 17,626            0.3268 ± 0 (K=3)         0.3470 ± 0 (K=3)        0.2279 ± 0 (K=3)

Table 4: VI for the clustering output by k-means, spectral clustering and the convex program for the Dogs3 dataset.

Query (E: Edge, T: ∆)   k-means                  Spectral Clustering     Convex Program
E′ = 8319               1.4504 ± 0.0338 (K=2)    1.2936 ± 0.0040 (K=4)   1.0392 ± 0 (K=4)
3T′_E = 8600            1.1793 ± 0.0254 (K=3)    1.1299 ± 0 (K=4)        0.9105 ± 0 (K=4)
3T′ = 14,773            0.7989 ± 0 (K=4)         0.8713 ± 0 (K=4)        0.9135 ± 0 (K=4)

Table 5: VI for the clustering output by k-means, spectral clustering and the convex program for the Birds5 dataset.

For the Birds5 dataset, the empirical edge densities within and between the various clusters in A filled via edge queries (P̂E) and triangle queries (P̂T) are:

       [0.801  0.304  0.208  0.016  0.032  0.100]
       [0.304  0.778  0.656  0.042  0.131  0.123]
P̂E =  [0.208  0.656  0.912  0.062  0.094  0.096] ,
       [0.016  0.042  0.062  0.855  0.154  0.110]
       [0.032  0.131  0.094  0.154  0.958  0.158]
       [0.100  0.123  0.096  0.110  0.158  0.224]

       [0.786  0.207  0.151  0.011  0.021  0.058]
       [0.207  0.797  0.625  0.023  0.047  0.100]
P̂T =  [0.151  0.625  0.865  0.024  0.060  0.071] .
       [0.011  0.023  0.024  0.874  0.059  0.078]
       [0.021  0.047  0.060  0.059  0.943  0.080]
       [0.058  0.100  0.071  0.076  0.080  0.182]

As we can see, the triangle queries give rise to an adjacency matrix with significantly less confusion across the clusters (compare the off-diagonal entries of P̂E and P̂T).

Tables 4 and 5 show the performance of the clustering algorithms (in terms of variation of information) for the two datasets. The number of clusters found is given in parentheses. We note that for both datasets the performance is significantly better with triangle queries than with edge queries. Furthermore, even with fewer triangle queries (3T′_E ≈ E′) than the budget allows, the clustering obtained is better than with edge queries.

7 Summary

In this work we compare two ways of querying for crowdsourced clustering using non-experts: random edge comparisons and random triangle comparisons. We provide simple and intuitive models for both. Compared to edge queries, which reveal independent entries of the adjacency matrix, triangle queries reveal dependent ones (edges in a triangle share a vertex). However, due to their error-correcting capabilities, triangle queries result in more reliable edges and, furthermore, because the cost of a triangle query is less than that of 3 edge queries, for a fixed budget, triangle queries reveal many more edges. Simulations based on our models, as well as empirical evidence, strongly support these facts.
In particular, experiments on two real datasets suggest that clustering items from random triangle queries significantly outperforms clustering from random edge queries when the total query budget is fixed. We also provide a theoretical guarantee for the exact recovery of the true adjacency matrix using random triangle queries. In the future we will focus on exploiting the structure of triangle queries via tensor representations and sketches, which might further improve the clustering performance.

References

[1] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. J. Mach. Learn. Res., 11:1297–1322, August 2010.

[2] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 254–263, 2008.

[3] Luis Von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum. reCAPTCHA: Human-based character recognition via web security measures. Science, 321(5895):1465–1468, 2008.

[4] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Computer Vision and Pattern Recognition Workshops, CVPRW '08, IEEE Computer Society Conference on, pages 1–8. IEEE, June 2008.

[5] Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. The multidimensional wisdom of crowds. In Neural Information Processing Systems Conference (NIPS), 2010.

[6] Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain, and Tianbao Yang. Semi-crowdsourced clustering: Generalizing crowd labeling by robust distance metric learning. In Neural Information Processing Systems Conference (NIPS), 2012.

[7] Robert Simpson, Kevin R. Page, and David De Roure.
Zooniverse: Observing the world's largest citizen science platform. In Proceedings of the 23rd International Conference on World Wide Web, WWW '14 Companion, 2014.

[8] Chris Lintott, Megan E. Schwamb, Charlie Sharzer, Debra A. Fischer, Thomas Barclay, Michael Parrish, Natalie Batalha, Steve Bryson, Jon Jenkins, Darin Ragozzine, Jason F. Rowe, Kevin Schawinski, Robert Gagliano, Joe Gilardi, Kian J. Jek, Jari-Pekka Pääkkönen, and Tjapko Smits. Planet hunters: New Kepler planet candidates from analysis of quarter 2. arXiv:1202.6007, 2012. Submitted to AJ.

[9] David R. Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In Neural Information Processing Systems Conference (NIPS), 2011.

[10] David R. Karger, Sewoong Oh, and Devavrat Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.

[11] Aditya Vempaty, Lav R. Varshney, and Pramod K. Varshney. Reliable crowdsourcing for multi-class labeling using coding theory. CoRR, abs/1309.3330, 2013.

[12] Denny Zhou, Sumit Basu, Yi Mao, and John C. Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems 25, pages 2195–2203, 2012.

[13] Qiang Liu, Jian Peng, and Alexander T. Ihler. Variational inference for crowdsourcing. In Neural Information Processing Systems Conference (NIPS), 2012.

[14] Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Neural Information Processing Systems Conference (NIPS), 2014.

[15] Ryan G. Gomes, Peter Welinder, Andreas Krause, and Pietro Perona. Crowdclustering. In Advances in Neural Information Processing Systems 24, pages 558–566, 2011.

[16] Ramya Korlakai Vinayak, Samet Oymak, and Babak Hassibi.
Graph clustering with missing data: Convex algorithms and analysis. In Neural Information Processing Systems Conference (NIPS), 2014.

[17] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. CoRR, abs/1105.1033, 2011.

[18] Michael Wilber, Sam Kwak, and Serge Belongie. Cost-effective HITs for relative similarity comparisons. In Human Computation and Crowdsourcing (HCOMP), Pittsburgh, November 2014.

[19] Eric Heim, Hamed Valizadegan, and Milos Hauskrecht. Relative comparison kernel learning with auxiliary kernels. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2014, pages 563–578. Springer Berlin Heidelberg.

[20] L. van der Maaten and K. Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1–6, Sept 2012.

[21] Catherine Wah, Grant Van Horn, Steve Branson, Subhransu Maji, Pietro Perona, and Serge Belongie. Similarity comparisons for interactive fine-grained categorization. In CVPR, pages 859–866. IEEE, 2014.

[22] Hannes Heikinheimo and Antti Ukkonen. The crowd-median algorithm. In HCOMP. AAAI, 2013.

[23] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28, 1979.

[24] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[25] Anne Condon and Richard M. Karp. Algorithms for graph partitioning on the planted partition model. Random Struct. Algorithms, 18(2):116–140, 2001.

[26] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856.
MIT Press, 2001.

[27] Marina Meila. Comparing clusterings—an information based distance. J. Multivar. Anal., 98(5):873–895, May 2007.

[28] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.

[29] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.