{"title": "Query Complexity of Clustering with Side Information", "book": "Advances in Neural Information Processing Systems", "page_first": 4682, "page_last": 4693, "abstract": "Suppose, we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form, ``do two elements $u$ and $v$ belong to the same cluster?''. The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we provide a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and give strong information theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. To improve accuracy of clustering, a fruitful approach in recent years has been to ask a domain expert or crowd to obtain labeled data interactively. Many heuristics have been proposed, and all of these use a similarity function to come up with a querying strategy. Even so, there is a lack systematic theoretical study. Our main contribution in this paper is to show the dramatic power of side information aka similarity matrix on reducing the query complexity of clustering. A similarity matrix represents noisy pair-wise relationships such as one computed by some function on attributes of the elements. A natural noisy model is where similarity values are drawn independently from some arbitrary probability distribution $f_+$ when the underlying pair of elements belong to the same cluster, and from some $f_-$ otherwise. We show that given such a similarity matrix, the query complexity reduces drastically from $\\Theta(nk)$ (no similarity matrix) to $O(\\frac{k^2\\log{n}}{\\cH^2(f_+\\|f_-)})$ where $\\cH^2$ denotes the squared Hellinger divergence. Moreover, this is also information-theoretic optimal within an $O(\\log{n})$ factor. 
Our algorithms are all efficient, and parameter free, i.e., they work without any knowledge of $k, f_+$ and $f_-$, and only depend logarithmically on $n$.", "full_text": "Query Complexity of Clustering with Side Information\n\nArya Mazumdar and Barna Saha\n\nCollege of Information and Computer Sciences\n\nUniversity of Massachusetts Amherst\n\nAmherst, MA 01003\n\n{arya,barna}@cs.umass.edu\n\nAbstract\n\nSuppose we are given a set of n elements to be clustered into k (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form, “do two elements u and v belong to the same cluster?”. The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we provide a rigorous theoretical study of this basic problem of the query complexity of interactive clustering, and give strong information-theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. However, obtaining an ideal similarity function is extremely challenging due to ambiguity in data representation, poor data quality, etc., and this is one of the primary reasons that makes clustering hard. To improve the accuracy of clustering, a fruitful approach in recent years has been to ask a domain expert or crowd to obtain labeled data interactively. Many heuristics have been proposed, and all of these use a similarity function to come up with a querying strategy. Even so, there is a lack of systematic theoretical study. Our main contribution in this paper is to show the dramatic power of side information, aka the similarity matrix, in reducing the query complexity of clustering. 
A similarity matrix represents noisy pair-wise relationships, such as one computed by some function on attributes of the elements. A natural noisy model is where similarity values are drawn independently from some arbitrary probability distribution f+ when the underlying pair of elements belong to the same cluster, and from some f− otherwise. We show that given such a similarity matrix, the query complexity reduces drastically from Θ(nk) (no similarity matrix) to O(k² log n / H²(f+‖f−)), where H² denotes the squared Hellinger divergence. Moreover, this is also information-theoretically optimal within an O(log n) factor. Our algorithms are all efficient, and parameter free, i.e., they work without any knowledge of k, f+ and f−, and only depend logarithmically on n. Our lower bounds could be of independent interest, and provide a general framework for proving lower bounds for classification problems in the interactive setting. Along the way, our work also reveals an intriguing connection to popular community detection models such as the stochastic block model, and opens up many avenues for interesting future research.\n\n1 Introduction\n\nClustering is one of the most fundamental and popular methods for data classification. In this paper we provide a rigorous theoretical study of clustering with the help of an oracle, a model that has seen a recent surge of popular heuristic algorithms.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nSuppose we are given a set of n points that need to be clustered into k clusters, where k is unknown to us. Suppose there is an oracle that either knows the true underlying clustering or can compute the best clustering under some optimization constraints. We are allowed to query the oracle whether any two points belong to the same cluster or not. What is the minimum number of such queries needed to perform the clustering exactly? 
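Without side information there is a simple strategy (discussed further below) that uses at most nk queries: for each new element, ask one query per existing cluster, relying on the transitivity of the “same cluster” relation. A minimal sketch in Python, with a hypothetical oracle callback standing in for the expert (the interface is our own illustration, not part of the paper):

```python
def cluster_with_oracle(elements, same_cluster):
    """Baseline O(nk) interactive clustering: query one representative
    per existing cluster; transitivity makes one query per cluster enough."""
    clusters = []      # each cluster is a list of elements
    num_queries = 0
    for v in elements:
        placed = False
        for c in clusters:
            num_queries += 1
            if same_cluster(v, c[0]):   # ask the oracle about one representative
                c.append(v)
                placed = True
                break
        if not placed:
            clusters.append([v])        # all answers negative: start a new cluster
    return clusters, num_queries
```

Each element is compared against at most k existing clusters, giving at most nk queries in total, matching the discussion below even when k is unknown.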
The motivation for this problem lies at the heart of modern machine learning applications, where the goal is to facilitate more accurate learning from less data by interactively asking for labeled data, e.g., in active learning and crowdsourcing. Specifically, automated clustering algorithms that rely just on a similarity matrix often return inaccurate results, whereas obtaining a few labeled examples adaptively can significantly improve accuracy. Coupled with this observation, clustering with an oracle has generated tremendous interest in the last few years, with an increasing number of heuristics developed for this purpose [22, 40, 13, 42, 43, 18, 39, 12, 21, 29]. The number of queries is a natural measure of “efficiency” here, as it directly relates to the amount of labeled data or the cost of using crowd workers; however, theoretical guarantees on query complexity are lacking in the literature.\n\nOn the theoretical side, query complexity, or decision tree complexity, is a classical model of computation that has been extensively studied for different problems [16, 4, 8]. For the clustering problem, one can easily obtain an upper bound of O(nk) on the query complexity, and it is achievable even when k is unknown [40, 13]: to cluster an element at any stage of the algorithm, ask one query per existing cluster with this element (this is sufficient due to transitivity), and start a new cluster if all queries are negative. It turns out that Ω(nk) is also a lower bound, even for randomized algorithms (see, e.g., [13]). In contrast, the heuristics developed in practice often ask significantly fewer queries than nk. What could be a possible reason for this deviation between theory and practice? Before delving into this question, let us look at a motivating application that drives this work.\n\nA Motivating Application: Entity Resolution. 
Entity resolution (ER, also known as record linkage) is a fundamental problem in data mining and has been studied since 1969 [17]. The goal of ER is to identify and link/group different manifestations of the same real-world object, e.g., different ways of addressing (names, email addresses, Facebook accounts) the same person, Web pages with different descriptions of the same business, different photos of the same object, etc. (see the excellent survey by Getoor and Machanavajjhala [20]). However, the lack of an ideal similarity function to compare objects makes ER an extremely challenging task. For example, DBLP, the popular computer science bibliography dataset, is filled with ER errors [30]. It is common for DBLP to merge publication records of different persons if they share similar attributes (e.g., the same name), or to split the publication record of a single person due to slight differences in representation (e.g., Marcus Weldon vs. Marcus K. Weldon).\n\nIn recent years, a popular trend to improve ER accuracy has been to incorporate human wisdom. The works of [42, 43, 40] (and many subsequent works) use a computer-generated similarity matrix to come up with a collection of pair-wise questions that are asked interactively to a crowd. The goal is to minimize the number of queries to the crowd while maximizing accuracy. This is analogous to our interactive clustering framework. But intriguingly, as shown by extensive experiments on various real datasets, these heuristics use far fewer queries than nk [42, 43, 40], notwithstanding the Ω(nk) theoretical lower bound. On close scrutiny, we find that all of these heuristics use some computer-generated similarity matrix to guide the selection of queries. 
Could these similarity matrices, aka side information, be the reason behind the deviation and the significant reduction in query complexity?\n\nLet us call this clustering using side information, where the clustering algorithm has access to a similarity matrix. This can be generated directly from the raw data (e.g., by applying Jaccard similarity on the attributes), or using a crude classifier which is trained on a very small set of labelled samples. Let us assume the following generative model of side information: a noisy weighted upper-triangular similarity matrix W = {wi,j}, 1 ≤ i < j ≤ n, where wi,j is drawn from a probability distribution f+ if i and j belong to the same cluster, and from f− otherwise. However, the algorithm designer is given only the similarity matrix, without any information on f+ and f−. In this work, one of our major contributions is to show the separation in query complexity of clustering with and without such side information. Indeed, the recent works of [18, 33] analyze popular heuristic algorithms of [40, 43], where the probability distributions are obtained from real datasets, and show that these heuristics are significantly suboptimal even for very simple distributions. To the best of our knowledge, before this work, there existed no algorithm that works for arbitrary unknown distributions f+ and f− with near-optimal performance. We develop a generic framework for proving information-theoretic lower bounds for interactive clustering using side information, and design efficient algorithms for arbitrary f+ and f− that nearly match the lower bound. Moreover, our algorithms are parameter free, that is, they work without any knowledge of f+, f− or k.\n\nConnection to popular community detection models. 
The model of side information considered in this paper is a direct and significant generalization of the planted partition model, also known as the stochastic block model (SBM) [28, 15, 14, 2, 1, 25, 24, 11, 36]. The stochastic block model is an extremely well-studied model of random graphs which is used for modeling communities in the real world, and is a special case of the similarity matrices we consider. In SBM, two vertices within the same community share an edge with probability p, and two vertices in different communities share an edge with probability q; that is, f+ is Bernoulli(p) and f− is Bernoulli(q). It is often assumed that k, the number of communities, is a constant (e.g., k = 2 is known as the planted bisection model and is studied extensively [1, 36, 15]) or a slowly growing function of n (e.g., k = o(log n)). The points are assigned to clusters according to a probability distribution indicating the relative sizes of the clusters. In contrast, in our model not only can f+ and f− be arbitrary probability mass functions (pmfs), we do not have to make any assumption on k or the cluster size distribution, and can allow for any partitioning of the set of elements (i.e., an adversarial setting). Moreover, f+ and f− are unknown. For SBM, parameter free algorithms are known only relatively recently, for a constant number of linear sized clusters [3, 24].\n\nThere is extensive literature that characterizes the threshold phenomenon in SBM in terms of p and q for exact and approximate recovery of clusters when relative cluster sizes are known and nearly balanced (e.g., see [2] and references therein). For k = 2 and equal sized clusters, sharp thresholds are derived in [1, 36] for a specific sparse region of p and q.¹ 
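For intuition, the generative model of side information above, with SBM as its Bernoulli special case, can be simulated directly; the function names and sampling interface below are our own illustrative assumptions, not fixed by the model:

```python
import random

def side_information_matrix(labels, sample_fplus, sample_fminus, seed=0):
    """Simulate W = {w_ij : 1 <= i < j <= n}: w_ij ~ f+ if i and j share a
    cluster, and w_ij ~ f- otherwise; the algorithm sees only W."""
    rng = random.Random(seed)
    n = len(labels)
    return {(i, j): (sample_fplus(rng) if labels[i] == labels[j]
                     else sample_fminus(rng))
            for i in range(n) for j in range(i + 1, n)}

# SBM is the special case f+ = Bernoulli(p), f- = Bernoulli(q):
def bernoulli(p):
    return lambda rng: 1 if rng.random() < p else 0
```

Arbitrary discrete f+ and f− fit the same interface by passing a different sampler, which is exactly the generality the model allows beyond SBM.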
In a more general setting, the vertices in the ith and the jth communities are connected with probability qij, and threshold results for the sparse region have been derived in [2]; our model can be allowed to have this as a special case when we have pmfs fi,j denoting the distributions of the corresponding random variables. If an oracle gives us some of the pairwise binary relations between elements (whether they belong to the same cluster or not), the threshold of SBM must also change. But by what amount? This connection to SBM could be of independent interest for studying the query complexity of interactive clustering with side information, and our work opens up many possibilities for future directions.\n\nDeveloping lower bounds in the interactive setting appears to be significantly challenging, as algorithms may choose to get any deterministic information adaptively by querying, and standard lower bounding techniques based on Fano-type inequalities [9, 31] do not apply. One of our major contributions in this paper is to provide a general framework for proving information-theoretic lower bounds for interactive clustering algorithms, which holds even for randomized algorithms, and even with full knowledge of f+, f− and k. In contrast, our algorithms are computationally efficient and parameter free (they work without knowing f+, f− and k). The technique that we introduce for our upper bounds could be useful for designing further parameter free algorithms, which are extremely important in practice.\n\nOther Related works. The interactive framework of the clustering model has been studied before where the oracle is given the entire clustering, and the oracle can answer whether a cluster needs to be split or two clusters must be merged [7, 6]. Here we confine our attention to pair-wise queries, as in all practical applications that motivate this work [42, 43, 22, 40]. In most cases, an expert human or a crowd serves as the oracle. 
Due to the scale of the data, it is often not possible for such an oracle to answer queries on a large number of input pairs. Only recently, some heuristic algorithms with k-wise queries for small values of k, k > 2, have been proposed in [39], and a non-interactive algorithm that selects random triangle queries has been analyzed in [41]. Also recently, the stochastic block model with active label-queries has been studied in [19]. Perhaps conceptually closest to us is a recent work by [5], where they consider pair-wise queries for clustering. However, their setting is very different. They consider the specific NP-hard k-means objective with a distance matrix which must be a metric and must satisfy a deterministic separation property. Their lower bounds are computational and not information theoretic; moreover, their algorithm must know the parameters. There exists a significant gap between their lower and upper bounds: ∼log k vs. k², and it would be interesting if our techniques can be applied to improve this.\n\nHere we have assumed the oracle always returns the correct answer. To deal with the possibility that the crowdsourced oracle may give wrong answers, there are simple majority voting mechanisms or more complicated techniques [39, 12, 21, 29, 10, 41] to handle such errors. Our main objective is to study the power of side information, and we do not consider the more complex scenarios of handling erroneous oracle answers. The related problem of clustering with noisy queries is studied by us in a companion work [34]. Most of the results of the two papers are available online in a more extensive version [32].\n\n¹Most recent works consider the region of interest as p = a log n/n and q = b log n/n for some a > b > 0.\n\nContributions. Formally, the problem we study in this paper can be described as follows.\n\nProblem 1 (Query-Cluster with an Oracle). 
Consider a set of elements V ≡ [n] with k latent clusters Vi, i = 1, . . . , k, where k is unknown. There is an oracle O : V × V → {±1} that, when queried with a pair of elements u, v ∈ V × V, returns +1 iff u and v belong to the same cluster, and −1 iff u and v belong to different clusters. The queries Q ⊆ V × V can be made adaptively. Consider the side information W = {wu,v : 1 ≤ u < v ≤ n}, where the (u, v)th entry of W, wu,v, is a random variable drawn from a discrete probability distribution f+ if u, v belong to the same cluster, and is drawn from a discrete² probability distribution f−³ if u, v belong to different clusters. The parameters k, f+ and f− are unknown. Given V and W, find Q ⊆ V × V such that |Q| is minimum, and from the oracle answers and W it is possible to recover Vi, i = 1, 2, . . . , k.\n\nWithout side information, as noted earlier, it is easy to see an algorithm with query complexity O(nk) for Query-Cluster. When no side information is available, it is also not difficult to obtain a lower bound of Ω(nk) on the query complexity. Our main contributions are to develop strong information-theoretic lower bounds as well as nearly matching upper bounds when side information is available, and to characterize the effect of side information on query complexity precisely.\n\nUpper Bound (Algorithms). We show that with side information W, a drastic reduction in the query complexity of clustering is possible, even with unknown parameters f+, f−, and k. We propose a Monte Carlo randomized algorithm that reduces the number of queries from O(nk) to O(k² log n / H²(f+‖f−)), where H(f‖g) is the Hellinger divergence between the probability distributions f and g, and that recovers the clusters accurately with high probability (with success probability 1 − 1/n) without knowing f+, f− or k (see Theorem 1). 
Depending on the value of k, this could be highly sublinear in n. Note that the squared Hellinger divergence between two pmfs f and g is defined to be\n\nH²(f‖g) = (1/2) Σᵢ (√f(i) − √g(i))².\n\nWe also develop a Las Vegas algorithm, that is, one which recovers the clusters with probability 1 (and not just with high probability), with query complexity O(n log n + k² log n / H²(f+‖f−)). Since f+ and f− can be arbitrary, not knowing the distributions provides a major challenge, and we believe our recipe could be fruitful for designing further parameter-free algorithms. We note that all our algorithms are computationally efficient; in fact, the time required is bounded by the size of the side information matrix, i.e., O(n²).\n\nTheorem 1. Let the number of clusters k be unknown, and let f+ and f− be unknown discrete distributions with fixed cardinality of support. There exists an efficient (polynomial-time) Monte Carlo algorithm for Query-Cluster that has query complexity O(min(nk, k² log n / H²(f+‖f−))) and recovers all the clusters accurately with probability 1 − o(1/n). Moreover, there exists an efficient Las Vegas algorithm that with probability 1 − o(1/n) has query complexity O(n log n + min(nk, k² log n / H²(f+‖f−))).\n\nLower Bound. Our main lower bound result is information theoretic, and can be summarized in the following theorem. Note especially that for the lower bound we can assume the knowledge of k, f+, f−, in contrast to the upper bounds, which makes the results stronger. In addition, f+ and f− can be discrete or continuous distributions. Note that when H²(f+‖f−) is close to 1, e.g., when the side information is perfect, no queries are required. However, that is not the case in practice, and we are interested in the regime where f+ and f− are “close”, that is, H²(f+‖f−) is small.\n\nTheorem 2. Assume H²(f+‖f−) ≤ 1/18. Any (possibly randomized) algorithm with the knowledge of f+, f−, and the number of clusters k, that does not perform Ω(min{nk, k²/H²(f+‖f−)}) expected number of queries, will be unable to return the correct clustering with probability at least 1/6 − O(1/√k). And to recover the clusters with probability 1, the number of queries must be Ω(n + min{nk, k²/H²(f+‖f−)}).\n\n²Our lower bound holds for continuous distributions as well.\n³For simplicity of expression, we treat the sample space to be of constant size. However, all our results extend to any finite sample space, scaling linearly with its size.\n\nThe lower bound therefore matches the query complexity upper bound within a logarithmic factor. Note that when no querying is allowed, this turns out to be exactly the setting of the stochastic block model, though with much more general distributions. We have analyzed this case in Appendix C. To see how the probability of error must scale, we have used a generalized version of Fano's inequality (e.g., [23]). However, when the number of queries is greater than zero, and moreover when queries can be adaptive, any such standard technique fails. Hence, significant effort has to be put forth to construct a setting where information theoretic minimax bounds can be applied. This lower bound could be of independent interest, and provides a general framework for deriving lower bounds for fundamental problems of classification, hypothesis testing, distribution testing, etc., in the interactive learning setting. It may also lead to new lower bound proving techniques in the related multi-round communication complexity model, where information again gets revealed adaptively.\n\nOrganization. The proof of the lower bound is provided in Section 2. The Monte Carlo algorithm is
given in Section 3. The detailed proof of the Monte Carlo algorithm, and the Las Vegas algorithm and its proof, are given in Appendix A and Appendix B respectively in the supplementary material due to space constraints.\n\n2 Lower Bound (Proof of Theorem 2)\n\nIn this section, we develop our information theoretic lower bounds. We prove a more general result from which Theorem 2 follows.\n\nLemma 1. Consider the case when we have k equally sized clusters of size a each (that is, the total number of elements is n = ka). Suppose we are allowed to make at most Q adaptive queries to the oracle. The probability of error for any algorithm for Query-Cluster is at least\n\n1 − (2/k)(1 + √(4Q/(ak)))² − 4Q/(ak(k−1)) − 2√a · H(f+‖f−).\n\nThe main high-level technique to prove Lemma 1 is the following. Suppose a node is to be assigned to a cluster. This situation is obviously akin to a k-hypothesis testing problem, and we want to use a lower bound on the probability of error. The side information and the query answers constitute a random vector whose distributions (among the k possible) must be far apart for us to successfully identify the clustering. But the main challenge comes from the interactive nature of the algorithm, since it reveals deterministic information, and from characterizing the set of elements that are not queried much by the algorithm.\n\nProof of Lemma 1. Since the total number of queries is Q, the average number of queries per element is at most 2Q/(ak). Therefore there exist at least ak/2 elements that are queried at most T < 4Q/(ak) times. Let x be one such element. We just consider the problem of assignment of x to a cluster (all other elements have been correctly assigned already), and show that any algorithm will make a wrong assignment with positive probability.\n\nStep 1: Setting up the hypotheses. 
Note that the side information matrix W = (wi,j) is provided, where the wi,j are independent random variables. Now assume the scenario when we use an algorithm ALG to assign x to one of the k clusters, Vu, u = 1, . . . , k. Therefore, given x, ALG takes as input the random variables wi,x, i ∈ ⊔t Vt, makes some queries involving x, and outputs a cluster index, which is an assignment for x. Based on the observations wi,x, the task of ALG is thus a multi-hypothesis testing among k hypotheses. Let Hu, u = 1, . . . , k, denote the k different hypotheses Hu : x ∈ Vu. And let Pu, u = 1, . . . , k, denote the joint probability distributions of the random matrix W when x ∈ Vu. In short, for any event A, Pu(A) = Pr(A|Hu). Going forward, the subscript of probabilities or expectations will denote the appropriate conditional distribution.\n\nStep 2: Finding “weak” clusters. There must exist t ∈ {1, . . . , k} such that\n\nΣ_{v=1}^{k} Pt{a query made by ALG involving cluster Vv} ≤ Et{Number of queries made by ALG} ≤ T.\n\nWe now find a subset of clusters that are “weak,” i.e., not queried enough if Ht were true. Consider the set J′ ≡ {v ∈ {1, . . . , k} : Pt{a query made by ALG involving cluster Vv} < 2T/(k(1−β))}, where β ≡ 1/(1 + √(4Q/(ak))). We must have (k − |J′|) · 2T/(k(1−β)) ≤ T, which implies |J′| ≥ (1+β)k/2.\n\nNow, to output a cluster without using the side information, ALG has to either make a query to the actual cluster the element is from, or query at least k − 1 times. In any other case, ALG must use the side information (in addition to using queries) to output a cluster. Let E^u denote the event that ALG outputs cluster Vu by using the side information. Let J″ ≡ {u ∈ {1, . . . , k} : Pt(E^u) ≤ 2/(βk)}. Since Σ_{u=1}^{k} Pt(E^u) ≤ 1, we must have (k − |J″|) · 2/(βk) < 1, or |J″| > k − βk/2 = (2−β)k/2. We have |J′ ∩ J″| > (1+β)k/2 + (2−β)k/2 − k = k/2. This means {Vu : u ∈ J′ ∩ J″} contains more than ak/2 elements. Since there are ak/2 elements that are queried at most T times, these two sets must have nonzero intersection. Hence, we can assume that x ∈ Vℓ for some ℓ ∈ J′ ∩ J″, i.e., let Hℓ be the true hypothesis. Now we characterize the error events of the algorithm ALG in the assignment of x.\n\nStep 3: Characterizing error events for “x”. We now consider the following two events. E1 = {a query made by ALG involving cluster Vℓ}; E2 = {k − 1 or more queries were made by ALG}. Note that if the algorithm ALG can correctly assign x to a cluster without using the side information, then either E1 or E2 must happen. Recall, E^ℓ denotes the event that ALG outputs cluster Vℓ using the side information. Now consider the event E ≡ E^ℓ ∪ E1 ∪ E2. 
The probability of correct assignment is at most Pℓ(E). We now bound this probability of correct recovery from above.\n\nStep 4: Bounding probability of correct recovery via Hellinger distance. We have\n\nPℓ(E) ≤ Pt(E) + |Pℓ(E) − Pt(E)| ≤ Pt(E) + ‖Pℓ − Pt‖TV ≤ Pt(E) + √2 H(Pℓ‖Pt),\n\nwhere ‖P − Q‖TV ≡ sup_A |P(A) − Q(A)| denotes the total variation distance between two probability distributions P and Q, and in the last step we have used the relationship between the total variation distance and the Hellinger divergence (see, for example, [38, Eq. (3)]). Now, recall that Pℓ and Pt are the joint distributions of the independent random variables wi,x, i ∈ ∪u Vu. We use the fact that the squared Hellinger divergence between product distributions of independent random variables is at most the sum of the squared Hellinger divergences between the individual distributions. We also note that the divergence between identical random variables is 0. We obtain\n\n√2 H(Pℓ‖Pt) ≤ √(2 · 2a H²(f+‖f−)) = 2√a · H(f+‖f−).\n\nThis is true because the only times when wi,x differs under Pt and under Pℓ are when i ∈ Vt or i ∈ Vℓ. As a result we have Pℓ(E) ≤ Pt(E) + 2√a · H(f+‖f−). Now, using the Markov inequality, Pt(E2) ≤ T/(k−1) ≤ 4Q/(ak(k−1)). Therefore,\n\nPt(E) ≤ Pt(E^ℓ) + Pt(E1) + Pt(E2) ≤ 2/(βk) + 8Q/(ak²(1−β)) + 4Q/(ak(k−1)).\n\nTherefore, putting in the value of β, we get Pℓ(E) ≤ (2/k)(1 + √(4Q/(ak)))² + 4Q/(ak(k−1)) + 2√a · H(f+‖f−), which proves the lemma.\n\nProof of Theorem 2. Consider two cases. In the first case, suppose nk < k²/(9H²(f+‖f−)). Consider the situation of Lemma 1, with a = n/k; then 2√a · H(f+‖f−) ≤ 2/3, and the probability of error of any algorithm must be at least 1/6 − O(1/√k) if the number of queries Q ≤ nk/72.\n\nIn the second case, suppose nk ≥ k²/(9H²(f+‖f−)). Assume a = ⌊1/(9H²(f+‖f−))⌋. Then a ≥ 2, since H²(f+‖f−) ≤ 1/18. We have nk ≥ k²a. Consider the situation when we are already given a complete cluster Vk with n − (k−1)a elements, the remaining (k−1) clusters each have 1 element, and the rest of the (a−1)(k−1) elements are evenly distributed (but yet to be assigned) to the k − 1 clusters. Now we are exactly in the situation of Lemma 1, with k − 1 playing the role of k. If we have Q < ak²/72, the probability of error is at least 1/6 − O(1/√k). Therefore Q must be Ω(k²/H²(f+‖f−)). Note that in this proof we have not particularly tried to optimize the constants.\n\nIf we want to recover the clusters with probability 1, then Ω(n) is a trivial lower bound. Hence, coupled with the above, we get a lower bound of Ω(n + min{nk, k²/H²(f+‖f−)}) in that case.\n\n3 Algorithms\n\nWe propose two algorithms (Monte Carlo and Las Vegas), both of which are completely parameter free, that is, they work without any knowledge of k, f+ and f−, and meet the respective lower bounds within an O(log n) factor. Here we present the Monte Carlo algorithm, which drastically reduces the number of queries from O(nk) (no side information) to O(k² log n / H²(f+‖f−)) and recovers the clusters exactly with probability at least 1 − o_n(1). 
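The quantity H²(f+‖f−) governing both bounds, H²(f‖g) = (1/2) Σᵢ (√f(i) − √g(i))², is straightforward to compute; a minimal sketch (our own illustration, assuming pmfs given as dictionaries over a common discrete support):

```python
from math import sqrt

def squared_hellinger(f, g):
    """Squared Hellinger divergence H^2(f||g) = (1/2) * sum_i (sqrt(f(i)) - sqrt(g(i)))^2
    between two pmfs represented as {value: probability} dictionaries."""
    support = set(f) | set(g)          # union of the two supports
    return 0.5 * sum((sqrt(f.get(i, 0.0)) - sqrt(g.get(i, 0.0))) ** 2
                     for i in support)
```

Identical pmfs give 0, and pmfs with disjoint supports give 1, matching the regime discussion above: H² near 1 means the side information alone essentially suffices, while small H² is the interesting regime.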
The detailed proof of it, as well as the Las Vegas algorithm, are presented in Appendix A and Appendix B respectively in the supplementary material.
Our algorithm uses a subroutine called Membership that takes as input an element v ∈ V and a subset of elements C ⊆ V \ {v}. Assume that f+, f− are discrete distributions over a fixed set of q points a_1, a_2, ..., a_q; that is, w_{i,j} takes value in the set {a_1, a_2, ..., a_q}. Define the empirical "inter" distribution p_{v,C}: for i = 1, ..., q,

p_{v,C}(i) = |{u ∈ C : w_{u,v} = a_i}| / |C|.

Also compute the "intra" distribution p_C: for i = 1, ..., q,

p_C(i) = |{(u, u') ∈ C × C : u ≠ u', w_{u,u'} = a_i}| / (|C|(|C|−1)).

Then we use Membership(v,C) = −H^2(p_{v,C}‖p_C) as the affinity of vertex v to C, where H(p_{v,C}‖p_C) denotes the Hellinger divergence between the distributions. Note that since the membership is always negative, a higher membership implies that the 'inter' and 'intra' distributions are closer in terms of Hellinger distance.
Designing a parameter-free Monte Carlo algorithm seems to be highly challenging, as here the number of queries depends only logarithmically on n. Intuitively, if an element v has the highest membership in some cluster C, then v should be queried with C first. Also, an estimate from side information is reliable only when the cluster already has enough members. Unfortunately, we know neither whether the current cluster size is reliable, nor are we allowed to make even one query per element. To overcome this bottleneck, we propose an iterative-update algorithm which we believe will find more uses in developing parameter-free algorithms. We start by querying a few points so that there is at least one cluster with Θ(log n) points. Now, based on these queried memberships, we learn two empirical distributions: p^1_+ from intra-cluster similarity values, and p^1_− from inter-cluster similarity values.
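The Membership affinity above is straightforward to compute from the similarity matrix. The following is a minimal sketch, assuming the pairwise similarity values are stored in a dictionary `w` keyed by unordered pairs and that the value set (support) is known; all names are illustrative, and the squared Hellinger divergence here uses the normalization H^2(p‖q) = 1 − Σ_i √(p_i q_i).

```python
import math
from collections import Counter

def hellinger_sq(p, q, support):
    """Squared Hellinger divergence H^2(p||q) = 1 - sum_a sqrt(p(a) * q(a))."""
    bc = sum(math.sqrt(p.get(a, 0.0) * q.get(a, 0.0)) for a in support)
    return 1.0 - bc

def membership(v, C, w, support):
    """Affinity of element v to cluster C, namely -H^2(p_{v,C} || p_C).

    w[frozenset((u, x))] is the observed similarity value of the pair {u, x};
    `support` is the value set {a_1, ..., a_q}. Names are illustrative.
    """
    n = len(C)
    # empirical "inter" distribution p_{v,C}: similarity values between v and C
    inter = Counter(w[frozenset((u, v))] for u in C)
    p_vC = {a: inter[a] / n for a in support}
    # empirical "intra" distribution p_C over ordered pairs (u, x), u != x;
    # each unordered pair is counted twice, matching the n(n-1) normalization
    intra = Counter(w[frozenset((u, x))] for u in C for x in C if u != x)
    p_C = {a: intra[a] / (n * (n - 1)) for a in support}
    return -hellinger_sq(p_vC, p_C, support)
```

Since the affinity is the negated (nonnegative) divergence, it is always at most zero, and values near zero indicate that v's similarities to C look like C's internal similarities.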
Given an element v which has not been clustered yet, and a cluster C with the highest number of current members, we would like to consider the submatrix of side information pertaining to v and all u ∈ C and determine whether that side information is generated from f+ or f−. We know that if the statistical distance between f+ and f− is small, then we would need more members in C to successfully do this test. Since we do not know f+ and f−, we compute the squared Hellinger divergence between p^1_+ and p^1_−, and use that to compute a threshold τ_1 on the size of C. If C crosses this size threshold, we just use the side information to determine whether v should belong to C. Otherwise, we query further until there is one cluster with size τ_1, and re-estimate the empirical distributions p^2_+ and p^2_−. Again, we recompute a threshold τ_2, and stop if the cluster under consideration crosses this new threshold. If not, we continue. Interestingly, we can show that when the process converges, we have a very good estimate of H(f+‖f−), and moreover, it converges fast.
Algorithm. Phase 1. Initialization. We initialize the algorithm by selecting any element v and creating a singleton cluster {v}. We then keep selecting new elements randomly and uniformly from those that have not yet been clustered, and query the oracle with each by choosing exactly one element from each of the clusters formed so far. If the oracle returns +1 to any of these queries, then we include the element in the corresponding cluster; else we create a new singleton cluster with it. We continue this process until one cluster has grown to a size of ⌈C log n⌉, where C is a constant.
Phase 2. Iterative Update. Let C_1, C_2, ..., C_{l_x} be the set of clusters formed after the x-th iteration, for some l_x ≤ k, where we consider Phase 1 as the 0-th iteration.
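Phase 1 above rests on a simple primitive, also reused later for elements placed in Waiting(C): query a new element against one representative of each existing cluster until some cluster reaches the ⌈C log n⌉ target. A minimal sketch, under assumed names (the `oracle` callable and the constant `C` are illustrative; we stop querying a given element as soon as the oracle answers +1, which only saves queries relative to the description above):

```python
import math
import random

def phase1_init(elements, oracle, C=2.0):
    """Phase 1 sketch: grow clusters by pairwise queries until one cluster
    reaches ceil(C * log n) members.

    oracle(u, v) returns +1 iff u and v belong to the same cluster, else -1.
    Returns the clusters formed so far and the number of queries spent.
    """
    n = len(elements)
    target = math.ceil(C * math.log(n))
    pool = list(elements)
    random.shuffle(pool)          # select unclustered elements uniformly at random
    clusters, queries = [], 0
    for v in pool:
        placed = False
        for cluster in clusters:
            queries += 1
            if oracle(v, cluster[0]) == +1:   # one representative per cluster
                cluster.append(v)
                placed = True
                break
        if not placed:
            clusters.append([v])  # oracle said "no" to every existing cluster
        if any(len(c) >= target for c in clusters):
            break                 # some cluster has grown to the target size
    return clusters, queries
```

Because every placement is confirmed by the oracle, the clusters produced in this phase are exact; the side information only enters in the later phases.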
We estimate the empirical inter-cluster distribution: for i = 1, ..., q,

p^x_−(i) = |{u ∈ C_s, v ∈ C_t, s < t, s, t ∈ [1, l_x] : w_{u,v} = a_i}| / Σ_{s<t: s,t ∈ [1,l_x]} |C_s||C_t|,

and similarly the intra-cluster distribution p^x_+ from the pairs inside the clusters formed so far. If Membership(v, C) ≥ −(4H^2(p_+‖p_−)/log n), then we include v in Waiting(C). For every element in Waiting(C), we query the oracle with it by choosing exactly one element from each of the clusters formed so far, starting with C. If the oracle returns answer "yes" to any of these queries, then we include the element in that cluster; else we create a new singleton cluster with it. We continue this until Waiting(C) is exhausted.
We then call C completely grown, remove it from further consideration, and move to the next grown cluster. If there is no other grown cluster, then we move back to Phase 2.
Analysis. The main steps of the analysis are as follows (for the full analysis see Appendix A).
1. First, Lemma 3 shows that with high probability H(p_+‖p_−) ∈ [H(f+‖f−) ± 4H^2(p_+‖p_−)/log n], for a suitable constant B that depends on C. Using it, we can show the process converges whenever a cluster has grown to a size of 4C log n / H^2(f+‖f−). The proof relies on adapting Sanov's Theorem (see Lemma 2) of information theory. We are measuring the distance between distributions via Hellinger distance, as opposed to KL divergence (which would have been a natural choice because of its presence in the rate function of Sanov's theorem), because Hellinger distance is a metric, which proves to be crucial in our analysis.
2. Lemma 5 and Corollary 1 show that every element that is included in C in Phase (3A) truly belongs to C, and elements that are not in Waiting(C) cannot be in C with high probability. Once Phase 2 has converged, if the condition of (3A) is satisfied, the element must belong to C.
There is a small gray region of confidence interval (3B) such that if an element belongs there, we cannot be sure either way; but if an element does not satisfy either (3A) or (3B), it cannot be part of C.
3. Lemma 6 shows that the size of Waiting(C) is constant, via an anti-concentration property. This, coupled with the fact that the process converges when a cluster reaches size 4C log n / H^2(f+‖f−), gives the desired query complexity bound in Lemma 7.

4 Experimental Results

In this section, we report experimental results on a popular bibliographic dataset cora [35] consisting of 1879 nodes, 191 clusters and 1699612 edges, out of which 62891 are intra-cluster edges. We remove any singleton node from the dataset; the final number of vertices that we classify is 1812, with 124 clusters. We use the similarity function computation used by [18] to compute f+ and f−. The two distributions are shown in Figure 1 on the left. The squared Hellinger divergence between the two distributions is 0.6. In order to observe the dependency of the algorithm's performance on the learnt distributions, we perturb the exact distributions to obtain two approximate distributions, as shown in Figure 1 (middle), with squared Hellinger divergence 0.4587. We consider three strategies. Suppose the cluster in which a node v must be included has already been initialized and exists in the current solution. Moreover, suppose the algorithm decides to use queries to find the membership of v. Then in the best strategy, only one query is needed to identify the cluster in which v belongs. In the worst strategy, the algorithm finds the correct cluster after querying all the existing clusters whose current membership is not enough to take a decision using side information.
In the greedy strategy, the algorithm queries the clusters in non-decreasing order of squared Hellinger divergence between f+ (or the approximate version of it) and the distribution estimated from the side information between v and each existing cluster. Note that, in practice, we will follow the greedy strategy. Figure 2 shows the performance of each strategy. We plot the number of queries vs the F1 score, which is the harmonic mean of precision and recall. We observe that the performance of the greedy strategy is very close to that of best. With just 1136 queries, greedy achieves 80% precision and close to 90% recall. The best strategy would need 962 queries to achieve that performance. The performance of our algorithm on the exact and approximate distributions is also very close, which indicates it is enough to learn a distribution that is close to exact. For example, using the approximate distributions, to achieve similar precision and recall, the greedy strategy uses just 1148 queries, that is, 12 queries more than when the distributions are known.

Figure 1: (left) Exact distributions of similarity values, (middle) approximate distributions of similarity values, (right) Number of Queries vs F1 Score for both distributions.

Figure 2: Number of Queries vs F1 Score using three strategies: best, greedy, worst.

Discussion. This is the first rigorous theoretical study of interactive clustering with side information, and it unveils many interesting directions for future study of both theoretical and practical significance (see Appendix D for more details). Having arbitrary f+, f− is a generalization of SBM. It also raises an important question about how the SBM recovery threshold changes with queries.
For the sparse region of SBM, where f+ is Bernoulli(a' log n / n) and f− is Bernoulli(b' log n / n), a' > b', Lemma 1 is not tight yet. However, it shows the following trend. Let us set a = n/k in Lemma 1 with the above f+, f−. We conjecture, ignoring the lower-order terms and a √(log n) factor, that with Q queries the sharp recovery threshold of sparse SBM changes from (√a' − √b') ≥ √k to (√a' − √b') ≥ √(k(1 − Q/(nk))). Proving this bound remains an exciting open question.
We propose two computationally efficient algorithms that match the query complexity lower bound within a log n factor and are completely parameter free. In particular, our iterative-update method to design the Monte Carlo algorithm provides a general recipe to develop parameter-free algorithms, which are of extreme practical importance. The convergence result is established by extending Sanov's theorem from large deviation theory, which gives a bound only in terms of KL-divergence. Due to the generality of the distributions, the only tool we could use is Sanov's theorem. However, Hellinger distance comes out to be the right measure both for the lower and upper bounds. If f+ and f− are common distributions like Gaussian, Bernoulli etc., then concentration results stronger than Sanov's theorem may be applied to improve the constants and a logarithm factor, and to show the trade-off between queries and thresholds as in sparse SBM. While some of our results apply to general f_{i,j}'s, a full picture with arbitrary f_{i,j}'s and closing the gap of log n between the lower and upper bounds remain important future directions.

Acknowledgement. This work is supported in part by NSF awards CCF 1642658, CCF 1642550, CCF 1464310, CCF 1652303, a Yahoo ACE Award and a Google Faculty Research Award.
We are particularly thankful to an anonymous reviewer whose comments led to notable improvement of the presentation of the paper.

References

[1] E. Abbe, A. S. Bandeira, and G. Hall. Exact recovery in the stochastic block model. IEEE Trans. Information Theory, 62(1):471–487, 2016.

[2] E. Abbe and C. Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 670–688, 2015.

[3] E. Abbe and C. Sandon. Recovering communities in the general stochastic block model without knowing the parameters. In Advances in Neural Information Processing Systems, pages 676–684, 2015.

[4] M. Ajtai, J. Komlós, W. L. Steiger, and E. Szemerédi. Deterministic selection in O(log log n) parallel time. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, pages 188–195. ACM, 1986.

[5] H. Ashtiani, S. Kushagra, and S. Ben-David. Clustering with same-cluster queries. NIPS, 2016.

[6] P. Awasthi, M.-F. Balcan, and K. Voevodski. Local algorithms for interactive clustering. In ICML, pages 550–558, 2014.

[7] M.-F. Balcan and A. Blum. Clustering with interactive feedback. In International Conference on Algorithmic Learning Theory, pages 316–328. Springer, 2008.

[8] B. Bollobás and G. Brightwell. Parallel selection with high probability. SIAM Journal on Discrete Mathematics, 3(1):21–31, 1990.

[9] K. Chaudhuri, F. C. Graham, and A. Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. In COLT, pages 35–1, 2012.

[10] Y. Chen, G. Kamath, C. Suh, and D. Tse. Community recovery in graphs with locality. In Proceedings of The 33rd International Conference on Machine Learning, pages 689–698, 2016.

[11] P. Chin, A. Rao, and V. Vu.
Stochastic block model and community detection in the sparse graphs: A spectral algorithm with optimal rate of recovery. arXiv preprint arXiv:1501.05021, 2015.

[12] N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In WWW, pages 285–294, 2013.

[13] S. B. Davidson, S. Khanna, T. Milo, and S. Roy. Top-k and clustering with noisy comparisons. ACM Trans. Database Syst., 39(4):35:1–35:39, 2014.

[14] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.

[15] M. E. Dyer and A. M. Frieze. The solution of some random NP-hard problems in polynomial expected time. Journal of Algorithms, 10(4):451–489, 1989.

[16] U. Feige, P. Raghavan, D. Peleg, and E. Upfal. Computing with noisy information. SIAM Journal on Computing, 23(5):1001–1018, 1994.

[17] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

[18] D. Firmani, B. Saha, and D. Srivastava. Online entity resolution using an oracle. PVLDB, 9(5):384–395, 2016.

[19] A. Gadde, E. E. Gad, S. Avestimehr, and A. Ortega. Active learning for community detection in stochastic block models. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 1889–1893. IEEE, 2016.

[20] L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. PVLDB, 5(12):2018–2019, 2012.

[21] A. Ghosh, S. Kale, and P. McAfee. Who moderates the moderators?: crowdsourcing abuse detection in user-generated content. In EC, pages 167–176, 2011.

[22] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching.
In SIGMOD Conference, pages 601–612, 2014.

[23] A. Guntuboyina. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Transactions on Information Theory, 57(4):2386–2399, 2011.

[24] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming. IEEE Transactions on Information Theory, 62(5):2788–2797, 2016.

[25] B. E. Hajek, Y. Wu, and J. Xu. Computational lower bounds for community detection on random graphs. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 899–928, 2015.

[26] T. S. Han and S. Verdú. Generalizing the Fano inequality. IEEE Transactions on Information Theory, 40(4):1247–1251, 1994.

[27] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[28] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[29] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In NIPS, pages 1953–1961, 2011.

[30] H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 3(1-2):484–493, 2010.

[31] S. H. Lim, Y. Chen, and H. Xu. Clustering from labels and time-varying graphs. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1188–1196. Curran Associates, Inc., 2014.

[32] A. Mazumdar and B. Saha. Clustering via crowdsourcing. arXiv preprint arXiv:1604.01839, 2016.

[33] A. Mazumdar and B. Saha. A theoretical analysis of first heuristics of crowdsourced entity resolution. The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.

[34] A.
Mazumdar and B. Saha. Clustering with noisy queries. In Advances in Neural Information Processing Systems (NIPS) 31, 2017.

[35] A. McCallum, 2004. http://www.cs.umass.edu/~mcallum/data/cora-refs.tar.gz.

[36] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for the planted bisection model. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pages 69–75. ACM, 2015.

[37] Y. Polyanskiy and S. Verdú. Arimoto channel coding converse and Rényi divergence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1327–1333. IEEE, 2010.

[38] I. Sason and S. Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.

[39] V. Verroios and H. Garcia-Molina. Entity resolution with crowd errors. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015, pages 219–230, 2015.

[40] N. Vesdapunt, K. Bellare, and N. Dalvi. Crowdsourcing algorithms for entity resolution. PVLDB, 7(12):1071–1082, 2014.

[41] R. K. Vinayak and B. Hassibi. Crowdsourced clustering: Querying edges vs triangles. In Advances in Neural Information Processing Systems, pages 1316–1324, 2016.

[42] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012.

[43] J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD Conference, pages 229–240, 2013.