{"title": "Flexible Models for Microclustering with Application to Entity Resolution", "book": "Advances in Neural Information Processing Systems", "page_first": 1417, "page_last": 1425, "abstract": "Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.", "full_text": "Flexible Models for Microclustering with\n\nApplication to Entity Resolution\n\nGiacomo Zanella\u2217\n\nDepartment of Decision Sciences\n\nBocconi University\n\ngiacomo.zanella@unibocconi.it\n\nBrenda Betancourt\u2217\n\nDepartment of Statistical Science\n\nDuke University\n\nbb222@stat.duke.edu\n\nHanna Wallach\nMicrosoft Research\nhanna@dirichlet.net\n\nJeffrey Miller\n\nAbbas Zaidi\n\nDepartment of Biostatistics\n\nDepartment of Statistical Science\n\nHarvard University\n\njwmiller@hsph.harvard.edu\n\nDuke University\n\namz19@stat.duke.edu\n\nDepartments of Statistical Science and Computer Science\n\nRebecca C. 
Steorts\nDuke University\nbeka@stat.duke.edu\n\nAbstract\n\nMost generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman\u2013Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.\n\n1 Introduction\n\nMany clustering applications require models that assume cluster sizes grow linearly with the size of the data set. These applications include topic modeling, inferring population structure, and discriminating among cancer subtypes. Infinitely exchangeable clustering models, including finite mixture models, Dirichlet process mixture models, and Pitman\u2013Yor process mixture models, all make this linear-growth assumption, and have seen numerous successes when used in these contexts. For other clustering applications, such as entity resolution, this assumption is inappropriate. Entity resolution (including record linkage and de-duplication) involves identifying duplicate\u00b2 records in noisy databases [1, 2], traditionally by directly linking records to one another. 
Unfortunately, this traditional approach is computationally infeasible for large data sets\u2014a serious limitation in \u201cthe age of big data\u201d [1, 3]. As a result, researchers increasingly treat entity resolution as a clustering problem, where each entity is implicitly associated with one or more records and the inference goal is to recover the latent entities (clusters) that correspond to the observed records (data points) [4, 5, 6]. In contrast to other clustering applications, the number of data points in each cluster should remain small, even for large data sets. Applications like this require models that yield clusters whose sizes grow sublinearly with the total number of data points [7]. To address this requirement, we define the microclustering property in section 2 and, in section 3, introduce a new class of models that can exhibit this property. In section 4, we compare two models within this class to two commonly used infinitely exchangeable clustering models.\n\n\u2217Giacomo Zanella and Brenda Betancourt are joint first authors.\n\u00b2In the entity resolution literature, the term \u201cduplicate records\u201d does not mean that the records are identical, but rather that the records are corrupted, degraded, or otherwise noisy representations of the same entity.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n2 The Microclustering Property\n\nTo cluster N data points x1, . . . , xN using a partition-based Bayesian clustering model, one first places a prior over partitions of [N] = {1, . . . , N}. Then, given a partition CN of [N], one models the data points in each part c \u2208 CN as jointly distributed according to some chosen distribution. Finally, one computes the posterior distribution over partitions and, e.g., uses it to identify probable partitions of [N]. 
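As an illustration of this partition-based setup (ours, not the paper's), the following minimal sketch enumerates every partition of a tiny binary data set and computes the exact posterior over partitions. The uniform prior over partitions and the Beta(1,1)\u2013Bernoulli per-cluster marginal likelihood are stand-in modeling choices for illustration only; the paper's priors would slot into the same scheme.

```python
import math

def set_partitions(items):
    """Recursively enumerate all partitions of a list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # put `first` into each existing block, or into a new block of its own
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        yield [[first]] + smaller

def block_marglik(xs):
    """Beta(1,1)-Bernoulli marginal likelihood of one cluster of binary data."""
    ones = sum(xs)
    zeros = len(xs) - ones
    return math.factorial(ones) * math.factorial(zeros) / math.factorial(len(xs) + 1)

def posterior_over_partitions(x):
    """Exact posterior over partitions of [N] under a uniform prior on partitions."""
    partitions = list(set_partitions(list(range(len(x)))))
    scores = [math.prod(block_marglik([x[i] for i in block]) for block in p)
              for p in partitions]
    total = sum(scores)
    return [(p, s / total) for p, s in zip(partitions, scores)]

# Example: three binary "records"; the posterior mode groups the two
# identical records together and isolates the third.
post = posterior_over_partitions([0, 0, 1])
best = max(post, key=lambda t: t[1])[0]
```

With three data points there are five partitions; exhaustive enumeration is only feasible for toy N, which is why the paper relies on MCMC over partitions instead.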
Mixture models are a well-known type of partition-based Bayesian clustering model, in which CN is implicitly represented by a set of cluster assignments z1, . . . , zN. These cluster assignments can be regarded as the first N elements of an infinite sequence z1, z2, . . ., drawn a priori from\n\n\u03c0 \u223c H and z1, z2, . . . | \u03c0 iid\u223c \u03c0, (1)\n\nwhere H is a prior over \u03c0 and \u03c0 is a vector of mixture weights with \u2211_l \u03c0_l = 1 and \u03c0_l \u2265 0 for all l. Commonly used mixture models include (a) finite mixtures where the dimensionality of \u03c0 is fixed and H is usually a Dirichlet distribution; (b) finite mixtures where the dimensionality of \u03c0 is a random variable [8, 9]; (c) Dirichlet process (DP) mixtures where the dimensionality of \u03c0 is infinite [10]; and (d) Pitman\u2013Yor process (PYP) mixtures, which generalize DP mixtures [11].\n\nEquation 1 implicitly defines a prior over partitions of N = {1, 2, . . .}. Any random partition of N induces a sequence of random partitions (CN : N = 1, 2, . . .), where CN is a partition of [N]. Via the strong law of large numbers, the cluster sizes in any such sequence obtained via equation 1 grow linearly with N because, with probability one, for all l, (1 / N) \u2211_{n=1}^{N} I(zn = l) \u2192 \u03c0_l as N \u2192 \u221e, where I(\u00b7) denotes the indicator function. Unfortunately, this linear-growth assumption is not appropriate for entity resolution and other applications that require clusters whose sizes grow sublinearly with N. To address this requirement, we therefore define the microclustering property: A sequence of random partitions (CN : N = 1, 2, . . .) exhibits the microclustering property if MN is o_p(N), where MN is the size of the largest cluster in CN, or, equivalently, if MN / N \u2192 0 in probability as N \u2192 \u221e.\n\nA clustering model exhibits the microclustering property if the sequence of random partitions implied by that model satisfies the above definition. No mixture model can exhibit the microclustering property (unless its parameters are allowed to vary with N). In fact, Kingman's paintbox theorem [12, 13] implies that any exchangeable partition of N, such as a partition obtained using equation 1, is either equal to the trivial partition in which each part contains one element or satisfies lim inf_{N\u2192\u221e} MN / N > 0 with positive probability. By Kolmogorov's extension theorem, a sequence of random partitions (CN : N = 1, 2, . . .) corresponds to an exchangeable random partition of N whenever (a) each CN is finitely exchangeable (i.e., its probability is invariant under permutations of {1, . . . , N}) and (b) the sequence is projective (also known as consistent in distribution)\u2014i.e., if N\u2032 < N, the distribution over CN\u2032 coincides with the marginal distribution over partitions of [N\u2032] induced by the distribution over CN. Therefore, to obtain a nontrivial model that exhibits the microclustering property, we must sacrifice either (a) or (b). Previous work [14] sacrificed (a); in this paper, we instead sacrifice (b).\n\nSacrificing finite exchangeability and sacrificing projectivity have very different consequences. If a partition-based Bayesian clustering model is not finitely exchangeable, then inference will depend on the order of the data points. For most applications, this consequence is undesirable\u2014there is no reason to believe that the order of the data points is meaningful.
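As a brief empirical aside (ours, not the paper's), the paintbox dichotomy above is easy to see in simulation: under an infinitely exchangeable prior such as the Chinese restaurant process representation of a DP mixture, the estimated largest-cluster fraction E[MN / N] stays bounded away from zero as N grows. A minimal sketch, with the concentration parameter alpha = 1 as an assumed illustrative value:

```python
import random

def crp_cluster_sizes(n, alpha=1.0, rng=None):
    """Sample the cluster sizes of a Chinese restaurant process partition of [n]."""
    rng = rng or random.Random(0)
    sizes = []
    for i in range(n):
        r = rng.random() * (i + alpha)  # total weight: i seated customers + alpha
        acc = 0.0
        for k in range(len(sizes)):
            acc += sizes[k]             # join cluster k with probability sizes[k]/(i+alpha)
            if r < acc:
                sizes[k] += 1
                break
        else:
            sizes.append(1)             # new cluster with probability alpha/(i+alpha)
    return sizes

def mean_max_fraction(n, reps=200, alpha=1.0):
    """Monte Carlo estimate of E[M_N / N] under the CRP prior."""
    rng = random.Random(1)
    return sum(max(crp_cluster_sizes(n, alpha, rng)) / n for _ in range(reps)) / reps
```

Running mean_max_fraction(100) and mean_max_fraction(800) gives estimates of similar magnitude (roughly 0.6 for alpha = 1) rather than values decaying toward zero, consistent with the lim inf statement above.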
In contrast, if a model lacks projectivity, then the implied joint distribution over a subset of the data points in a data set will not be the same as the joint distribution obtained by modeling the subset directly. In the context of entity resolution, sacrificing projectivity is a more natural and less restrictive choice than sacrificing finite exchangeability.\n\n3 Kolchin Partition Models for Microclustering\n\nWe introduce a new class of Bayesian models for microclustering by placing a prior on the number of clusters K and, given K, modeling the cluster sizes N1, . . . , NK directly. We start by defining\n\nK \u223c \u03ba and N1, . . . , NK | K iid\u223c \u00b5, (2)\n\nwhere \u03ba = (\u03ba1, \u03ba2, . . .) and \u00b5 = (\u00b51, \u00b52, . . .) are probability distributions over N = {1, 2, . . .}. We then define N = \u2211_{k=1}^{K} Nk and, given N1, . . . , NK, generate a set of cluster assignments z1, . . . , zN by drawing a vector uniformly at random from the set of permutations of (1, . . . , 1, 2, . . . , 2, . . . , K, . . . , K), in which each k \u2208 {1, . . . , K} appears Nk times. The cluster assignments z1, . . . , zN induce a random partition CN of [N], where N is itself a random variable\u2014i.e., CN is a random partition of a random number of elements. We refer to the resulting class of marginal distributions over CN as Kolchin partition (KP) models [15, 16] because the form of equation 2 is closely related to Kolchin's representation theorem for Gibbs-type partitions (see, e.g., [16], theorem 1.2). For appropriate choices of \u03ba and \u00b5, KP models can exhibit the microclustering property (see appendix B for an example).\n\nIf CN denotes the set of all possible partitions of [N], then \u222a_{N=1}^{\u221e} CN is the set of all possible partitions of [N] for all N \u2208 N. The probability of any given partition CN \u2208 \u222a_{N=1}^{\u221e} CN is\n\nP(CN) = (|CN|! \u03ba_{|CN|} / N!) \u220f_{c \u2208 CN} |c|! \u00b5_{|c|}, (3)\n\nwhere | \u00b7 | denotes the cardinality of a set, |CN| is the number of clusters in CN, and |c| is the number of elements in cluster c. In practice, however, N is usually observed. Conditioned on N, a KP model implies that P(CN | N) \u221d |CN|! \u03ba_{|CN|} \u220f_{c \u2208 CN} |c|! \u00b5_{|c|}. Equation 3 leads to a \u201creseating algorithm\u201d\u2014much like the Chinese restaurant process (CRP)\u2014derived by sampling from P(CN | N, CN\\n), where CN\\n is the partition obtained by removing element n from CN:\n\n\u2022 for n = 1, . . . , N, reassign element n to\n  \u2013 an existing cluster c \u2208 CN\\n with probability \u221d (|c| + 1) \u00b5_{|c|+1} / \u00b5_{|c|}\n  \u2013 or a new cluster with probability \u221d (|CN\\n| + 1) (\u03ba_{|CN\\n|+1} / \u03ba_{|CN\\n|}) \u00b5_1.\n\nWe can use this reseating algorithm to draw samples from P(CN | N); however, unlike the CRP, it does not produce an exact sample if it is used to incrementally construct a partition from the empty set. In practice, this limitation does not lead to any negative consequences because standard posterior inference sampling methods do not rely on this property. When a KP model is used as the prior in a partition-based clustering model\u2014e.g., as an alternative to equation 1\u2014the resulting Gibbs sampling algorithm for CN is similar to this reseating algorithm, but accompanied by likelihood terms. Unfortunately, this algorithm is slow for large data sets. 
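The reseating algorithm above translates directly into code. The sketch below performs one sweep over a labeled partition with N >= 2 elements; the geometric pmfs used in the example are illustrative assumptions, and the paper's NBNB or NBD choices of kappa and mu would slot in the same way:

```python
import random
from collections import Counter

def reseating_sweep(z, mu, kappa, rng=None):
    """One pass of the KP reseating algorithm over cluster labels z (N >= 2).

    mu(m):    prior pmf of a cluster having size m, on {1, 2, ...}
    kappa(k): prior pmf of there being k clusters, on {1, 2, ...}
    """
    rng = rng or random.Random(0)
    for n in range(len(z)):
        z[n] = None                                   # remove element n
        sizes = Counter(a for a in z if a is not None)
        k = len(sizes)                                # |C_{N \ n}|
        labels = list(sizes)
        # existing cluster c: weight (|c| + 1) * mu(|c| + 1) / mu(|c|)
        weights = [(sizes[c] + 1) * mu(sizes[c] + 1) / mu(sizes[c]) for c in labels]
        # brand-new cluster: weight (k + 1) * kappa(k + 1) / kappa(k) * mu(1)
        labels.append(max(labels) + 1)
        weights.append((k + 1) * kappa(k + 1) / kappa(k) * mu(1))
        r = rng.random() * sum(weights)
        acc = 0.0
        for label, w in zip(labels, weights):
            acc += w
            if r < acc:
                z[n] = label
                break
        else:                                         # guard against round-off
            z[n] = labels[-1]
    return z

# Illustrative pmfs (our assumption, not the paper's): Geometric(1/2) on {1, 2, ...}
geom = lambda m: 0.5 ** m
z = reseating_sweep([0, 0, 1, 1, 2, 2], mu=geom, kappa=geom)
```

Used as a Gibbs move inside a clustering model, each cluster weight would additionally be multiplied by the likelihood term for assigning the data point to that cluster, as the text notes.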
In appendix C, we therefore propose a faster Gibbs sampling algorithm\u2014the chaperones algorithm\u2014that is particularly well suited to microclustering. In sections 3.1 and 3.2, we introduce two related KP models for microclustering, and in section 3.4 we explain how KP models can be applied in the context of entity resolution with categorical data.\n\n3.1 The NBNB Model\n\nWe start with equation 3 and define\n\n\u03ba = NegBin(a, q) and \u00b5 = NegBin(r, p), (4)\n\nwhere NegBin(a, q) and NegBin(r, p) are negative binomial distributions truncated to N = {1, 2, . . .}. We assume that a > 0 and q \u2208 (0, 1) are fixed hyperparameters, while r and p are distributed as r \u223c Gam(\u03b7r, sr) and p \u223c Beta(up, vp) for fixed \u03b7r, sr, up and vp.\u00b3 We refer to the resulting marginal distribution over CN as the negative binomial\u2013negative binomial (NBNB) model.\n\nBy substituting equation 4 into equation 3, we obtain the probability of CN conditioned on N:\n\nP(CN | N, a, q, r, p) \u221d \u0393(|CN| + a) \u03b2^{|CN|} \u220f_{c \u2208 CN} \u0393(|c| + r) / \u0393(r), (5)\n\nwhere \u03b2 = q (1 \u2212 p)^r / (1 \u2212 (1 \u2212 p)^r). We provide the complete derivation of equation 5, along with the conditional posterior distributions over r and p, in appendix A.2. Posterior inference for the NBNB model involves alternating between (a) sampling CN from P(CN | N, a, q, r, p) using the chaperones algorithm and (b) sampling r and p from their respective conditional posteriors using, e.g., slice sampling [17].\n\n\u00b3We use the shape-and-rate parameterization of the gamma distribution.\n\nFigure 1: The NBNB (left) and NBD (right) models appear to exhibit the microclustering property.\n\n3.2 The NBD Model\n\nAlthough \u03ba = NegBin(a, q) will yield plausible values of K, \u00b5 = NegBin(r, p) may not be sufficiently flexible to capture realistic properties of N1, . . . , NK, especially when K is large. For example, in a record-linkage application involving two otherwise noise-free databases containing thousands of records, K will be large and each Nk will be at most two. A negative binomial distribution cannot capture this property. We therefore define a second KP model\u2014the negative binomial\u2013Dirichlet (NBD) model\u2014by taking a nonparametric approach to modeling N1, . . . , NK and drawing \u00b5 from an infinite-dimensional Dirichlet distribution over the positive integers:\n\n\u03ba = NegBin(a, q) and \u00b5 | \u03b1, \u00b5(0) \u223c Dir(\u03b1, \u00b5(0)), (6)\n\nwhere \u03b1 > 0 is a fixed concentration parameter and \u00b5(0) = (\u00b5(0)_1, \u00b5(0)_2, . . .) is a fixed base measure with \u2211_{m=1}^{\u221e} \u00b5(0)_m = 1 and \u00b5(0)_m \u2265 0 for all m. The probability of CN conditioned on N and \u00b5 is\n\nP(CN | N, a, q, \u00b5) \u221d \u0393(|CN| + a) q^{|CN|} \u220f_{c \u2208 CN} |c|! \u00b5_{|c|}. (7)\n\nPosterior inference for the NBD model involves alternating between (a) sampling CN from P(CN | N, a, q, \u00b5) using the chaperones algorithm and (b) sampling \u00b5 from its conditional posterior:\n\n\u00b5 | CN, \u03b1, \u00b5(0) \u223c Dir(\u03b1 \u00b5(0)_1 + L1, \u03b1 \u00b5(0)_2 + L2, . . .), (8)\n\nwhere Lm is the number of clusters of size m in CN. Although \u00b5 is an infinite-dimensional vector, only the first N elements affect P(CN | a, q, \u00b5). Therefore, it is sufficient to sample the (N + 1)-dimensional vector (\u00b51, . . . , \u00b5N, 1 \u2212 \u2211_{m=1}^{N} \u00b5m) from equation 8, modified accordingly, and retain only \u00b51, . . . , \u00b5N. 
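The truncated update in equation 8 is straightforward to implement via the standard construction of a Dirichlet draw as normalized independent gamma variables. A minimal sketch (the geometric base measure mirrors the choice used later in the experiments; the tiny floor on the tail mass is a numerical guard of ours):

```python
import random

def sample_mu_truncated(cluster_sizes, alpha, mu0, rng=None):
    """Draw (mu_1, ..., mu_N) from the conditional posterior in equation 8,
    truncated to the (N + 1)-dimensional vector described in the text.

    cluster_sizes: sizes N_1, ..., N_K of the clusters in C_N
    alpha:         Dirichlet concentration parameter
    mu0(m):        base-measure pmf on {1, 2, ...}
    """
    rng = rng or random.Random(0)
    N = sum(cluster_sizes)
    counts = [0] * (N + 1)
    for s in cluster_sizes:
        counts[s] += 1                  # L_m = number of clusters of size m
    # concentrations alpha * mu0(m) + L_m for m = 1..N, plus the leftover tail mass
    conc = [alpha * mu0(m) + counts[m] for m in range(1, N + 1)]
    conc.append(alpha * max(1.0 - sum(mu0(m) for m in range(1, N + 1)), 1e-12))
    # a Dirichlet draw is a vector of independent Gamma(c, 1) variables, normalized
    gammas = [rng.gammavariate(c, 1.0) for c in conc]
    total = sum(gammas)
    return [g / total for g in gammas[:-1]]   # retain only mu_1, ..., mu_N

# Example: three clusters of sizes 1, 1, 2 (so N = 4), Geometric(1/2) base measure
mu = sample_mu_truncated([1, 1, 2], alpha=1.0, mu0=lambda m: 0.5 ** m)
```

The discarded final component is the mass the truncated Dirichlet assigns to cluster sizes larger than N, which cannot occur in a partition of N elements.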
We provide complete derivations of equations 7 and 8 in appendix A.3.\n\n3.3 The Microclustering Property for the NBNB and NBD Models\n\nFigure 1 contains empirical evidence suggesting that the NBNB and NBD models both exhibit the microclustering property. For each model, we generated samples of MN / N for N = 100, . . . , 10^4. For the NBNB model, we set a = 1, q = 0.5, r = 1, and p = 0.5 and generated the samples using rejection sampling. For the NBD model, we set a = 1, q = 0.5, and \u03b1 = 1 and set \u00b5(0) to be a geometric distribution over N = {1, 2, . . .} with a parameter of 0.5. We generated the samples using MCMC methods. For both models, MN / N appears to converge to zero in probability as N \u2192 \u221e, as desired. In appendix B, we also prove that a variant of the NBNB model exhibits the microclustering property.\n\n3.4 Application to Entity Resolution\n\nKP models can be used to perform entity resolution. In this context, the data points x1, . . . , xN are observed records and the K clusters are latent entities. If each record consists of F categorical fields, then\n\nCN \u223c KP model (9)\n\u03b8fk | \u03b4f, \u03b3f \u223c Dir(\u03b4f, \u03b3f) (10)\nzn \u223c \u03b6(CN, n) (11)\nxfn | zn, \u03b8f1, . . . , \u03b8fK \u223c Cat(\u03b8f zn) (12)\n\nfor f = 1, . . . , F, k = 1, . . . , K, and n = 1, . . . , N, where \u03b6(CN, n) maps the nth record to a latent cluster assignment zn according to CN. We assume that \u03b4f > 0 is distributed as \u03b4f \u223c Gam(1, 1), while \u03b3f is fixed. Via Dirichlet\u2013multinomial conjugacy, we can marginalize over \u03b811, . . . , \u03b8FK to obtain a closed-form expression for P(x1, . . . , xN | z1, . . . , zN, \u03b4f, \u03b3f). Posterior inference involves alternating between (a) sampling CN from P(CN | x1, . . . , xN, \u03b4f) using the chaperones algorithm accompanied by appropriate likelihood terms, (b) sampling the parameters of the KP model from their conditional posteriors, and (c) sampling \u03b4f from its conditional posterior using slice sampling.\n\n4 Experiments\n\nIn this section, we compare two entity resolution models based on the NBNB model and the NBD model to two similar models based on the DP mixture model [10] and the PYP mixture model [11]. All four models use the likelihood in equations 10 and 12. For the NBNB model and the NBD model, we set a and q to reflect a weakly informative prior belief that E[K] = \u221aVar[K] = N / 2. For the NBNB model, we set \u03b7r = sr = 1 and up = vp = 2.\u2074 For the NBD model, we set \u03b1 = 1 and set \u00b5(0) to be a geometric distribution over N = {1, 2, . . .} with a parameter of 0.5. This base measure reflects a prior belief that E[Nk] = 2. Finally, to ensure a fair comparison between the two different classes of model, we set the DP and PYP concentration parameters to reflect a prior belief that E[K] = N / 2.\n\nWe assess how well each model \u201cfits\u201d four data sets typical of those arising in real-world entity resolution applications. For each data set, we consider four statistics: (a) the number of singleton clusters, (b) the maximum cluster size, (c) the mean cluster size, and (d) the 90th percentile of cluster sizes. We compare each statistic's true value to its posterior distribution according to each of the models. For each model and data set combination, we also consider five entity-resolution summary statistics: (a) the posterior expected number of clusters, (b) the posterior standard error, (c) the false negative rate, (d) the false discovery rate, and (e) the posterior expected value of \u03b4f = \u03b4 for f = 1, . . . , F. The false negative and false discovery rates are both invariant under permutations of 1, . . .
, K [5, 18].\n\n4.1 Data Sets\n\nWe constructed four realistic data sets, each consisting of N records associated with K entities.\n\nItaly: We derived this data set from the Survey on Household Income and Wealth, conducted by the Bank of Italy every two years. There are nine categorical fields, including year of birth, employment status, and highest level of education attained. Ground truth is available via unique identifiers based upon social security numbers; roughly 74% of the clusters are singletons. We used the 2008 and 2010 databases from the Friuli region to create a record-linkage data set consisting of N = 789 records; each Nk is at most two. We discarded the records themselves, but preserved the number of fields, the empirical distribution of categories for each field, the number of clusters, and the cluster sizes. We then generated synthetic records using equations 10 and 12. We created three variants of this data set, corresponding to \u03b4 = 0.02, 0.05, 0.1. For all three, we used the empirical distribution of categories for field f as \u03b3f. By generating synthetic records in this fashion, we preserve the pertinent characteristics of the original data, while making it easy to isolate the impacts of the different priors over partitions.\n\n\u2074We used p \u223c Beta(2, 2) because a uniform prior implies an unrealistic prior belief that E[Nk] = \u221e.\n\u2075http://www.nltcs.aas.duke.edu/\n\nNLTCS5000: We derived this data set from the National Long Term Care Survey (NLTCS)\u2075\u2014a longitudinal survey of older Americans, conducted roughly every six years. We used four of the available fields: date of birth, sex, state of residence, and regional office. We split date of birth into three separate fields: day, month, and year. Ground truth is available via social security numbers; roughly 68% of the clusters are singletons. 
We used the 1982, 1989, and 1994 databases and down-sampled the records, preserving the proportion of clusters of each size and the maximum cluster size, to create a record-linkage data set of N = 5,000 records; each Nk is at most three. We then generated synthetic records using the same approach that we used to create the Italy data set.\n\nSyria2000 and SyriaSizes: We constructed these data sets from data collected by four human-rights groups between 2011 and 2014 on people killed in the Syrian conflict [19, 20]. Hand-matched ground truth is available from the Human Rights Data Analysis Group. Because the records were hand matched, the data are noisy and potentially biased. Performing entity resolution is non-trivial because there are only three categorical fields: gender, governorate, and date of death. We split date of death, which is present for most records, into three separate fields: day, month, and year. However, because the records only span four years, the year field conveys little information. In addition, most records are male, and there are only fourteen governorates. We created the Syria2000 data set by down-sampling the records, preserving the proportion of clusters of each size, to create a data set of N = 2,000 records; the maximum cluster size is five. We created the SyriaSizes data set by down-sampling the records, preserving some of the larger clusters (which necessarily contain within-database duplications), to create a data set of N = 6,700 records; the maximum cluster size is ten. We provide the empirical distribution over cluster sizes for each data set in appendix D. We generated synthetic records for both data sets using the same approach that we used to create the Italy data set.\n\n4.2 Results\n\nWe report the results of our experiments in table 1 and figure 2. 
The NBNB and NBD models outperformed the DP and PYP models for almost all variants of the Italy and NLTCS5000 data sets. In general, the NBD model performed the best of the four, and the differences between the models' performance grew as the value of \u03b4 increased. For the Syria2000 and SyriaSizes data sets, we see no consistent pattern in the models' abilities to recover the true values of the data-set statistics. Moreover, all four models had poor false negative and false discovery rates\u2014most likely because these data sets are extremely noisy and contain very few fields. We suspect that no entity resolution model would perform well for these data sets. For three of the four data sets, the exception being the Syria2000 data set, the DP model and the PYP model both greatly overestimated the number of clusters for larger values of \u03b4. Taken together, these results suggest that the flexibility of the NBNB and NBD models makes them more appropriate choices for most entity resolution applications.\n\n5 Summary\n\nInfinitely exchangeable clustering models assume that cluster sizes grow linearly with the size of the data set. Although this assumption is reasonable for some applications, it is inappropriate for others. For example, when entity resolution is treated as a clustering problem, the number of data points in each cluster should remain small, even for large data sets. Applications like this require models that yield clusters whose sizes grow sublinearly with the size of the data set. We introduced the microclustering property as one way to characterize models that address this requirement. We then introduced a highly flexible class of models\u2014KP models\u2014that can exhibit this property. We presented two models
We presented two models\nwithin this class\u2014the NBNB model and the NBD model\u2014and showed that they are better suited\nto entity resolution applications than two in\ufb01nitely exchangeable clustering models. We therefore\nrecommend KP models for applications where the size of each cluster should be unrelated to the size\nof the data set, and each cluster should contain a negligible fraction of the total number of data points.\n\nAcknowledgments\n\nWe thank Tamara Broderick, David Dunson, Merlise Clyde, and Abel Rodriguez for conversations\nthat helped form the ideas in this paper. In particular, Tamara Broderick played a key role in develop-\ning the idea of microclustering. We also thank the Human Rights Data Analysis Group for providing\nus with data. This work was supported in part by NSF grants SBE-0965436, DMS-1045153, and\nIIS-1320219; NIH grant 5R01ES017436-05; the John Templeton Foundation; the Foerster-Bernstein\nPostdoctoral Fellowship; the UMass Amherst CIIR; and an EPSRC Doctoral Prize Fellowship.\n\n6\n\n\f(a) Italy: NBD model > NBNB model > PYP mixture model > DP mixture model.\n\n(b) NLTCS5000: NBD model > NBNB model > PYP mixture model > DP mixture model.\n\n(c) Syria2000: the models perform similarly because there are so few \ufb01elds.\n\n(d) SyriaSizes: the models perform similarly because there are so few \ufb01elds.\n\nFigure 2: Box plots depicting the true value (dashed line) of each data-set statistic for each variant of\neach data set, as well as its posterior distribution according to each of the four entity resolution 
models.
lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllDP(0.02)PYP(0.02)NBNB(0.02)NDB(0.02)DP(0.05)PYP(0.05)NBNB(0.05)NDB(0.05)DP(0.1)PYP(0.1)NBNB(0.1)NDB(0.1)3456789Maximum Cluster SizellllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllDP(0.02)PYP(0.02)NBNB(0.02)NDB(0.02)DP(0.05)PYP(0.05)NBNB(0.05)NDB(0.05)DP(0.1)PYP(0.1)NBNB(0.1)NDB(0.1)1.581.601.621.641.661.681.70Mean Cluster SizeDP(0.02)PYP(0.02)NBNB(0.02)NDB(0.02)DP(0.05)PYP(0.05)NBNB(0.05)NDB(0.05)DP(0.1)PYP(0.1)NBNB(0.1)NDB(0.1)2.02.53.03.54.090th Percentile of Cluster 
SizeslllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllDP(0.02)PYP(0.02)NBNB(0.02)NDB(0.02)DP(0.05)PYP(0.05)NBNB(0.05
)NDB(0.05)DP(0.1)PYP(0.1)NBNB(0.1)NDB(0.1)1000120014001600Singleton Clustersllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllDP(0.02)PYP(0.02)NBNB(0.02)NDB(0.02)DP(0.05)PYP(0.05)NBNB(0.05)NDB(0.05)DP(0.1)PYP(0.1)NBNB(0.1)NDB(0.1)246810Maximum Cluster Sizelllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllDP(0.02)PYP(0.02)NBNB(0.02)NDB(0.02)DP(0.05)PYP(0.05)NBNB(0.05)NDB(0.05)DP(0.1)PYP(0.1)NBNB(0.1)NDB(0.1)1.101.151.201.251.301.35Mean Cluster Sizellllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllDP(0.02)PYP(0.02)NBNB(0.02)NDB(0.02)DP(0.05)PYP(0.05)NBNB(0.05)NDB(0.05)DP(0.1)PYP(0.1)NBNB(0.1)NDB(0.1)1.01.21.41.61.82.090th Percentile of Cluster SizeslllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllDP(0.02)PYP(0.02)NBNB(0.02)NDB(0.02)DP(0.05)PYP(0.05)NBNB(0.05)NDB(0.05)DP(0.1)PYP(0.1)NBNB(0.1)NDB(0.1)200025003000Singleton 
Clustersllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllDP(0.02)PYP(0.02)NBNB(0.02)NDB(0.02)DP(0.05)PYP(0.05)NBNB(0.05)NDB(0.05)DP(0.1)PYP(0.1)NBNB(0.1)NDB(0.1)681012141618Maximum Cluster 
SizellllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllDP(0.02)PYP(0.02)NBNB(0.02)NDB(0.02)DP(0.05)PYP(0.05)NBNB(0.05)NDB(0.05)DP(0.1)PYP(0.1)NBNB(0.1)NDB(0.1)1.41.51.61.71.8Mean Cluster 
Sizellllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
Table 1: Entity-resolution summary statistics for each variant of each data set and each of the four models: the posterior expected number of clusters E[K], the posterior standard error, the false negative rate (FNR; lower is better), the false discovery rate (FDR; lower is better), and the posterior expected value of δ.

Data Set    True K   Variant    Model   E[K]       Std. Err.   FNR    FDR    E[δ]
Italy       587      δ = 0.02   DP       594.00      4.51      0.02   0.07   0.03
                                PYP      593.90      4.52      0.02   0.07   0.03
                                NBNB     591.00      4.43      0.04   0.02   0.03
                                NBD      590.50      3.64      0.02   0.03   0.00
                     δ = 0.05   DP       601.60      5.89      0.03   0.13   0.03
                                PYP      601.50      5.90      0.04   0.13   0.03
                                NBNB     596.40      5.79      0.11   0.04   0.04
                                NBD      592.60      5.20      0.04   0.09   0.04
                     δ = 0.1    DP       617.40      7.23      0.07   0.27   0.06
                                PYP      617.40      7.22      0.07   0.27   0.05
                                NBNB     610.90      7.81      0.08   0.24   0.06
                                NBD      596.60      9.37      0.18   0.10   0.05
NLTCS5000   3,061    δ = 0.02   DP      3021.70     24.96      0.03   0.02   0.11
                                PYP     3018.70     25.69      0.03   0.03   0.11
                                NBNB    3037.80     25.18      0.02   0.02   0.07
                                NBD     3028.20      5.65      0.01   0.03   0.09
                     δ = 0.05   DP      3024.00     26.15      0.06   0.05   0.13
                                PYP     3045.80     23.66      0.05   0.05   0.10
                                NBNB    3040.90     24.86      0.05   0.04   0.06
                                NBD     3039.30     10.17      0.03   0.06   0.07
                     δ = 0.1    DP      3130.50     21.44      0.10   0.12   0.09
                                PYP     3115.10     25.73      0.10   0.13   0.10
                                NBNB    3067.30     25.31      0.11   0.11   0.08
                                NBD     3049.10     16.48      0.12   0.09   0.08
Syria2000   1,725    δ = 0.02   DP      1695.20     25.40      0.70   0.07   0.27
                                PYP     1719.70     36.10      0.71   0.04   0.26
                                NBNB    1726.80     27.96      0.70   0.05   0.28
                                NBD     1715.20     51.56      0.67   0.02   0.28
                     δ = 0.05   DP      1701.80     31.15      0.77   0.07   0.31
                                PYP     1742.90     24.33      0.75   0.04   0.32
                                NBNB    1738.30     25.48      0.74   0.04   0.31
                                NBD     1711.40     47.10      0.69   0.03   0.32
                     δ = 0.1    DP      1678.10     40.56      0.81   0.18   0.19
                                PYP     1761.20     39.38      0.81   0.08   0.22
                                NBNB    1779.40     29.84      0.77   0.04   0.26
                                NBD     1757.30     73.60      0.74   0.03   0.25
SyriaSizes  4,075    δ = 0.02   DP      4175.70     66.04      0.65   0.01   0.17
                                PYP     4234.30     68.55      0.64   0.01   0.19
                                NBNB    4108.70     70.56      0.65   0.01   0.19
                                NBD     3979.50     70.85      0.68   0.03   0.20
                     δ = 0.05   DP      4260.00     77.18      0.71   0.02   0.21
                                PYP     4139.10    104.22      0.75   0.04   0.18
                                NBNB    4047.10     55.18      0.73   0.04   0.20
                                NBD     3863.90     68.05      0.75   0.07   0.22
                     δ = 0.1    DP      4507.40     82.27      0.80   0.03   0.19
                                PYP     4540.30    100.53      0.80   0.03   0.20
                                NBNB    4400.60    111.91      0.80   0.03   0.23
                                NBD     4251.90    203.23      0.82   0.04   0.25

References

[1] P. Christen. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.

[2] P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(9), 2012.

[3] W. E. Winkler. Overview of record linkage and current research directions. Technical report, U.S. Bureau of the Census Statistical Research Division, 2006.

[4] R. C. Steorts, R. Hall, and S. E. Fienberg. A Bayesian approach to graphical record linkage and de-duplication. Journal of the American Statistical Association, in press.

[5] R. C. Steorts. Entity resolution with empirically motivated priors. Bayesian Analysis, 10(4):849–875, 2015.

[6] R. C. Steorts, R. Hall, and S. E. Fienberg. SMERED: A Bayesian approach to graphical record linkage and de-duplication. Journal of Machine Learning Research, 33:922–930, 2014.

[7] T. Broderick and R. C. Steorts. Variational Bayes for merging noisy databases. In NIPS 2014 Workshop on Advances in Variational Inference, 2014. arXiv:1410.4792.

[8] S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society Series B, pages 731–792, 1997.

[9] J. W. Miller and M. T. Harrison. Mixture models with a prior on the number of components. arXiv:1502.06241, 2015.

[10] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

[11] H. Ishwaran and L. F. James. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13(4):1211–1236, 2003.

[12] J. F. C. Kingman. The representation of partition structures. Journal of the London Mathematical Society, 2(2):374–380, 1978.

[13] D. Aldous. Exchangeability and related topics. École d'Été de Probabilités de Saint-Flour XIII, 1983, pages 1–198, 1985.

[14] H. M. Wallach, S. Jensen, L. Dicker, and K. A. Heller. An alternative prior process for nonparametric Bayesian clustering. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.

[15] V. F. Kolchin. A problem of the allocation of particles in cells and cycles of random permutations. Theory of Probability & Its Applications, 16(1):74–90, 1971.

[16] J. Pitman. Combinatorial stochastic processes. École d'Été de Probabilités de Saint-Flour XXXII, 2002, 2006.

[17] R. M. Neal. Slice sampling. Annals of Statistics, 31:705–767, 2003.

[18] R. C. Steorts, S. L. Ventura, M. Sadinle, and S. E. Fienberg. A comparison of blocking methods for record linkage. In International Conference on Privacy in Statistical Databases, pages 253–268, 2014.

[19] M. Price, J. Klingner, A. Qtiesh, and P. Ball. Updated statistical analysis of documentation of killings in the Syrian Arab Republic. United Nations Office of the High Commissioner for Human Rights, 2013.

[20] M. Price, J. Klingner, A. Qtiesh, and P. Ball. Updated statistical analysis of documentation of killings in the Syrian Arab Republic. Human Rights Data Analysis Group, Geneva, 2014.