{"title": "Repulsive Mixtures", "book": "Advances in Neural Information Processing Systems", "page_first": 1889, "page_last": 1897, "abstract": "Discrete mixtures are used routinely in broad sweeping applications ranging from unsupervised settings to fully supervised multi-task learning.  Indeed, finite mixtures and infinite mixtures, relying on Dirichlet processes and modifications, have become a standard tool.  One important issue that arises in using discrete mixtures is low separation in the components; in particular, different components can be introduced that are very similar and hence redundant.   Such redundancy leads to too many clusters that are too similar, degrading performance in unsupervised learning and leading to computational problems and an unnecessarily complex model in supervised settings.  Redundancy can arise in the absence of a penalty on components placed close together even when a Bayesian approach is used to learn the number of components.  To solve this problem, we propose a novel prior that generates components from a repulsive process, automatically penalizing redundant components.  We characterize this repulsive prior theoretically and propose a Markov chain Monte Carlo sampling algorithm for posterior computation.  The methods are illustrated using synthetic examples and an iris data set.", "full_text": "Repulsive Mixtures\n\nDepartment of Statistical Science\n\nGatsby Computational Neuroscience Unit\n\nFrancesca Petralia\n\nDuke University\n\nfp12@duke.edu\n\nVinayak Rao\n\nUniversity College London\n\nvrao@gatsby.ucl.ac.uk\n\nDavid B. Dunson\n\nDepartment of Statistical Science\n\nDuke University\n\ndunson@stat.duke.edu\n\nAbstract\n\nDiscrete mixtures are used routinely in broad sweeping applications ranging from\nunsupervised settings to fully supervised multi-task learning. Indeed, \ufb01nite mix-\ntures and in\ufb01nite mixtures, relying on Dirichlet processes and modi\ufb01cations, have\nbecome a standard tool. 
One important issue that arises in using discrete mix-\ntures is low separation in the components; in particular, different components can\nbe introduced that are very similar and hence redundant. Such redundancy leads\nto too many clusters that are too similar, degrading performance in unsupervised\nlearning and leading to computational problems and an unnecessarily complex\nmodel in supervised settings. Redundancy can arise in the absence of a penalty on\ncomponents placed close together even when a Bayesian approach is used to learn\nthe number of components. To solve this problem, we propose a novel prior that\ngenerates components from a repulsive process, automatically penalizing redun-\ndant components. We characterize this repulsive prior theoretically and propose\na Markov chain Monte Carlo sampling algorithm for posterior computation. The\nmethods are illustrated using synthetic examples and an iris data set.\n\nKey Words: Bayesian nonparametrics; Dirichlet process; Gaussian mixture model; Model-based\nclustering; Repulsive point process; Well separated mixture.\n\n1 Introduction\n\nDiscrete mixture models characterize the density of y \u2208 Y \u2282 \u211d^m as\n\nf(y) = \\sum_{h=1}^{k} p_h \\phi(y; \\gamma_h)    (1)\n\nwhere p = (p1, . . . , pk)^T is a vector of probabilities summing to one, and \u03c6(\u00b7; \u03b3) is a kernel de-\npending on parameters \u03b3 \u2208 \u0393, which may consist of location and scale parameters. In analyses of\n\ufb01nite mixture models, a common concern is over-\ufb01tting in which redundant mixture components\nlocated close together are introduced. Over-\ufb01tting can have an adverse impact on predictions and\ndegrade unsupervised learning. In particular, introducing components located close together can\nlead to splitting of well separated clusters into a larger number of closely overlapping clusters. 
Ide-\nally, the criteria for selecting k in a frequentist analysis and the prior on k and {\u03b3h} in a Bayesian\nanalysis should guard against such over-\ufb01tting. However, the impact of the criteria used and prior\nchosen can be subtle.\n\nRecently, [1] studied the asymptotic behavior of the posterior distribution in over-\ufb01tted Bayesian\nmixture models having more components than needed. They showed that a carefully chosen prior\nwill lead to asymptotic emptying of the redundant components. However, several challenging prac-\ntical issues arise. For their prior and in standard Bayesian practice, one assumes that \u03b3h \u223c P0\nindependently a priori. For example, if we consider a \ufb01nite location-scale mixture of multivariate\nGaussians, one may choose P0 to be multivariate Gaussian-inverse Wishart. However, the behavior\nof the posterior can be sensitive to P0 for \ufb01nite samples, with higher variance P0 favoring allocation\nto fewer clusters. In addition, drawing the component-speci\ufb01c parameters from a common prior\ntends to favor components located close together unless the variance is high.\nSensitivity to P0 is just one of the issues. For \ufb01nite samples, the weight assigned to redundant\ncomponents is often substantial. This can be attributed to non- or weak identi\ufb01ability. Each mixture\ncomponent can potentially be split into multiple components having the same parameters. Even\nif exact equivalence is ruled out, it can be dif\ufb01cult to distinguish between models having different\ndegrees of splitting of well-separated components into components located close together. This\nissue can lead to an unnecessarily complex model, and creates dif\ufb01culties in estimating the number\nof components and component-speci\ufb01c parameters. 
Existing strategies, such as the incorporation\nof order constraints, do not adequately address this issue, since it is dif\ufb01cult to choose reasonable\nconstraints in multivariate problems and even with constraints, the components can be close together.\nThe problem of separating components has been studied for Gaussian mixture models ([2]; [3]).\nTwo Gaussians can be separated by placing an arbitrarily chosen lower bound on the distance be-\ntween their means. Separated Gaussians have been mainly utilized to speed up convergence of the\nExpectation-Maximization (EM) algorithm. In choosing a minimal separation level, it is not clear\nhow to obtain a good compromise between values that are too low to solve the problem and ones\nthat are so large that one obtains a poor \ufb01t. To avoid such arbitrary hard separation thresholds, we\ninstead propose a repulsive prior that smoothly pushes components apart.\nIn contrast to the vast majority of the recent Bayesian literature on discrete mixture models, instead\nof drawing the component-speci\ufb01c parameters {\u03b3h} independently from a common prior P0, we\npropose a joint prior for {\u03b31, . . . , \u03b3k} that is chosen to assign low density to \u03b3hs located close\ntogether. The deviation from independence is speci\ufb01ed a priori by a pair of repulsion parameters.\nThe proposed class of repulsive mixture models will only place components close together if it\nresults in a substantial gain in model \ufb01t. As we illustrate, the prior will favor a more parsimonious\nrepresentation of densities, while improving practical performance in unsupervised learning. 
We\nprovide strong theoretical results on rates of posterior convergence and develop Markov chain Monte\nCarlo algorithms for posterior computation.\n\n2 Bayesian repulsive mixture models\n\n2.1 Background on Bayesian mixture modeling\n\nConsidering the \ufb01nite mixture model in expression (1), a Bayesian speci\ufb01cation is completed by\nchoosing priors for the number of components k, the probability weights p, and the component-\nspeci\ufb01c parameters \u03b3 = (\u03b31, . . . , \u03b3k)T . Typically, k is assigned a Poisson or multinomial prior, p a\nDirichlet(\u03b1) prior with \u03b1 = (\u03b11, . . . , \u03b1k)T , and \u03b3h \u223c P0 independently, with P0 often chosen to\nbe conjugate to the kernel \u03c6. Posterior computation can proceed via a reversible jump Markov chain\nMonte Carlo algorithm involving moves for adding or deleting mixture components. Unfortunately,\nin making a k \u2192 k + 1 change in model dimension, ef\ufb01cient moves critically depend on the choice\nof proposal density. [4] proposed an alternate Markov chain Monte Carlo method, which treats the\nparameters as a marked point process, but does not have clear computational advantages relative to\nreversible jump.\nIt has become popular to use over-\ufb01tted mixture models in which k is chosen as a conservative\nupper bound on the number of components under the expectation that only relatively few of the\ncomponents will be occupied by subjects in the sample. From a practical perspective, the success of\nover-\ufb01tted mixture models has been largely due to ease in computation.\nAs motivated in [5], simply letting \u03b1h = c/k for h = 1, . . . , k and a constant c > 0 leads to an\napproximation to a Dirichlet process mixture model for the density of y, which is obtained in the\nlimit as k approaches in\ufb01nity. 
An alternative \ufb01nite approximation to a Dirichlet process mixture is\nobtained by truncating the stick-breaking representation of [6], leading to a similarly simple Gibbs\nsampling algorithm [7]. These approaches are now used routinely in practice.\n\n2.2 Repulsive densities\n\nWe seek a prior on the component parameters in (1) that automatically favors spread out compo-\nnents near the support of the data. Instead of generating the atoms \u03b3h independently from P0, one\ncould generate them from a repulsive process that automatically pushes the atoms apart. This idea\nis conceptually related to the literature on repulsive point processes [8]. In the spatial statistics liter-\nature, a variety of repulsive processes have been proposed. One such model assumes that points are\nclustered spatially, with the cluster centers having a Strauss density [9], that is p(k, \u03b3) \u221d \u03b2^k \u03c1^{r(\u03b3)}\nwhere k is the number of clusters, \u03b2 > 0, 0 < \u03c1 \u2264 1 and r(\u03b3) is the number of pairs of centers that\nlie within a pre-speci\ufb01ed distance r of each other. A possibly unappealing feature is that repulsion\nis not directly dependent on the pairwise distances between the clusters. We propose an alternative\nclass of priors, which smoothly push apart components based on pairwise distances.\nDe\ufb01nition 1. A density h(\u03b3) is repulsive if for any \u03b4 > 0 there is a corresponding \u03b5 > 0 such that\nh(\u03b3) < \u03b4 for all \u03b3 \u2208 \u0393 \\ G_\u03b5, where G_\u03b5 = {\u03b3 : d(\u03b3s, \u03b3j) > \u03b5; s = 1, . . . , k; j < s} and d is a metric.\n\nDepending on the speci\ufb01cation of the metric d(\u03b3s, \u03b3j), a prior satisfying de\ufb01nition 1 may limit over-\n\ufb01tting or favor well separated clusters. When d(\u03b3s, \u03b3j) is the distance between sub-vectors of \u03b3s and\n\u03b3j corresponding to only locations, the proposed prior favors well separated clusters. 
Instead, when\nd(\u03b3s, \u03b3j) is the distance between the sth and jth kernel, a prior satisfying de\ufb01nition 1 limits over-\n\ufb01tting in density estimation. Though both cases can be implemented, in this paper we will focus\nexclusively on the clustering problem. As a convenient class of repulsive priors which smoothly\npush components apart, we propose\n\n\\pi(\\gamma) = c_1 \\left( \\prod_{h=1}^{k} g_0(\\gamma_h) \\right) h(\\gamma),    (2)\n\nwith c1 being the normalizing constant that depends on the number of components k. The proposed\nprior is related to a class of point processes from the statistical physics and spatial statistics\nliterature referred to as Gibbs processes [10]. We assume g0 : \u0393 \u2192 \u211d+ and h : \u0393^k \u2192 [0,\u221e) are\ncontinuous with respect to Lebesgue measure, and h is bounded above by a positive constant c2\nand is repulsive according to de\ufb01nition 1. It follows that density \u03c0 de\ufb01ned in (2) is also repulsive.\nA special hardcore repulsion is produced if the repulsion function is zero when at least one pairwise\ndistance is smaller than a pre-speci\ufb01ed threshold. Such a density implies choosing a minimal\nseparation level between the atoms. As mentioned in the introduction, we avoid such arbitrary\nhard separation thresholds by considering repulsive priors that smoothly push components apart. In\nparticular, we propose two repulsion functions de\ufb01ned as\n\nh(\\gamma) = \\prod_{(s,j) \\in A} g\\{d(\\gamma_s, \\gamma_j)\\}    (3)\n\nh(\\gamma) = \\min_{(s,j) \\in A} g\\{d(\\gamma_s, \\gamma_j)\\}    (4)\n\nwith A = {(s, j) : s = 1, . . . , k; j < s} and g : \u211d+ \u2192 [0, M ] a strictly monotone differentiable\nfunction with g(0) = 0, g(x) > 0 for all x > 0 and M < \u221e. It is straightforward to show that h\nin (3) and (4) is integrable and satis\ufb01es de\ufb01nition 1. The two alternative repulsion functions differ\nin their dependence on the relative distances between components, with all the pairwise distances\nplaying a role in (3), while (4) only depends on the minimal separation. A \ufb02exible choice of g\ncorresponds to\n\ng\\{d(\\gamma_s, \\gamma_j)\\} = \\exp\\left[ -\\tau \\{d(\\gamma_s, \\gamma_j)\\}^{-\\nu} \\right],    (5)\n\nwhere \u03c4 > 0 is a scale parameter and \u03bd is a positive integer controlling the rate at which g approaches\nzero as d(\u03b3s, \u03b3j) decreases. Figure 1 shows contour plots of the prior \u03c0(\u03b31, \u03b32) de\ufb01ned as (2) with\ng0 being the standard normal density, the repulsive function de\ufb01ned as (3) or (4) and g de\ufb01ned as\n(5) for different values of (\u03c4, \u03bd). As \u03c4 and \u03bd increase, the prior increasingly favors well separated\ncomponents.\n\nFigure 1: Contour plots of the repulsive prior \u03c0(\u03b31, \u03b32) under (2), with h as in either (3) or (4) and g as in (5), with\nhyperparameters (\u03c4, \u03bd) equal to (I) (1, 2), (II) (1, 4), (III) (5, 2) and (IV) (5, 4)\n\n2.3 Theoretical properties\n\nLet the true density f0 : \u211d^m \u2192 \u211d+ be de\ufb01ned as f0 = \u2211_{h=1}^{k0} p0h \u03c6(\u03b30h) with \u03b30h \u2208 \u0393 and the \u03b30j\nsuch that there exists an \u03b51 > 0 such that min{(s,j):s<j} d(\u03b30s, \u03b30j) \u2265 \u03b51 with d being the Euclidean\ndistance. Let f = \u2211_{h=1}^{k} ph \u03c6(\u03b3h) with \u03b3h \u2208 \u0393. Let \u03b3 \u223c \u03c0 with \u03b3 = (\u03b31, . . . , \u03b3k)^T and \u03c0 satisfying\nde\ufb01nition 1. Let p \u223c \u03bb with \u03bb = Dirichlet(\u03b1) and k \u223c \u00b5 with \u00b5(k = k0) > 0. Let \u03b8 = (p, \u03b3).\nThese assumptions on f0 and f will be referred to as condition B0. Let \u03a0 be the prior induced on\n\u222a_{k=1}^{\u221e} Fk, where Fk is the space of all distributions de\ufb01ned as (1).\nWe will focus on \u03b3 being a location parameter, though the results can be extended to location-scale\nkernels. Let | \u00b7 |1 denote the L1 norm and KL(f0, f ) = \u222b f0 log(f0/f ) refer to the Kullback-\nLeibler (K-L) divergence between f0 and f. Density f0 belongs to the K-L support of the prior \u03a0\nif \u03a0{f : KL(f0, f ) < \u03b5} > 0 for all \u03b5 > 0. The next lemma provides suf\ufb01cient conditions under\nwhich the true density is in the K-L support of the prior.\nLemma 1. Assume condition B0 is satis\ufb01ed with m = 1. Let D0 be a compact set containing\nparameters (\u03b301, . . . , \u03b30k0 ). Suppose \u03b3 \u223c \u03c0 with \u03c0 satisfying de\ufb01nition 1. Let \u03c6 and \u03c0 satisfy the\nfollowing conditions:\nA1. for any y \u2208 Y, the map \u03b3 \u2192 \u03c6(y; \u03b3) is uniformly continuous\nA2. for any y \u2208 Y, \u03c6(y; \u03b3) is bounded above by a constant\nA3. \u222b f0 |log{sup_{\u03b3\u2208D0} \u03c6(\u03b3)} \u2212 log{inf_{\u03b3\u2208D0} \u03c6(\u03b3)}| < \u221e\nA4. \u03c0 is continuous with respect to Lebesgue measure and for any vector x \u2208 \u0393^k with\nmin{(s,j):s<j} d(xs, xj) \u2265 \u03c5 for some \u03c5 > 0 there is a \u03b4 > 0 such that \u03c0(\u03b3) > 0 for all \u03b3\nsatisfying ||\u03b3 \u2212 x||1 < \u03b4\nThen f0 is in the K-L support of the prior \u03a0.\n\nLemma 2. The repulsive density in (2) with h de\ufb01ned as either (3) or (4) satis\ufb01es condition A4 in\nlemma 1.\n\nThe next lemma formalizes the posterior rate of concentration for univariate location mixtures of\nGaussians.\nLemma 3. Let condition B0 be satis\ufb01ed, let m = 1 and \u03c6 be the normal kernel depending on a\nlocation parameter \u03b3 and a scale parameter \u03c3. Assume that condition (i), (ii) and (iii) of theorem\n3.1 in [11] and assumption A4 in lemma 1 are satis\ufb01ed. 
Furthermore, assume that\nC1) the joint density \u03c0 leads to exchangeable random variables and for all k the marginal density\nof the location parameter \u03b31 satis\ufb01es \u03c0_m(|\u03b31| \u2265 t) \u2272 exp(\u2212q1 t^2) for a given q1 > 0\nC2) there are constants u1, u2, u3 > 0, possibly depending on f0, such that for any \u03b5 \u2264 u3\n\n\u03c0(||\u03b3 \u2212 \u03b30||1 \u2264 \u03b5) \u2265 u1 exp(\u2212u2 k0 log(1/\u03b5))\n\nThen the posterior rate of convergence relative to the L1 metric is \u03b5n = n^{\u22121/2} log n.\nLemma 3 is essentially a modi\ufb01cation of theorem 3.1 in [11] to the proposed repulsive mixture\nmodel. Lemma 4 gives suf\ufb01cient conditions for \u03c0 to satisfy condition C1 and C2 in lemma 3.\nLemma 4. Let \u03c0 be de\ufb01ned as (2) and h be de\ufb01ned as either (3) or (4), then \u03c0 satis\ufb01es condition\nC2 in lemma 3. Furthermore, if for a positive constant n1 the function g0 satis\ufb01es g0(|x| \u2265 t) \u2272\nexp(\u2212n1 t^2), \u03c0 satis\ufb01es condition C1 in lemma 3.\nAs motivated above, when the number of mixture components is chosen to be unnecessarily large, it\nis appealing for the posterior distribution of the weights of the extra components to be concentrated\nnear zero. Theorem 1 formalizes the rate of concentration with increasing sample size n. One\nof the main assumptions required in theorem 1 is that the posterior rate of convergence relative to\nthe L1 metric is \u03b4n = n^{\u22121/2}(log n)^q with q \u2265 0. We provided the contraction rate, under the\nproposed prior speci\ufb01cation and univariate Gaussian kernel, in lemma 3. However, theorem 1 is a\nmore general statement and it applies to multivariate mixture density of any kernel.\nTheorem 1. Let assumptions B0 \u2212 B5 be satis\ufb01ed. Let \u03c0 be de\ufb01ned as (2) and h be de\ufb01ned as\neither (3) or (4). 
If \u00af\u03b1 = max(\u03b11, . . . , \u03b1k) < m/2 and for positive constants r1, r2, r3 the function\ng satis\ufb01es g(x) \u2264 r1 x^{r2} for 0 \u2264 x < r3 then\n\n\\lim_{M \\to \\infty} \\limsup_{n \\to \\infty} E_n^0 \\left[ P \\left\\{ \\min_{\\sigma \\in S_k} \\left( \\sum_{i=k_0+1}^{k} p_{\\sigma(i)} \\right) > M n^{-1/2} (\\log n)^{q(1 + s(k_0,\\alpha)/s_{r_2})} \\right\\} \\right] = 0\n\nwith s(k0, \u03b1) = k0 \u2212 1 + mk0 + \u00af\u03b1(k \u2212 k0), s_{r2} = r2 + m/2 \u2212 \u00af\u03b1 and Sk the set of all possible\npermutations of {1, . . . , k}.\nAssumptions (B1 \u2212 B5) can be found in the supplementary material. Theorem 1 is a modi\ufb01cation\nof theorem 1 in [1] to the proposed repulsive mixture model. Theorem 1 implies that the posterior\nexpectation of weights of the extra components is of order O{n^{\u22121/2}(log n)^{q(1+s(k0,\u03b1)/s_{r2})}}. When\ng is de\ufb01ned as (5), parameters r1 and r2 can be chosen such that r1 = \u03c4 and r2 = \u03bd.\nWhen the number of components is unknown, with only an upper bound known, the posterior rate\nof convergence is equivalent to the parametric rate n^{\u22121/2} [12]. In this case, the rate in theorem 1\nis n^{\u22121/2} under usual priors or the repulsive prior. However, in our experience using usual priors,\nthe sum of the extra components can be substantial in small to moderate sample sizes, and often\nhas high variability. As we show in Section 3, for repulsive priors the sum of the extra component\nweights is close to zero and has small variance for small as well as large sample sizes. On the\nother hand, when an upper bound on the number of components is unknown, the posterior rate of\nconcentration is n^{\u22121/2}(log n)^q with q > 0. 
In this case, according to theorem 1, using the proposed\nprior speci\ufb01cation the logarithmic factor in theorem 1 of [1] can be improved.\n\n2.4 Parameter calibration and posterior computation\n\nThe parameters involved in the repulsion function h are chosen such that a priori, with high proba-\nbility, the clusters will be adequately separated. Consider the case where \u03c6 is a location-scale kernel\nwith location and scale parameters (\u03b3, \u03a3) and is symmetric about \u03b3. Here, it is natural to relate\nthe separation of two densities to the distance between their location parameters. The following\nde\ufb01nition introduces the concept of separation level between two densities.\nDe\ufb01nition 2. Let f1 and f2 be two densities having location-scale parameters (\u03b31, \u03a31) and (\u03b32, \u03a32)\nrespectively, with \u03b31, \u03b32 \u2208 \u0393 and \u03a31, \u03a32 \u2208 \u2126. Given a metric t(\u00b7,\u00b7), a positive constant c and a\nfunction \u03c9 : \u2126 \u00d7 \u2126 \u2192 \u211d+, f1 and f2 are c-separated if\n\nt(\u03b31, \u03b32) \u2265 c \u03c9(\u03a31, \u03a32)^{1/2}\n\nFigure 2: (I) Student\u2019s t density, (II) two-component mixture of poorly (solid) and well separated\n(dot-dash) Gaussian densities, referred to as (IIa, IIb), (III) mixture of poorly (dot-dash) and well\nseparated (solid) Gaussian and Pearson densities, referred to as (IIIa, IIIb), (IV) two-component\nmixture of two-dimensional non-spherical Gaussians\n\nDe\ufb01nition 2 is in the spirit of [2] but generalized to any symmetric location-scale kernel. A mixture\nof k densities is c-separated if all pairs of densities are c-separated. The parameters of the repulsion\nfunction, (\u03c4, \u03bd), will be chosen such that, for an a priori chosen separation level c, de\ufb01nition 2\nis satis\ufb01ed with high probability. 
In practice, for a given pair (\u03c4, \u03bd), we estimate the probability\nof pairwise c-separation empirically by simulating N replicates of (\u03b3h, \u03a3h) for each component\nh = 1, . . . , k from the prior. The appropriate values (\u03c4, \u03bd) are obtained by starting with small values,\nand increasing until the pre-speci\ufb01ed pairwise c-separated probability is reached. In practice, only \u03c4\nwill be calibrated to reach a particular probability value. This is because \u03bd controls the rate at which\nthe density tends to zero as two components approach but not the separation level across them. In\npractice we have found that \u03bd = 2 provides a good default value and we \ufb01x \u03bd at this value in all our\napplications below.\nA possible issue with the proposed repulsive mixture prior is that the full conditionals are nonstan-\ndard, complicating posterior computation. To address this, we propose a data augmentation scheme,\nintroducing auxiliary slice variables to facilitate sampling [13]. This algorithm is straightforward\nto implement and is ef\ufb01cient by MCMC standards. Further details can be found in the supplemen-\ntary material. It will be interesting in future work to develop fast approximations to MCMC for\nimplementation of repulsive mixture models, such as variational methods for approximating the full\nposterior and optimization methods for obtaining a maximum a posteriori estimate. The latter ap-\nproach would provide an alternative to usual maximum likelihood estimation via the EM algorithm,\nwhich provides a penalty on components located close together.\n\n3 Synthetic examples\n\nSynthetic toy examples were considered to assess the performance of the repulsive prior in density\nestimation, classi\ufb01cation and emptying the extra components. Figure 2 plots the true densities in the\nvarious synthetic cases that we considered. 
For each synthetic dataset, repulsive and non-repulsive\nmixture models were compared considering a \ufb01xed upper bound on the number of components; extra\ncomponents should be assigned small probabilities and hence effectively excluded. The auxiliary\nvariable sampler was run for 10,000 iterations with a burn-in of 5,000. The chain was thinned by\nkeeping every 10th simulated draw. To overcome the label switching problem, the samples were\npost-processed following the algorithm of [14]. Details on parameters involved in the true densities\nand choice of prior distributions can be found in the supplementary material.\nTable 1 shows summary statistics of the K-L divergence, the misclassi\ufb01cation error and the sum of\nextra weights under repulsive and non-repulsive mixtures with six mixture components as the upper\nbound. Table 1 shows also the misclassi\ufb01cation error resulting from hierarchical clustering [15]. In\npractice, observations drawn from the same mixture component were considered as belonging to the\nsame category and for each dataset a similarity matrix was constructed. The misclassi\ufb01cation error\nwas established in terms of divergence between the true similarity matrix and the posterior similarity\nmatrix. As shown in table 1, the K-L divergences under repulsive and non-repulsive mixtures\nbecome more similar as the sample size increases. For smaller sample sizes, the results are more\nsimilar when components are very well separated. Since a repulsive prior tends to discourage over-\nlapping mixture components, a repulsive model might not estimate the density quite as accurately\nwhen a mixture of closely overlapping components is needed. 
However, as the sample size increases,\nthe \ufb01tted density approaches the true density regardless of the degree of closeness among clusters.\nAgain, though repulsive and non-repulsive mixtures perform similarly in estimating the true density,\nrepulsive mixtures place considerably less probability on extra components leading to more inter-\npretable clusters. In terms of misclassi\ufb01cation error, the repulsive model outperforms the other two\napproaches while, in most cases, the worst performance was obtained by the non-repulsive model.\nPotentially, one may favor fewer clusters, and hence possibly better separated clusters, by penalizing\nthe introduction of new clusters more through modifying the precision in the Dirichlet prior for the\nweights; in the supplemental materials, we demonstrate that this cannot solve the problem.\n\nTable 1: Mean and standard deviation of K-L divergence, misclassi\ufb01cation error and sum of extra\nweights resulting from non-repulsive (N-R) and repulsive (R) mixtures with a maximum number of\nclusters equal to six under different synthetic data scenarios. Entries are mean (standard deviation); HCT rows report a single value.\n\nn=100\nMetric                Model  I            IIa          IIb          IIIa         IIIb         IV\nK-L divergence        N-R    0\u00b705 (0\u00b703)  0\u00b703 (0\u00b701)  0\u00b707 (0\u00b702)  0\u00b705 (0\u00b702)  0\u00b708 (0\u00b703)  0\u00b722 (0\u00b704)\nK-L divergence        R      0\u00b703 (0\u00b702)  0\u00b708 (0\u00b702)  0\u00b709 (0\u00b703)  0\u00b707 (0\u00b703)  0\u00b709 (0\u00b703)  0\u00b724 (0\u00b704)\nMisclassi\ufb01cation     HCT    0\u00b712         0\u00b711         0\u00b741         0\u00b712         0\u00b778         0\u00b721\nMisclassi\ufb01cation     N-R    0\u00b768 (0\u00b709)  0\u00b726 (0\u00b710)  0\u00b706 (0\u00b705)  0\u00b717 (0\u00b709)  0\u00b705 (0\u00b706)  0\u00b713 (0\u00b705)\nMisclassi\ufb01cation     R      0\u00b706 (0\u00b705)  0\u00b709 (0\u00b704)  0\u00b700 (0\u00b702)  0\u00b705 (0\u00b703)  0\u00b700 (0\u00b701)  0\u00b709 (0\u00b703)\nSum of extra weights  N-R    0\u00b721 (0\u00b711)  0\u00b730 (0\u00b710)  0\u00b709 (0\u00b707)  0\u00b716 (0\u00b709)  0\u00b707 (0\u00b707)  0\u00b713 (0\u00b707)\nSum of extra weights  R      0\u00b701 (0\u00b701)  0\u00b701 (0\u00b701)  0\u00b701 (0\u00b701)  0\u00b701 (0\u00b701)  0\u00b701 (0\u00b701)  0\u00b708 (0\u00b705)\n\nn=1000\nMetric                Model  I            IIa          IIb          IIIa         IIIb         IV\nK-L divergence        N-R    0\u00b700 (0\u00b700)  0\u00b701 (0\u00b700)  0\u00b701 (0\u00b700)  0\u00b700 (0\u00b700)  0\u00b701 (0\u00b700)  0\u00b702 (0\u00b700)\nK-L divergence        R      0\u00b701 (0\u00b700)  0\u00b701 (0\u00b700)  0\u00b701 (0\u00b700)  0\u00b701 (0\u00b700)  0\u00b701 (0\u00b700)  0\u00b703 (0\u00b700)\nMisclassi\ufb01cation     HCT    0\u00b745         0\u00b742         0\u00b714         0\u00b742         0\u00b709         0\u00b720\nMisclassi\ufb01cation     N-R    0\u00b765 (0\u00b711)  0\u00b724 (0\u00b708)  0\u00b703 (0\u00b704)  0\u00b714 (0\u00b708)  0\u00b702 (0\u00b703)  0\u00b719 (0\u00b702)\nMisclassi\ufb01cation     R      0\u00b705 (0\u00b705)  0\u00b708 (0\u00b702)  0\u00b700 (0\u00b702)  0\u00b703 (0\u00b703)  0\u00b700 (0\u00b701)  0\u00b718 (0\u00b701)\nSum of extra weights  N-R    0\u00b730 (0\u00b711)  0\u00b721 (0\u00b711)  0\u00b703 (0\u00b704)  0\u00b716 (0\u00b710)  0\u00b703 (0\u00b703)  0\u00b729 (0\u00b703)\nSum of extra weights  R      0\u00b701 (0\u00b701)  0\u00b700 (0\u00b700)  0\u00b700 (0\u00b700)  0\u00b700 (0\u00b700)  0\u00b700 (0\u00b700)  0\u00b726 (0\u00b703)\n\n4 Real data\n\nWe assessed the clustering performance of the proposed method on a real dataset. This dataset\nconsists of 150 observations from three different species of iris each with four measurements. This\ndataset was previously analyzed by [16] and [17] proposing new methods to estimate the number of\nclusters based on minimizing loss functions. 
They concluded the optimal number of clusters was\ntwo. This result did not agree with the number of species due to low separation in the data between\ntwo of the species. Such point estimates of the number of clusters do not provide a characterization\nof uncertainty in clustering in contrast to Bayesian approaches.\nRepulsive and non-repulsive mixtures were \ufb01tted under different choices of upper bound on the\nnumber of components. Since the data contains three true biological clusters, with two of these\nhaving similar distributions of the available features, we would expect the posterior to concen-\ntrate on two or three components. Posterior means and standard deviations of the three highest\nweights were (0\u00b730, 0\u00b723, 0\u00b713) and (0\u00b705, 0\u00b704, 0\u00b704) for non-repulsive and (0\u00b760, 0\u00b730, 0\u00b704) and\n(0\u00b704, 0\u00b703, 0\u00b702) for repulsive under six components. Clearly, repulsive priors lead to a posterior\nmore concentrated on two components, and assign low probability to more than three components.\n\nFigure 3: Posterior density of the total probability weight assigned to more than three components\nin the Iris data under a max of 6 or 10 components for non-repulsive (6:solid, 10:dash-dot) and\nrepulsive (6:dash, 10:dot) mixtures.\n\nFigure 3 shows the density of the total probability assigned to the extra components. This quantity\nwas computed considering the number of species as the true number of clusters. According to\n\ufb01gure 3, our repulsive prior speci\ufb01cation leads to extra component weights very close to zero\nregardless of the upper bound on the number of components. The posterior uncertainty is also\nsmall. 
Non-repulsive mixtures assign large weight to extra components, with posterior uncertainty\nincreasing considerably as the number of components increases.\n\nDiscussions\n\nWe have proposed a new repulsive mixture modeling framework, which should lead to substantially\nimproved unsupervised learning (clustering) performance in general applications. A key aspect is\nsoft penalization of components located close together to favor, without sharply enforcing, well sep-\narated clusters that should be more likely to correspond to the true missing labels. We have focused\non Bayesian MCMC-based methods, but there are numerous interesting directions for ongoing re-\nsearch, including fast optimization-based approaches for learning mixture models with repulsive\npenalties.\n\nAcknowledgments\n\nThis research was partially supported by grant 5R01-ES-017436-04 from the National Institute of\nEnvironmental Health Sciences (NIEHS) of the National Institutes of Health (NIH) and DARPA\nMSEE.\n\nReferences\n[1] J. Rousseau and K. Mengersen. Asymptotic Behaviour of the Posterior Distribution in Over-Fitted Models. Journal of the Royal Statistical Society B, 73:689\u2013710, 2011.\n[2] S. Dasgupta. Learning Mixtures of Gaussians. Proceedings of the 40th Annual Symposium on Foundations of Computer Science, pages 633\u2013644, 1999.\n[3] S. Dasgupta and L. Schulman. A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians. The Journal of Machine Learning Research, 8:203\u2013226, 2007.\n[4] M. Stephens. Bayesian Analysis of Mixture Models with an Unknown Number of Components - An Alternative to Reversible Jump Methods. The Annals of Statistics, 28:40\u201374, 2000.\n[5] H. Ishwaran and M. Zarepour. Dirichlet Prior Sieves in Finite Normal Mixtures. Statistica Sinica, 12:941\u2013963, 2002.\n[6] J. Sethuraman. A Constructive Definition of Dirichlet Priors. 
Statistica Sinica, 4:639\u2013650, 1994.\n[7] H. Ishwaran and L. F. James. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association, 96:161\u2013173, 2001.\n[8] M. L. Huber and R. L. Wolpert. Likelihood-Based Inference for Matern Type-III Repulsive Point Processes. Advances in Applied Probability, 41:958\u2013977, 2009.\n[9] A. Lawson and A. Clark. Spatial Cluster Modeling. Chapman & Hall CRC, London, UK, 2002.\n[10] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes. Springer, 2008.\n[11] Catia Scricciolo. Posterior Rates of Convergence for Dirichlet Mixtures of Exponential Power Densities. Electronic Journal of Statistics, 5:270\u2013308, 2011.\n[12] H. Ishwaran, L. F. James, and J. Sun. Bayesian Model Selection in Finite Mixtures by Marginal Density Decompositions. Journal of the American Statistical Association, 96:1316\u20131332, 2001.\n[13] Paul Damien, Jon Wake\ufb01eld, and Stephen Walker. Gibbs Sampling for Bayesian Non-Conjugate and Hierarchical Models by Using Auxiliary Variables. Journal of the Royal Statistical Society B, 61:331\u2013344, 1999.\n[14] M. Stephens. Dealing with label switching in mixture models. Journal of the Royal Statistical Society B, 62:795\u2013810, 2000.\n[15] H. Locarek-Junge and C. Weihs. Classi\ufb01cation as a Tool for Research. Springer, 2009.\n[16] C. Sugar and G. James. Finding the number of clusters in a data set: an information theoretic approach. Journal of the American Statistical Association, 98:750\u2013763, 2003.\n[17] J. Wang. Consistent selection of the number of clusters via crossvalidation. Biometrika, 97:893\u2013904, 2010.\n", "award": [], "sourceid": 940, "authors": [{"given_name": "Francesca", "family_name": "Petralia", "institution": null}, {"given_name": "Vinayak", "family_name": "Rao", "institution": null}, {"given_name": "David", "family_name": "Dunson", "institution": null}]}