{"title": "The Infinite Mixture of Infinite Gaussian Mixtures", "book": "Advances in Neural Information Processing Systems", "page_first": 28, "page_last": 36, "abstract": "Dirichlet process mixture of Gaussians (DPMG) has been used in the literature for clustering and density estimation problems. However, many real-world data exhibit cluster distributions that cannot be captured by a single Gaussian. Modeling such data sets by DPMG creates several extraneous clusters even when clusters are relatively well-defined. Herein, we present the infinite mixture of infinite Gaussian mixtures (I2GMM) for more flexible modeling of data sets with skewed and multi-modal cluster distributions. Instead of using a single Gaussian for each cluster as in the standard DPMG model, the generative model of I2GMM uses a single DPMG for each cluster. The individual DPMGs are linked together through centering of their base distributions at the atoms of a higher level DP prior. Inference is performed by a collapsed Gibbs sampler that also enables partial parallelization. Experimental results on several artificial and real-world data sets suggest the proposed I2GMM model can predict clusters more accurately than existing variational Bayes and Gibbs sampler versions of DPMG.", "full_text": "The In\ufb01nite Mixture of In\ufb01nite Gaussian Mixtures\n\nHalid Z. Yerebakan\n\nDepartment of\n\nComputer and Information Science\n\nIUPUI\n\nIndianapolis, IN 46202\n\nhzyereba@cs.iupui.edu\n\nBartek Rajwa\n\nBindley Bioscience Center\n\nPurdue University\n\nW. Lafayette, IN 47907\n\nrajwa@cyto.purdue.edu\n\nMurat Dundar\nDepartment of\n\nComputer and Information Science\n\nIUPUI\n\nIndianapolis, IN 46202\n\ndundar@cs.iupui.edu\n\nAbstract\n\nDirichlet process mixture of Gaussians (DPMG) has been used in the literature for\nclustering and density estimation problems. However, many real-world data ex-\nhibit cluster distributions that cannot be captured by a single Gaussian. 
Modeling\nsuch data sets by DPMG creates several extraneous clusters even when clusters are\nrelatively well-de\ufb01ned. Herein, we present the in\ufb01nite mixture of in\ufb01nite Gaus-\nsian mixtures (I2GMM) for more \ufb02exible modeling of data sets with skewed and\nmulti-modal cluster distributions. Instead of using a single Gaussian for each clus-\nter as in the standard DPMG model, the generative model of I2GMM uses a single\nDPMG for each cluster. The individual DPMGs are linked together through cen-\ntering of their base distributions at the atoms of a higher level DP prior. Inference\nis performed by a collapsed Gibbs sampler that also enables partial paralleliza-\ntion. Experimental results on several arti\ufb01cial and real-world data sets suggest\nthe proposed I2GMM model can predict clusters more accurately than existing\nvariational Bayes and Gibbs sampler versions of DPMG.\n\n1\n\nIntroduction\n\nThe traditional approach to \ufb01tting a Gaussian mixture model onto the data involves using the well-\nknown expectation-maximization algorithm to estimate component parameters [7]. The major lim-\nitation of this approach is the need to de\ufb01ne the number of clusters in advance. Although there are\nseveral ways to predict the number of clusters in a data set in an of\ufb02ine manner, these techniques\nare in general suboptimal as they decouple the two interdependent tasks: predicting the number of\nclusters and predicting model parameters.\nDirichlet process mixture of Gaussians (DPMG), also known as the in\ufb01nite Gaussian mixture model\n(IGMM), is a Gaussian mixture model (GMM) with a Dirichlet process (DP) prior de\ufb01ned over\nmixture components [8]. Unlike traditional mixture modeling, DPMG predicts the number of clus-\nters while simultaneously performing model inference. In the DPMG model the number of clusters\ncan arbitrarily grow to better accommodate data as needed. 
DPMG in general works well when\nthe clusters are well-defined with Gaussian-like distributions. When the distributions of clusters are\nheavy-tailed, skewed, or multi-modal, multiple mixture components per cluster may be needed for\nmore accurate modeling of cluster data. Since there is no dependency structure in DPMG to associate\nmixture components with clusters, additional mixture components produced during inference\nare all treated as independent clusters. This results in a suboptimal clustering of underlying data.\nWe propose the infinite mixture of IGMMs (I2GMM) for more accurate clustering of data sets exhibiting\nskewed and multi-modal cluster distributions. The underlying generative model of I2GMM\nemploys a different DPMG for each cluster. A dependency structure is imposed across individual\nDPMGs through centering of their base distributions at one of the atoms of the higher level DP.\nThis way, individual cluster data are modeled by lower level DPs using one DPMG for each cluster,\nand atoms defining the base distributions of individual clusters and cluster proportions are modeled\nby the higher level DP. Our model allows sharing of the covariance matrices across mixture components\nof the same DPMG. The data model, which is conjugate to the base distributions of both\nhigher and lower level DPs, makes obtaining closed form solutions of posterior predictive distributions\npossible. We use a collapsed Gibbs sampler scheme for inference. Each scan of the Gibbs\nsampler involves two loops: one that iterates over individual data instances to sample component\nindicator variables and another that iterates over components to sample cluster indicator variables. 
Conditioned on the cluster indicator variables, component indicator variables can be sampled\nin a parallel fashion, which significantly speeds up inference under certain circumstances.\n\n2 Related Work\n\nDependent Dirichlet processes (DDP) have been studied in the literature for modeling collections\nof distributions that vary in time, in spatial region, in covariate space, or in grouped data settings\n(images, documents, biological samples). Previous work most related to the current work involves\nstudies that investigate DDP in grouped data settings.\nTeh et al. use a hierarchical DP (HDP) prior over the base distributions of individual DP models to\nintroduce a sharing mechanism that allows for sharing of atoms across multiple groups [15]. When\neach group's data is modeled by a different DPMG, this allows for sharing of the same mean vector\nand covariance matrix across multiple groups. Such a dependency may potentially be useful in a\nmulti-group setting. However, when all data are contained in a single group, as in the current study,\nsharing the same mixture component across multiple cluster distributions leads to shared mixture\ncomponents being statistically unidentifiable.\nThe HDP-RE model by Kim & Smyth [10] and the transformed DP by Sudderth et al. [14] relax the\nexact sharing imposed by HDP to have a dependency structure between multiple groups that allows\nfor components to share perturbed copies of atoms. Although such a sharing mechanism may be\nuseful for modeling random variations in component parameters across multiple groups, it is not\nvery useful for clustering data sets with skewed and multi-modal distributions. Both HDP-RE and\ntransformed DP still model each group's data by a single DPMG and suffer from the same drawbacks\nas DPMG when clustering data sets with skewed and multi-modal distributions.\nThe nested Dirichlet process (nDP) by Rodriguez et al. [13] is a DP whose base distribution is in\nturn another DP. 
This model is introduced for modeling multi-group data sets where groups share\nnot just individual mixture components as in HDP but the entire mixture model defined by a DPMG.\nnDP can be adapted to single group data sets with multiple clusters but with the restriction that\neach DPMG is shared only once to ensure identifiability. Such a restriction practically eliminates\ndependencies across DPMGs modeling different clusters and would not offer a clustering property at\nthe group level.\nUnlike existing work, which creates dependencies across multiple DPMGs through exact or perturbed\nsharing of mixture components or through sharing of the entire mixture model, the proposed I2GMM\nmodel associates each cluster with a distinct atom of the higher level DP through centering of the\nbase distribution of the corresponding DPMG at that atom. Thus, the higher level DP defines meta-\nclusters whereas lower level DPs model actual cluster data. Mixture components associated with\nthe same DPMG have their own mean vectors but share the same covariance matrix. Apart from\npreserving the conjugacy of the data model, covariance sharing across mixture components of the\nsame DPMG allows for identification of clusters that differ in cluster shapes even when they are not\nwell separated by their means.\n\n3 Dirichlet Process Mixture\n\nA Dirichlet process is a distribution over discrete distributions. It is parameterized by a concentration\nparameter \u03b1 and a base distribution H, and is denoted by DP(\u03b1H). Each probability mass in a sample\ndiscrete distribution is called an atom. According to the stick-breaking construction of DP [9], each\nsample from a DP can be considered as a collection of a countably infinite number of atoms. In this\nrepresentation the base distribution is a prior over the locations of the atoms and the concentration parameter\naffects the distribution of the atom weights, i.e., stick lengths. 
Another popular characterization\nof DP includes the Chinese restaurant process (CRP) [3], which we utilize during model inference.\nThe discrete nature of its samples makes DP suitable as a prior distribution over mixture weights in\nmixture models. Although samples from DP are defined by an infinite dimensional discrete distribution,\nthe posterior distribution conditioned on a finite data set always uses a finite number of mixture\ncomponents.\nWe denote each data instance by $x_i \\in \\mathbb{R}^d$ where $i \\in \\{1, ..., n\\}$ and n is the total number of data\ninstances. For each instance, $\\theta_i$ indicates the set of parameters from which the instance is sampled.\nFor the Gaussian data model $\\theta_i = \\{\\mu_i, \\Sigma_i\\}$ where $\\mu_i$ denotes the mean vector and $\\Sigma_i$ the covariance\nmatrix. The generative model of the Dirichlet Process Gaussian Mixture is given by (1).\n\n$$\\begin{aligned} x_i &\\sim p(x_i \\mid \\theta_i) \\\\ \\theta_i &\\sim G \\\\ G &\\sim DP(\\alpha H) \\end{aligned} \\qquad (1)$$\n\nOwing to the discreteness of the distribution G, the $\\theta_i$ corresponding to different instances will not be\nall distinct. It is this property of DP that offers clustering over $\\theta_i$ and in turn over data instances.\nChoosing H from a family of distributions conjugate to the Gaussian distribution produces a closed-\nform solution for the posterior predictive distribution of DPMG. The bivariate prior over the atoms\nof G is defined in (2).\n\n$$H = NIW(\\mu_0, \\Sigma_0, \\kappa_0, m) = N\\left(\\mu \\mid \\mu_0, \\frac{\\Sigma}{\\kappa_0}\\right) \\times W^{-1}(\\Sigma \\mid \\Sigma_0, m) \\qquad (2)$$\n\nwhere $\\mu_0$ is the prior mean and $\\kappa_0$ is a scaling constant that controls the deviation of the mean\nvectors from the prior mean. The parameter $\\Sigma_0$ is the scaling matrix and m is the degrees of freedom.\nThe posterior predictive distribution for a Gaussian data model and NIW prior can be obtained\nby integrating out $\\mu$ and $\\Sigma$ analytically. 
Integrating out $\\mu$ and $\\Sigma$ leaves us with the component\nindicator variables $t_i$ for each instance $x_i$ as the only random variables in the state space. Using the\nCRP representation of DP, the $t_i$ can be sampled as in (3).\n\n$$p(t_i = k \\mid X, t_{-i}) \\propto \\begin{cases} \\alpha p(x_i) & \\text{if } k = K + 1 \\\\ n_k^{-i} p(x_i \\mid A_k^{-i}, \\bar{x}_k^{-i}) & \\text{if } k \\le K \\end{cases} \\qquad (3)$$\n\nwhere $p(x_i)$ and $p(x_i \\mid A_k, \\bar{x}_k)$ denote the posterior predictive distributions for an empty and occupied\ncomponent, respectively, both of which are multivariate Student-t distributions. X and t\ndenote the sets of all data instances and their corresponding indicator variables, respectively. $n_k$ is\nthe number of data instances in component k. $A_k$ and $\\bar{x}_k$ are the scatter matrix and sample mean\nfor component k, respectively. The superscript $-i$ notation indicates the exclusion of the effect of\ninstance i from the corresponding variable. Inference for DPMG can also be performed using the\nstick-breaking representation of DP with the actual inference performed either by a Gibbs sampler\nor through variational Bayes [5, 11].\n\n4 The Infinite Mixture of Infinite Gaussian Mixture Models\n\nWhen modeling data sets containing skewed and multi-modal clusters, DPMG tends to produce\nmultiple components for each cluster. Owing to the single-layer structure of DPMG, no direct associations\namong different components of the same cluster can be made. As a result of this limitation\nall components are treated as independent clusters, resulting in a situation where the number of clusters\nis overpredicted and the actual cluster data are split into multiple subclusters. 
A more flexible\nmodel for clustering data sets with skewed and multi-modal clusters can be obtained using a two-layer\ngenerative model as in (4).\n\n$$\\begin{aligned} x_i &\\sim N(x_i \\mid \\mu_i, \\Sigma_j) \\\\ \\mu_i &\\sim G_j \\\\ G_j &\\sim DP(\\alpha H_j) \\\\ H_j &= N(\\mu_j, \\Sigma_j / \\kappa_1) \\\\ (\\mu_j, \\Sigma_j) &\\sim G \\\\ G &\\sim DP(\\gamma H) \\\\ H &= NIW(\\mu_0, \\Sigma_0, \\kappa_0, m) \\end{aligned} \\qquad (4)$$\n\nIn this model, the top layer DP generates cluster-specific parameters $\\mu_j$ and $\\Sigma_j$ according to the base\ndistribution H and concentration parameter $\\gamma$. These parameters in turn define the base distributions\n$H_j$ of the bottom layer DPs. Since each $H_j$ represents a different cluster, the $H_j$ can be considered\nas meta-clusters from which mixture components of the corresponding cluster are generated. In this\nmodel both the number of clusters and the number of mixture components within a cluster can\nbe potentially infinite, hence the name I2GMM. The top layer DP models the number of clusters,\ntheir sizes, and the base distribution of the bottom layer DPs, whereas each bottom layer DP models\nthe number of components in a cluster and their sizes. Allowing atom locations in the bottom\nlayer DPMGs to be different from their corresponding cluster atom provides the flexibility to model\nclusters that cannot be effectively modeled by a single Gaussian. The scaling parameter $\\kappa_1$ adjusts\nwithin cluster scattering of the component mean vectors whereas the scaling parameter $\\kappa_0$ adjusts\nbetween cluster scattering of the cluster-specific mean vectors. 
Expressing both H and the $H_j$ as\nfunctions of $\\Sigma_j$ not only preserves the conjugacy of the model but also allows for sharing of the\nsame covariance matrix across mixture components of the same cluster.\nPosterior inference for the proposed model in (4) can be performed by a collapsed Gibbs sampler\nby iteratively sampling component indicator variables $t = \\{t_i\\}_{i=1}^{n}$ of data instances and cluster\nindicator variables $c = \\{c_k\\}_{k=1}^{K}$ of mixture components. When sampling $t_i$ we restrict sampling\nto components whose cluster indicator variables are equal to $c_{t_i}$ in addition to a new component.\nThe conditional distribution for sampling $t_i$ can be expressed by the following equation.\n\n$$p(t_i = k \\mid X, t_{-i}, c) \\propto \\begin{cases} \\alpha p(x_i) & \\text{if } k = K + 1 \\\\ n_k^{-i} p(x_i \\mid A_k^{-i}, \\bar{x}_k^{-i}, S_{c_k}) & \\text{if } k : c_k = c_{t_i} \\end{cases} \\qquad (5)$$\n\nwhere $S_{c_k} = \\{A_\\ell, \\bar{x}_\\ell, n_\\ell\\}_{\\ell : c_\\ell = c_k}$. When sampling component indicator variables, owing to the\ndependency among data instances, removing a data instance from a component not only affects the\nparameters of the component it belongs to but also the corresponding cluster parameters. Technically\nspeaking, the parameters of both the component and the corresponding cluster have to be updated\nfor exact inference. However, updating cluster parameters for every data instance removed will significantly\nslow down inference. For practical purposes we only update component parameters and\nassume that removing a single data instance does not significantly change cluster parameters. The\nconditional distribution for sampling $c_k$ can be expressed by the following equation.\n\n$$p(c_k = j \\mid X, t, c_{-k}) \\propto \\begin{cases} \\gamma \\prod_{i : t_i = k} p(x_i) & \\text{if } j = J + 1 \\\\ m_j \\prod_{i : t_i = k} p(x_i \\mid S_j) & \\text{if } j \\le J \\end{cases} \\qquad (6)$$\n\nwhere $S_j = \\{A_\\ell, \\bar{x}_\\ell, n_\\ell\\}_{\\ell : c_\\ell = j}$, J is the number of clusters, and $m_j$ is the number of mixture\ncomponents assigned to cluster j. 
Next, we discuss the derivation of the component-level posterior\npredictive distributions, i.e., $p(x_i \\mid A_k^{-i}, \\bar{x}_k^{-i}, S_{c_k})$, which can be obtained by evaluating the integral\nin (7).\n\n$$p(x_i \\mid A_k^{-i}, \\bar{x}_k^{-i}, S_{c_k}) = \\int \\int p(x_i \\mid \\mu_k, \\Sigma_{c_k}) p(\\mu_k, \\Sigma_{c_k} \\mid A_k^{-i}, \\bar{x}_k^{-i}, S_{c_k}) \\partial\\mu_k \\partial\\Sigma_{c_k} \\qquad (7)$$\n\nTo evaluate the integral in (7) we need the posterior distribution of the component parameters,\nnamely $p(\\mu_k, \\Sigma_{c_k} \\mid A_k^{-i}, \\bar{x}_k^{-i}, S_{c_k})$, which is proportional to\n\n$$\\begin{aligned} p(\\mu_k, \\Sigma_{c_k} \\mid A_k^{-i}, \\bar{x}_k^{-i}, S_{c_k}) &\\propto p(\\mu_k, \\Sigma_{c_k}, A_k^{-i}, \\bar{x}_k^{-i} \\mid S_{c_k}) \\\\ &= p(\\bar{x}_k^{-i} \\mid \\mu_k, \\Sigma_{c_k}) p(A_k^{-i} \\mid \\Sigma_{c_k}) p(\\mu_k \\mid \\Sigma_{c_k}, S_{c_k}) p(\\Sigma_{c_k} \\mid S_{c_k}) \\end{aligned} \\qquad (8)$$\n\nwhere\n\n$$\\begin{aligned} p(\\bar{x}_k^{-i} \\mid \\mu_k, \\Sigma_{c_k}) &= N\\left(\\mu_k, (n_k^{-i})^{-1} \\Sigma_{c_k}\\right) \\\\ p(A_k^{-i} \\mid \\Sigma_{c_k}) &= W\\left(\\Sigma_{c_k}, n_k^{-i} - 1\\right) \\\\ p(\\mu_k \\mid \\Sigma_{c_k}, S_{c_k}) &= N\\left(\\bar{\\mu}, \\bar{\\kappa}^{-1} \\Sigma_{c_k}\\right) \\\\ p(\\Sigma_{c_k} \\mid S_{c_k}) &= W^{-1}\\left(\\Sigma_0 + \\sum_{\\ell : c_\\ell = c_k} A_\\ell, m + \\sum_{\\ell : c_\\ell = c_k} (n_\\ell - 1)\\right) \\\\ \\bar{\\mu} &= \\frac{\\sum_{\\ell : c_\\ell = c_k} \\frac{n_\\ell \\kappa_1}{n_\\ell + \\kappa_1} \\bar{x}_\\ell + \\kappa_0 \\mu_0}{\\sum_{\\ell : c_\\ell = c_k} \\frac{n_\\ell \\kappa_1}{n_\\ell + \\kappa_1} + \\kappa_0} \\\\ \\bar{\\kappa} &= \\frac{\\left(\\sum_{\\ell : c_\\ell = c_k} \\frac{n_\\ell \\kappa_1}{n_\\ell + \\kappa_1} + \\kappa_0\\right) \\kappa_1}{\\sum_{\\ell : c_\\ell = c_k} \\frac{n_\\ell \\kappa_1}{n_\\ell + \\kappa_1} + \\kappa_0 + \\kappa_1} \\end{aligned}$$\n\nOnce we substitute $p(\\mu_k, \\Sigma_{c_k} \\mid A_k^{-i}, \\bar{x}_k^{-i}, S_{c_k})$ into (7) and evaluate the integral we obtain\n$p(x_i \\mid A_k^{-i}, \\bar{x}_k^{-i}, S_{c_k})$ in the form of a multivariate Student-t distribution.\n\n
$$p(x_i \\mid A_k^{-i}, \\bar{x}_k^{-i}, S_{c_k}) = \\text{stu-t}(\\hat{\\mu}, \\hat{\\Sigma}, v) \\qquad (9)$$\n\nThe location vector ($\\hat{\\mu}$), the scale matrix ($\\hat{\\Sigma}$), and the degrees of freedom (v) are given below.\nLocation vector:\n\n$$\\hat{\\mu} = \\frac{n_k^{-i} \\bar{x}_k^{-i} + \\bar{\\kappa} \\bar{\\mu}}{n_k^{-i} + \\bar{\\kappa}} \\qquad (10)$$\n\nScale matrix:\n\n$$\\hat{\\Sigma} = \\frac{\\Sigma_0 + \\sum_{\\ell : c_\\ell = c_k} A_\\ell + A_k^{-i} + \\frac{n_k^{-i} \\bar{\\kappa}}{n_k^{-i} + \\bar{\\kappa}} (\\bar{x}_k^{-i} - \\bar{\\mu})(\\bar{x}_k^{-i} - \\bar{\\mu})^T}{\\frac{(\\bar{\\kappa} + n_k^{-i}) v}{\\bar{\\kappa} + n_k^{-i} + 1}} \\qquad (11)$$\n\nDegrees of freedom:\n\n$$v = m + \\sum_{\\ell : c_\\ell = c_k} (n_\\ell - 1) + n_k^{-i} - d + 1 \\qquad (12)$$\n\nThe cluster-level posterior predictive distributions can be readily obtained from $p(x_i \\mid A_k^{-i}, \\bar{x}_k^{-i}, S_{c_k})$\nby dropping $A_k$, $\\bar{x}_k$, and $n_k$ from (10)-(12). Similarly, the posterior predictive distribution for an empty\ncomponent/cluster can be obtained by dropping $S_{c_k}$ from (10)-(12) in addition to $A_k$, $\\bar{x}_k$, and $n_k$.\nThanks to the two-layer structure of the proposed model, the inference for I2GMM can be partially\nparallelized. 
Conditioned on the cluster indicator variables, component indicator variables for data\ninstances in the same cluster can be sampled independently of the data instances in other clusters.\nThe amount of actual speed up that can be achieved by parallelization depends on multiple factors\nincluding the number of clusters, cluster sizes, and how fast the other loop that iterates over cluster\nindicator variables can be run.\n\n5 Experiments\n\nWe evaluate the proposed I2GMM model on five different data sets and compare its performance\nagainst three different versions of DPMG in terms of clustering accuracy and run time.\n\n5.1 Data Sets\n\nFlower formed by Gaussians: We generated a flower-shaped two-dimensional artificial data set\nusing a different Gaussian mixture model for each of the four different parts (petals, stem, and two\nleaves) of the flower. Each part is considered as a separate cluster. Although covariance matrices\nare the same for all Gaussian components within a mixture, they do differ between mixtures to create\nclusters of different shapes. Petals are formed by a mixture of nine Gaussians sharing a spherical\ncovariance. Stem is formed by a mixture of four Gaussians sharing a diagonal covariance. Each leaf\nis formed by a mixture of two Gaussians sharing a full covariance. There are a total of seventeen\nGaussian components, four clusters, and 17,000 instances (1000 instances per component) in this\ndata set. A scatter plot of this data set is shown in Fig 1a.\nLymphoma: The lymphoma data set is one of the data sets used in the FlowCAP (Flow Cytometry Critical\nAssessment of Population Identification Methods) 2010 competition [1]. This data set consists\nof thirty sub-data sets each generated from a lymph node biopsy sample of a patient using a flow\ncytometer. 
Flow cytometry is a single-cell screening, analysis, and sorting technology that plays a\ncrucial role in research and clinical immunology, hematology, and oncology. The cellular pheno-\ntypes are de\ufb01ned in FC by combinations of morphological features (measured by elastic light scatter)\nand abundances of surface and intracellular markers revealed by \ufb02uorescently labeled antibodies. In\nthe lymphoma data set each of the sub-data set contains thousands of instances with each instance\nrepresenting a cell by a \ufb01ve-dimensional feature vector. For each sub-data set cell populations are\nmanually gated by experts. Each sub-data has between two to four cell populations, i.e., clusters.\nOwing to the intrinsic mechanical and optical limitations of a \ufb02ow cytometer, distributions of cell\npopulations in the FC data end up being heavy-tailed or skewed, which makes their modeling by a\nsingle Gaussian highly impractical [12]. Although clusters in this data set are relatively well-de\ufb01ned\naccurate modeling of cell distributions is a challenge due to skewed nature of distributions.\nRare cell populations: This data set is a small subset of one of the data sets used in the FlowCAP\n2012 competition [1]. The data set contains about 279,546 instances with each instance characteriz-\ning a white blood cell in a six-dimensional feature space. There are three clusters manually labeled\nby experts. This is an interesting data set for two reasons. First, clusters are highly unbalanced in\nterms of the number of instances belonging to each cluster. Two of the clusters, which are highly\nsigni\ufb01cant for measuring immunological response of the patient, are extremely rare. The ratios of\nthe number of instances available from each of the two rare classes to the total number of instances\nare 0.0004 and 0.0005, respectively. 
Second, the third cluster, which contains all cells not belonging\nto one of the two rare-cell populations, has a distribution that is both skewed and multi-modal\nmaking it extremely challenging to recover its distribution as a single cluster.\nHyperspectral imagery: This data set is a flightline over a university campus. The hyperspectral\ndata provides image data in 126 spectral bands in the visible and infrared regions. A total of 21,518\npixels from eight different land cover types are manually labeled. Some of the land cover types\nsuch as roof tops have multi-modal distributions. Cluster sizes are also relatively unbalanced with\npixels belonging to roof tops constituting about one half of the labeled pixels. To reduce run time the\ndimensionality is reduced by projecting the original data onto its first thirty principal components.\nThe data with reduced dimensionality is used in all experiments.\nLetter recognition: This is a benchmark data set available through the UCI machine learning repository\n[4]. There are twenty six well-balanced clusters (one for each letter) in this data set.\n\n[Figure 1 panels: (a) True Clusters, (b) I2GMM, (c) VB, (d) KD-VB, (e) ColGibbs]\n\nFigure 1: Clusters predicted by I2GMM, VB, KD-VB, and ColGibbs on the flower data set. Black\ncontours in the first figure indicate distributions of individual Gaussian components forming the\nflower. Each color refers to a different cluster. Points denote data instances.\n\nTable 1: Micro and macro F1 scores produced by I2GMM, VB, KD-VB, and ColGibbs on the five\ndata sets. 
For each data set the first row gives micro F1 scores and the second row macro F1\nscores. Numbers in parentheses indicate standard deviations across ten repetitions. Results for the\nlymphoma data set are the average of results from thirty sub-data sets.\n\nData set | I2GMMp | I2GMM | VB | KD-VB | ColGibbs\nFlower (micro) | 0.975 (0.032) | 0.991 (0.003) | 0.640 (0.087) | 0.584 | 0.525 (0.010)\nFlower (macro) | 0.982 (0.015) | 0.990 (0.002) | 0.643 (0.059) | 0.639 | 0.611 (0.009)\nLymphoma (micro) | 0.920 (0.016) | 0.922 (0.020) | 0.454 (0.056) | 0.819 | 0.634 (0.034)\nLymphoma (macro) | 0.847 (0.021) | 0.847 (0.022) | 0.509 (0.044) | 0.762 | 0.656 (0.029)\nRare classes (micro) | 0.487 (0.031) | 0.493 (0.020) | 0.182 (0.015) | 0.353 | 0.234 (0.059)\nRare classes (macro) | 0.756 (0.012) | 0.756 (0.010) | 0.441 (0.032) | 0.472 | 0.638 (0.023)\nHyperspectral (micro) | 0.624 (0.017) | 0.626 (0.021) | 0.433 (0.031) | 0.554 | 0.427 (0.024)\nHyperspectral (macro) | 0.667 (0.018) | 0.661 (0.012) | 0.580 (0.034) | 0.380 | 0.596 (0.020)\nLetter Recognition (micro) | 0.459 (0.015) | 0.467 (0.017) | 0.420 (0.015) | 0.267 | 0.398 (0.018)\nLetter Recognition (macro) | 0.460 (0.015) | 0.467 (0.017) | 0.420 (0.015) | 0.267 | 0.399 (0.018)\n\n5.2 Benchmark Models and Evaluation Metric\n\nWe compare the performance of the proposed I2GMM model with three different versions of DPMG.\nThese include the collapsed Gibbs sampler version (ColGibbs) discussed in Section 3, the variational\nBayes version (VB) introduced in [5], and the KD-tree based accelerated variational Bayes version\n(KD-VB) introduced in [11]. For I2GMM and ColGibbs we used our own implementations developed\nin C++. For VB and KD-VB we used existing MATLAB\u00ae (Natick, MA) implementations (footnote 1).\nIn order to see the effect of parallelization over execution times we ran the proposed technique in\ntwo modes: parallelized (I2GMMp) and unparallelized (I2GMM).\nAll data sets are scaled to have unit variance for each feature. The ColGibbs model has five free\nparameters (\u03b1, \u03a30, m, \u03ba0, \u00b50); the I2GMM model has two more parameters (\u03ba1, \u03b3) than ColGibbs. 
We\nuse vague priors with \u03b1 and \u03b3 by \ufb01xing their value to one. We set m to the minimum feasible value,\nwhich is d+2, to achieve maximum degrees of freedom in the shape of the covariance matrices. The\nprior mean \u00b50 is set to the mean of the entire data. The scale matrix \u03a30 is set to I/s, where I is the\nidentity matrix. This leaves the scaling constant s of \u03a30, \u03ba0, and \u03ba1 as the three free parameters. We\nuse s = 150/(d(logd)), \u03ba0 = 0.05, and \u03ba1 = 0.5 in experiments with all \ufb01ve data sets described\nabove.\nMicro and macro F1 scores are used as performance measures for comparing clustering accuracy of\nthese four techniques. As one-to-many matchings are expected between true and predicted clusters,\nthe F1 score for a true cluster is computed as the maximum of the F1 scores for all predicted clusters.\nThe Gibbs sampler for ColGibbs and I2GMM are run for 1500 sweeps. The \ufb01rst 1000 samples are\nignored as burn-in and eleven samples drawn with \ufb01fty sweeps apart are saved for \ufb01nal evaluation.\nWe used an approach similar to the one proposed in [6] for matching cluster labels across different\nsamples. The mode of cluster labels computed across ten samples are assigned as the \ufb01nal cluster\nlabel for each data instance. ColGibbs and I2GMM use stochastic sampling whereas VB use a\nrandom initialization stage. Thus, these three techniques may produce results that vary from one\nrun to other on the same data set. Therefore we repeat each experiment ten times and report average\nresults of ten repetitions for these three techniques.\n\n5.3 Results and Discussion\n\nMicro and macro F1 produced by the four techniques on all \ufb01ve data sets are reported in Table 1. On\nthe \ufb02ower data set I2GMM achieves almost perfect micro and macro F1 scores and correctly predicts\nthe true number of clusters. The other three techniques produce several extraneous clusters which\nlead to poor F1 scores. 
Clusters predicted by each of the four techniques are shown in Fig. 1. As\nexpected, ColGibbs identifies distributions of individual Gaussian components as clusters as opposed\nto the actual clusters formed by mixtures of Gaussians. The piece-wise linear cluster boundaries\nobtained by VB and KD-VB, splitting original clusters into multiple subclusters, can be explained\nby simplistic model assumptions and approximations that characterize variational Bayes algorithms.\n\nFootnote 1: https://sites.google.com/site/kenichikurihara/academic-software/variational-dirichlet-process-gaussian-mixture-model\n\nTable 2: Execution times for I2GMM, I2GMMp, VB, KD-VB, and ColGibbs in seconds on the\nfive data sets. Numbers in parentheses indicate standard deviations across ten repetitions. For the\nlymphoma data set results reported are average run-time per sub-data set.\n\nData set | I2GMM | I2GMMp | VB | KD-VB | ColGibbs\nFlower | 54 (2) | 41 (4) | 1 (0.2) | 7 | 59 (1)\nLymphoma | 119 (4) | 85 (4) | 51 (10) | 3 | 63 (3)\nRare classes | 9,738 (349) | 5,034 (220) | 2,171 (569) | 16 | 7,250 (182)\nHyperspectral | 5,385 (109) | 3,456 (174) | 582 (156) | 2 | 7,455 (221)\nLetter Recognition | 1,545 (63) | 953 (26) | 122 (22) | 12 | 2,785 (123)\n\nOn the lymphoma data set the proposed I2GMM model achieves average micro and macro F1\nscores of 0.920 and 0.848, respectively. These values are not only significantly higher than corresponding\nF1 scores produced by the other three techniques but also on par with the best performing\ntechniques in the FlowCAP 2010 competition [2]. Results for thirty individual sub-data sets in the\nlymphoma data set are available in the supplementary document. A similar trend is also observed\nwith the other three real-world data sets as I2GMM achieves the best F1 score among the four techniques.\nBetween I2GMM and ColGibbs, I2GMM consistently generates fewer clusters\nacross all data sets as expected. 
Overall, among the three different versions of DPMG that differ in\nthe inference algorithm used, there is no clear consensus across \ufb01ve data sets as to which version\npredicts clusters more accurately. However, the proposed I2GMM model which extends DPMG to\nskewed and multi-modal clusters, clearly stands out as the most accurate model on all \ufb01ve data sets.\nRun time results included in Table 2 favors variational Bayes techniques over the Gibbs sampler-\nbased ones as expected. Despite longer run times, signi\ufb01cantly higher F1 scores achieved on data\nsets with diverse characteristics suggest that I2GMM can be preferred over DPMG for more accurate\nclustering. Results also suggest that I2GMM can bene\ufb01t from parallelization. The actual amount of\nimprovement in execution time depend on data characteristics as well as how fast the unparallelized\nloop can be run. The largest gain by parallelization is obtained on the rare classes data set which\noffered almost two-fold increase by parallelization on an eight-core workstation.\n\n6 Conclusions\n\nWe introduced I2GMM for more effective clustering of multivariate data sets containing skewed\nand multi-modal clusters. The proposed model extends DPMG to introduce dependencies between\ncomponents and clusters by a two-layer generative model. Unlike standard DPMG where each\ncluster is modeled by a single Gaussian, I2GMM offers the \ufb02exibility to model each cluster data\nby a mixture of potentially in\ufb01nite number of components. Results on experiments with real and\narti\ufb01cial data sets favor I2GMM over variational Bayes and collapsed Gibbs sampler versions of\nDPMG in terms of clustering accuracy. Although execution time can be improved by sampling\ncomponent indicator variables in parallel, the amount of speed up that can be gained is limited with\nthe execution time of the sampling of the cluster indicator variables. 
As the most time-consuming part of\nthis task is the sequential computation of likelihoods for data instances, significant gains in execution\ntime can be achieved by parallelizing the computation of likelihoods. I2GMM is implemented in\nC++. The source files and executables are available on the web. 2\n\nAcknowledgments\n\nThis research was sponsored by the National Science Foundation (NSF) under Grant Number IIS-1252648\n(CAREER), by the National Institute of Biomedical Imaging and Bioengineering (NIBIB)\nunder Grant Number 5R21EB015707, and by the PhRMA Foundation (2012 Research Starter Grant\nin Informatics). The content is solely the responsibility of the authors and does not represent the\nofficial views of NSF, NIBIB or PhRMA.\n\n2https://github.com/halidziya/I2GMM\n\nReferences\n[1] FlowCAP - flow cytometry: Critical assessment of population identification methods.\nhttp://flowcap.flowsite.org/.\n\n[2] N. Aghaeepour, G. Finak, FlowCAP Consortium, DREAM Consortium, H. Hoos, T. R. Mosmann,\nR. Brinkman, R. Gottardo, and R. H. Scheuermann. Critical assessment of automated\nflow cytometry data analysis techniques. Nature Methods, 10(3):228\u2013238, Mar. 2013.\n\n[3] D. J. Aldous. Exchangeability and related topics. In \u00c9cole d\u2019\u00c9t\u00e9 St Flour 1983, pages 1\u2013198.\nSpringer-Verlag, 1985. Lecture Notes in Math. 1117.\n\n[4] K. Bache and M. Lichman. UCI machine learning repository, 2013.\n\n[5] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian\nAnalysis, 1(1):121\u2013144, 2006.\n\n[6] A. J. Cron and M. West. Efficient classification-based relabeling in mixture models. The\nAmerican Statistician, 65:16\u201320, 2011. PMC3110018.\n\n[7] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM\nalgorithm. Journal of the Royal Statistical Society, 39(1):1\u201338, 1977.\n\n[8] T. S. Ferguson. 
A Bayesian analysis of some nonparametric problems. Annals of Statistics,\n1(2):209\u2013230, 1973.\n\n[9] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of\nthe American Statistical Association, 96(453):161\u2013173, 2001.\n\n[10] S. Kim and P. Smyth. Hierarchical Dirichlet processes with random effects. In B. Sch\u00f6lkopf,\nJ. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19,\npages 697\u2013704, Cambridge, MA, 2007. MIT Press.\n\n[11] K. Kurihara, M. Welling, and N. Vlassis. Accelerated variational Dirichlet process mixtures.\nIn Advances in Neural Information Processing Systems 19. 2007.\n\n[12] S. Pyne, X. Hu, K. Wang, E. Rossin, T.-I. Lin, L. M. Maier, C. Baecher-Allan, G. J. McLachlan,\nP. Tamayo, D. A. Hafler, P. L. De Jager, and J. P. Mesirov. Automated high-dimensional flow\ncytometric data analysis. Proc Natl Acad Sci U S A, 106(21):8519\u201324, 2009.\n\n[13] A. Rodriguez, D. B. Dunson, and A. E. Gelfand. The nested Dirichlet process. Journal of the\nAmerican Statistical Association, 103:1131\u20131154, 2008.\n\n[14] E. B. Sudderth, A. B. Torralba, W. T. Freeman, and A. S. Willsky. Describing visual scenes\nusing transformed objects and parts. International Journal of Computer Vision, 77:291\u2013330,\n2008.\n\n[15] Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the\nAmerican Statistical Association, 101(476):1566\u20131581, 2006.\n", "award": [], "sourceid": 26, "authors": [{"given_name": "Halid", "family_name": "Yerebakan", "institution": "IUPUI"}, {"given_name": "Bartek", "family_name": "Rajwa", "institution": "Purdue University"}, {"given_name": "Murat", "family_name": "Dundar", "institution": "IUPUI"}]}