{"title": "A nonparametric variable clustering model", "book": "Advances in Neural Information Processing Systems", "page_first": 2987, "page_last": 2995, "abstract": "Factor analysis models effectively summarise the covariance structure of high dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a clustering, of observed variables so that variables in a cluster are highly correlated. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date.", "full_text": "A nonparametric variable clustering model\n\nKonstantina Palla\u2217\nUniversity of Cambridge\nkp376@cam.ac.uk\n\nDavid A. Knowles\u2217\nStanford University\n\ndavidknowles@cs.stanford.edu\n\nZoubin Ghahramani\nUniversity of Cambridge\n\nzoubin@eng.cam.ac.uk\n\nAbstract\n\nFactor analysis models effectively summarise the covariance structure of high di-\nmensional data, but the solutions are typically hard to interpret. This motivates at-\ntempting to \ufb01nd a disjoint partition, i.e. a simple clustering, of observed variables\ninto highly correlated subsets. We introduce a Bayesian non-parametric approach\nto this problem, and demonstrate advantages over heuristic methods proposed to\ndate. Our Dirichlet process variable clustering (DPVC) model can discover block-\ndiagonal covariance structures in data. We evaluate our method on both synthetic\nand gene expression analysis problems.\n\n1\n\nIntroduction\n\nLatent variables models such as principal components analysis (Pearson, 1901; Hotelling, 1933; Tip-\nping and Bishop, 1999; Roweis, 1998) and factor analysis (Young, 1941) are popular for summaris-\ning high dimensional data, and can be seen as modelling the covariance of the observed dimensions.\nSuch models may be used for tasks such as collaborative \ufb01ltering, dimensionality reduction, or data\nexploration. 
For all these applications sparse factor analysis models can have advantages in terms of both predictive performance and interpretability (Fokoue, 2004; Fevotte and Godsill, 2006; Carvalho et al., 2008). For example, data exploration might involve investigating which variables have significant loadings on a shared factor, which is aided if the model itself is sparse. However, even with sparse models, interpreting the results of a factor analysis can be non-trivial since a variable will typically have significant loadings on multiple factors.\nAs a result of these problems researchers will often simply cluster variables using a traditional agglomerative hierarchical clustering algorithm (Vigneau and Qannari, 2003; Duda et al., 2001). Interest in variable clustering exists in many applied fields, e.g. chemistry (Basak et al., 2000a,b) and actuarial science (Sanche and Lonergan, 2006). However, it is most commonly applied to gene expression analysis (Eisen et al., 1998; Alon et al., 1999; D\u2019haeseleer et al., 2005), which will also be the focus of our investigation. Note that variable clustering represents the opposite regime to the usual clustering setting where we partition samples rather than dimensions (but of course a clustering algorithm can be made to work like this simply by transposing the data matrix). Typical clustering algorithms, and their probabilistic mixture model analogues, consider how similar entities are (e.g. in terms of Euclidean distance) rather than how correlated they are, which would be closer in spirit to the ability of factor analysis to model covariance structure. 
While using correlation distance (one minus the Pearson correlation coefficient) between variables has been proposed for clustering genes with heuristic methods, the corresponding probabilistic model appears not to have been explored, to the best of our knowledge.\n\n\u2217These authors contributed equally to this work\n\nTo address the general problem of variable clustering we develop a simple Bayesian nonparametric model which partitions observed variables into sets of highly correlated variables. We denote our method DPVC for \u201cDirichlet Process Variable Clustering\u201d. DPVC exhibits the usual advantages over heuristic methods of being both probabilistic and non-parametric: we can naturally handle missing data, learn the appropriate number of clusters from data, and avoid overfitting.\nThe paper is organised as follows. Section 2 describes the generative process. In Section 3 we note relationships to existing nonparametric sparse factor analysis models, Dirichlet process mixture models, structure learning with hidden variables, and the closely related \u201cCrossCat\u201d model (Shafto et al., 2006). In Section 4 we describe efficient MCMC and variational Bayes algorithms for performing posterior inference in DPVC, and point out computational savings resulting from the simple nature of the model. In Section 5 we present results on synthetic data where we test the method\u2019s ability to recover a \u201ctrue\u201d partitioning, and then focus on clustering genes based on gene expression data, where we assess predictive performance on held out data. Concluding remarks are given in Section 6.\n\n2 The Dirichlet Process Variable Clustering Model\n\nConsider observed data {yn \u2208 R^D : n = 1, ..., N} where we have D observed dimensions and N samples. The D observed dimensions correspond to measured variables for each sample, and our goal is to cluster these variables. 
We partition the observed dimensions d = {1, ..., D} according to the Chinese restaurant process (Pitman, 2002, CRP). The CRP defines a distribution over partitionings (clusterings) where the maximum possible number of clusters does not need to be specified a priori. The CRP can be described using a sequential generative process: D customers enter a Chinese restaurant one at a time. The first customer sits at some table and each subsequent customer sits at table k with mk current customers with probability proportional to mk, or at a new table with probability proportional to \u03b1, where \u03b1 is a parameter of the CRP. The seating arrangement of the customers at tables corresponds to a partitioning of the D customers. We write\n\n(c1, ..., cD) \u223c CRP(\u03b1), cd \u2208 N (1)\n\nwhere cd = k denotes that variable d belongs to cluster k. The CRP partitioning allows each dimension to belong to only one cluster. For each cluster k we have a single latent factor\n\nxkn \u223c N(0, \u03c3x\u00b2) (2)\n\nwhich models correlations between the variables in cluster k. Given these latent factors, real valued observed data can be modeled as\n\nydn = gd xcdn + \u03b5dn (3)\n\nwhere gd is a factor loading for dimension d, and \u03b5dn \u223c N(0, \u03c3d\u00b2) is Gaussian noise. We place a Gaussian prior N(0, \u03c3g\u00b2) on every element gd independently. It is straightforward to generalise the model by substituting other noise models for Equation 3, for example using a logistic link for binary data ydn \u2208 {0, 1}. 
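The sequential CRP construction and the Gaussian generative process above can be sketched in a few lines. This is a toy sketch under our own naming, not the authors' implementation; for brevity it uses a single shared noise standard deviation `sigma_d` where the model has a per-dimension \u03c3d\u00b2.

```python
import numpy as np

def sample_crp(D, alpha, rng):
    """Seat D customers (observed dimensions) at tables (clusters) sequentially."""
    c = np.zeros(D, dtype=int)
    counts = []                                 # m_k: customers already at table k
    for d in range(D):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()                    # P(table k) oc m_k, P(new table) oc alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(0)                    # open a new table
        counts[k] += 1
        c[d] = k
    return c

def sample_dpvc(D, N, alpha=1.0, sigma_g=1.0, sigma_d=0.1, rng=None):
    """Draw (Y, c, X, g) from the DPVC generative process, Gaussian noise case."""
    rng = np.random.default_rng(0) if rng is None else rng
    c = sample_crp(D, alpha, rng)               # Equation 1
    K = int(c.max()) + 1
    X = rng.normal(size=(K, N))                 # Equation 2, with sigma_x = 1
    g = rng.normal(0.0, sigma_g, size=D)        # loadings g_d ~ N(0, sigma_g^2)
    Y = g[:, None] * X[c] + rng.normal(0.0, sigma_d, size=(D, N))  # Equation 3
    return Y, c, X, g
```

Note that the number of clusters K is not fixed in advance: it is whatever the CRP seating produces for the given D and \u03b1.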
However, in the following we will focus on the Gaussian case.\nTo improve the flexibility of the model, we put Inverse Gamma priors on \u03c3g\u00b2 and \u03c3d\u00b2 and a Gamma prior on the CRP concentration parameter \u03b1 as follows:\n\n\u03b1 \u223c G(1, 1)\n\u03c3g\u00b2 \u223c IG(1, 1)\n\u03c3d\u00b2 \u223c IG(1, 0.1)\n\nNote that we fix \u03c3x = 1 due to the scale ambiguity in the model.\n\n3 Related work\n\nSince DPVC is a hybrid mixture/factor analysis model there is of course a wealth of related work, but we aim to highlight a few interesting connections here.\nDPVC can be seen as a simplification of the infinite factor analysis models proposed by Knowles and Ghahramani (2007) and Rai and Daum\u00e9 III (2008), which we will refer to as Non-parametric Sparse Factor Analysis (NSFA).\n\nFigure 1: Graphical model structure that could be learnt using the model, corresponding to cluster assignments c = {1, 1, 1, 2, 2, 3}. Gray nodes represent the D = 6 observed variables yd and white nodes represent the K = 3 latent variables xk.\n\nWhere they used the Indian buffet process to allow dimensions to have non-zero loadings on multiple factors, we use the Chinese restaurant process to explicitly enforce that a dimension can be explained by only one factor. Obviously this will not be appropriate in all circumstances, but where it is appropriate we feel it allows easier interpretation of the results. To see the relationship more clearly, introduce the indicator variable zdk = I[cd = k]. We can then write our model as\n\nyn = (G \u00b7 Z)xn + \u03b5n (4)\n\nwhere G is a D \u00d7 K Gaussian matrix, and \u00b7 denotes elementwise multiplication. Replacing our Chinese restaurant process prior on Z with an Indian buffet prior recovers an infinite factor analysis model. Equation 4 has the form of a factor analysis model. 
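Equation 4 can be written out directly to see the masked-loadings form; a small sketch (variable names are ours) using the Figure 1 assignment c = (1, 1, 1, 2, 2, 3), zero-indexed here:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, N = 6, 3, 100
c = np.array([0, 0, 0, 1, 1, 2])            # cluster assignments from Figure 1 (zero-indexed)
Z = np.zeros((D, K))
Z[np.arange(D), c] = 1.0                    # z_dk = I[c_d = k]
G = rng.normal(size=(D, K))                 # D x K Gaussian loading matrix
X = rng.normal(size=(K, N))                 # latent factors
Y = (G * Z) @ X + 0.1 * rng.normal(size=(D, N))  # Equation 4: y_n = (G . Z) x_n + eps_n
# Each row of G . Z has exactly one non-zero entry, so every dimension
# loads on exactly one factor; an IBP prior on Z would relax this.
assert (np.count_nonzero(G * Z, axis=1) == 1).all()
```

The only difference from an infinite factor analysis model is the prior on Z: a CRP forces one-hot rows, an IBP allows arbitrary binary rows.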
It is straightforward to show that the conditional covariance of y given the factor loading matrix W := G \u00b7 Z is \u03c3x\u00b2 WW^T + \u03c3\u03b5\u00b2 I. Analogously for DPVC we find\n\ncov(ydn, yd'n | G, c) = \u03c3x\u00b2 gd gd' + \u03c3d\u00b2 \u03b4dd' if cd = cd', and 0 otherwise. (5)\n\nThus we see the covariance structure implied by DPVC is block diagonal: only dimensions belonging to the same cluster have non-zero covariance.\nThe obvious probabilistic approach to clustering genes would be to simply apply a Dirichlet process mixture (DPM) of Gaussians, but considering the genes (our dimensions) as samples, and our samples as \u201cfeatures\u201d so that the partitioning would be over the genes. However, this approach would not achieve the desired result of clustering correlated variables, and would rather cluster together variables close in terms of Euclidean distance. For example, two variables which have the relationship yd = a yd' for a = \u22121 (or a = 2) are perfectly correlated but not close in Euclidean space; a DPM approach would likely fail to cluster these together. Also, practitioners typically choose either to use restrictive diagonal Gaussians, or full covariance Gaussians which result in considerably greater computational cost than our method (see Section 4.3).\nDPVC can also be seen as performing a simple form of structure learning, where the observed variables are partitioned into groups explained by a single latent variable. This is a subset of the structures considered in Silva et al. (2006), but we maintain uncertainty over the structure using a fully Bayesian analysis. Figure 1 illustrates this idea.\nDPVC is also closely related to CrossCat (Shafto et al., 2006). CrossCat also uses a CRP to partition variables into clusters, but then uses a second level of independent CRPs to model the dependence of variables within a cluster. 
In other words whereas the latent variables x in Figure 1 are discrete variables (indicating cluster assignment) in CrossCat, they are continuous variables in DPVC corresponding to the latent factors. For certain data the CrossCat model may be more appropriate but our simple factor analysis model is more computationally tractable and often has good predictive performance as well. The model of Niu et al. (2012) is related to CrossCat in the same way that NSFA is related to DPVC, by allowing an observed dimension to belong to multiple features using the IBP rather than only one cluster using the CRP.\n\n4 Inference\n\nWe demonstrate both MCMC and variational inference for the model.\n\nAlgorithm 1 Marginal conditional\n1: for m = 1 to M do\n2: \u03b8(m) \u223c P(\u03b8)\n3: Y(m) \u223c P(Y|\u03b8(m))\n4: end for\n\nAlgorithm 2 Successive conditional\n1: \u03b8(1) \u223c P(\u03b8)\n2: Y(1) \u223c P(Y|\u03b8(1))\n3: for m = 2 to M do\n4: \u03b8(m) \u223c Q(\u03b8|\u03b8(m\u22121), Y(m\u22121))\n5: Y(m) \u223c P(Y|\u03b8(m))\n6: end for\n\n4.1 MCMC\n\nWe use a partially collapsed Gibbs sampler to explore the posterior distribution over all latent variables g, c, X as well as hyperparameters \u03c3d\u00b2, \u03c3g\u00b2 and \u03b1. The Gibbs update equations for the factor loadings g, factors X, noise variance \u03c3d\u00b2 and \u03c3g\u00b2 are standard, and therefore only sketched out below with the details deferred to supplementary material. The Dirichlet concentration parameter \u03b1 is sampled using slice sampling (Neal, 2003). We sample the cluster assignments c using Algorithm 8 of Neal (2000), with g integrated out but instantiating X. Updating the factor loading matrix G is done elementwise, sampling from\n\ngdk | Y, G\u2212dk, C, X, \u03c3g, \u03c3x, \u03c3d, \u03b1 \u223c N(\u00b5g*, \u03bbg\u207b\u00b9) (6)\n\nThe factors X can be jointly sampled as\n\nX:n | Y, G, C, \u03c3g, \u03c3x, \u03c3d, \u03b1 \u223c N(\u00b5X:n, \u039bX:n\u207b\u00b9) (7)\n\nWhen sampling the cluster assignments c we found it beneficial to integrate out g, while instantiating X. We require\n\nP(cd = k | yd:, xk:, \u03c3g, c\u2212d) \u221d P(cd = k | c\u2212d) \u222b P(yd: | xk:, gd) p(gd | \u03c3g) dgd\n\nthe calculation of which is given in the supplementary material, along with expressions for \u00b5g*, \u03bbg, \u00b5X:n and \u039bX:n.\nWe confirm the correctness of our algorithm using the joint distribution testing methodology of Geweke (2004). There are two ways to sample from the joint distribution P(Y, \u03b8) over parameters \u03b8 = {g, c, X} and data Y defined by a probabilistic model such as DPVC. The first we will refer to as \u201cmarginal-conditional\u201d sampling, shown in Algorithm 1. Both steps here are straightforward: sampling from the prior followed by sampling from the likelihood model. The second way, referred to as \u201csuccessive-conditional\u201d sampling, is shown in Algorithm 2, where Q represents a single (or multiple) iteration(s) of our MCMC sampler. To validate our sampler we can then check, either informally or using hypothesis tests, whether the samples drawn from the joint P(Y, \u03b8) in these two different ways appear to have come from the same distribution.\nWe apply this method to our DPVC sampler with just N = D = 2, and all hyperparameters fixed as follows: \u03b1 = 1, \u03c3d = 0.1, \u03c3g = 1, \u03c3x = 1. We draw 10^4 samples using both the marginal-conditional and successive-conditional procedures. 
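The two joint-sampling schemes of Algorithms 1 and 2 are easy to mock up. The sketch below uses a toy conjugate model (\u03b8 \u223c N(0,1), y|\u03b8 \u223c N(\u03b8,1)) standing in for DPVC, with Q an exact Gibbs step, so the two joint distributions should agree; all names here are our own.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 20000

def sample_prior():       return rng.normal()                  # theta ~ N(0, 1)
def sample_data(theta):   return theta + rng.normal()          # y | theta ~ N(theta, 1)
def gibbs_step(theta, y):                                      # exact draw from theta | y
    return 0.5 * y + np.sqrt(0.5) * rng.normal()               # posterior N(y/2, 1/2)

# Algorithm 1: marginal-conditional sampling of (theta, y).
mc = np.empty((M, 2))
for m in range(M):
    theta = sample_prior()
    mc[m] = theta, sample_data(theta)

# Algorithm 2: successive-conditional sampling.
sc = np.empty((M, 2))
theta = sample_prior(); y = sample_data(theta)
sc[0] = theta, y
for m in range(1, M):
    theta = gibbs_step(theta, y)   # Q: one MCMC sweep targeting p(theta | y)
    y = sample_data(theta)
    sc[m] = theta, y

# Both sets of draws target the same joint P(Y, theta); compare summaries.
print(mc.mean(axis=0), sc.mean(axis=0))   # both means should be close to [0, 0]
```

A buggy `gibbs_step` (say, dropping the \u221a0.5) would show up as a mismatch between these summaries, which is exactly how this test catches errors in a sampler.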
We look at various characteristics of the samples, including the number of clusters and the mean of X. The distribution of the number of clusters under the successive-conditional sampler matches that under the marginal-conditional sampler almost perfectly. Under the correct successive-conditional sampler the average number of clusters is 1.51 (it should be 1.5): a hypothesis test did not reject the null hypothesis that the means of the two distributions are equal. While this cannot completely guarantee correctness of the algorithm and code, 10^4 samples is a large number for such a small model and thus gives strong evidence that our algorithm is correct.\n\n4.2 Variational inference\n\nWe use Variational Message Passing (Winn and Bishop, 2006) under the Infer.NET framework (Minka et al., 2010) to fit an approximate posterior q to the true posterior p, by minimising the Kullback-Leibler divergence\n\nKL(q||p) = \u2212H[q(v)] \u2212 \u222b q(v) log p(v) dv (8)\n\nwhere H[q(v)] = \u2212\u222b q(v) log q(v) dv is the entropy and v = {w, g, c, X, \u03c3d\u00b2, \u03c3g\u00b2}, where w is introduced so that the Dirichlet process can be approximated as\n\nw \u223c Dirichlet(\u03b1/T, ..., \u03b1/T) (9)\ncd \u223c Discrete(w) (10)\n\nwhere we have truncated to allow a maximum of T clusters. Where not otherwise specified we choose T = D so that every dimension could use its own cluster if this is supported by the data. Note that the Dirichlet process is recovered in the limit T \u2192 \u221e.\nWe use a variational posterior of the form\n\nq(v) = qw(w) q\u03c3g\u00b2(\u03c3g\u00b2) \u220fd=1..D [ qcd(cd) q\u03c3d\u00b2(\u03c3d\u00b2) qgd|cd(gd|cd) \u220fn=1..N qxnd(xnd) ] (11)\n\nwhere qw is a Dirichlet distribution, each qcd is a discrete distribution on {1, ..., T}, q\u03c3g\u00b2 and q\u03c3d\u00b2 are Inverse Gamma distributions, and qxnd and qgd|cd are univariate Gaussian distributions. 
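The finite stand-in for the Dirichlet process in Equations 9 and 10 is simple to sample from; a sketch with our own naming, where T is the truncation level:

```python
import numpy as np

def truncated_dp_assignments(D, alpha, T, rng):
    """Finite symmetric-Dirichlet approximation to the CRP (Equations 9-10)."""
    w = rng.dirichlet(np.full(T, alpha / T))  # w ~ Dirichlet(alpha/T, ..., alpha/T)
    return rng.choice(T, size=D, p=w)         # c_d ~ Discrete(w), d = 1..D

rng = np.random.default_rng(3)
c = truncated_dp_assignments(D=20, alpha=1.0, T=20, rng=rng)  # T = D as in the text
```

As T grows this recovers the Dirichlet process; a small \u03b1 concentrates the weights w on few clusters, so most of the T available clusters stay empty.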
We found that using the structured approximation qgd|cd(gd|cd), where the variational distribution on gd is conditional on the cluster assignment cd, gave considerably improved performance. Using the representation of the Dirichlet process in Equation 10 this model is conditionally conjugate (i.e. all variables have exponential family distributions conditioned on their Markov blanket) so the VB updates are standard and therefore omitted here.\nDue to the symmetry of the model under permutation of the clusters, we are required to somehow break symmetry initially. We experimented with initialising either the variational distribution over the factors qxnd(xnd), with means drawn from N(0, 0.1) and variance 1, or each cluster assignment distribution qcd(cd) to a sample from a uniform Dirichlet. We found initialising the cluster assignments gave considerably better solutions on average. We also typically ran the algorithm L = 10 times and took the solution with the best lower bound on the marginal likelihood.\nWe also experimented with using Expectation Propagation (Minka, 2001) for this model but found that the algorithm often diverged, presumably because of the multimodality in the posterior. It might be possible to alleviate this using damping, but we leave this to future work.\n\n4.3 Computational complexity\n\nDPVC enjoys some computational savings compared to NSFA. For both models sampling the factor loadings matrix is O(DKN), where K is the number of active features/clusters. However, for DPVC sampling the factors X is considerably cheaper. Calculating the diagonal precision matrix is O(KD) (compared to O(K\u00b2D) for the precision in NSFA), and finding the square root of the diagonal elements is negligible at O(K) (compared to an O(K\u00b3) Cholesky decomposition for NSFA). Finally both models require an O(DKN) operation to calculate the conditional mean of X. 
Thus where NSFA is O(DKN + DK\u00b2 + K\u00b3), DPVC is only O(DKN), which is the same complexity as k-means or Expectation Maximisation (EM) for mixture models with diagonal Gaussian clusters. Note that mixture models with full covariance clusters would typically cost O(DKN\u00b3) in this setting due to the need to perform Cholesky decompositions on N \u00d7 N matrices.\n\n5 Results\n\nWe present results on synthetic data and two gene expression data sets. We show comparisons to k-means and hierarchical clustering, for which we use the algorithms provided in the Matlab statistics toolbox. We also compare to our implementation of Bayesian factor analysis (see for example Kaufman and Press (1973) or Rowe and Press (1998)) and the non-parametric sparse factor analysis (NSFA) model of Knowles and Ghahramani (2011). We experimented with three publicly available implementations of DPMs of Gaussians using full covariance matrices, but found that none of them were sufficiently numerically robust to cope with the high dimensional and sometimes ill conditioned gene expression data analysed in this section. 
To provide a similar comparison we implemented a DPM of diagonal covariance Gaussians using a collapsed Gibbs sampler.\n\nFigure 2: Performance of DPVC compared to k-means at recovering the true partitioning used to simulate the data.\n\nDataset | DPVC | NSFA | DPM | FA (K = 5) | FA (K = 10) | FA (K = 20)\nBreast cancer | \u22120.876 \u00b1 0.024 | \u22120.634 \u00b1 0.038 | \u22121.348 \u00b1 0.108 | \u22121.129 \u00b1 0.043 | \u22121.275 \u00b1 0.056 | \u22121.605 \u00b1 0.072\nYeast | \u22120.849 \u00b1 0.012 | \u22120.653 \u00b1 0.061 | \u22121.397 \u00b1 0.419 | \u22121.974 \u00b1 1.925 | \u22121.344 \u00b1 0.165 | \u22121.115 \u00b1 0.052\n\nTable 1: Predictive performance (mean log predictive likelihood over the test elements) on two gene expression datasets.\n\n5.1 Synthetic data\n\nIn order to test the ability of the models to recover a true underlying partitioning of the variables into correlated groups we use synthetic data. We generate synthetic data with D = 20 dimensions partitioned into K = 5 equally sized clusters (of four variables each). Within each cluster we sample analogously to our model: sample xkn \u223c N(0, 1) for all k, n, then gd \u223c N(0, 1) for all d and finally sample ydn \u223c N(gd xcdn, 0.1) for all d, n. We vary the sample size N and perform 10 repeats for each sample size. We compare k-means (given the true number of clusters, 5) using Euclidean distance and correlation distance, and DPVC with inference using MCMC or variational Bayes. To compare the inferred and true partitions we calculate the well known Rand index, which varies between 0 and 1, with 1 denoting perfect recovery of the true clustering. The results are shown in Figure 2. We see that the MCMC implementation of DPVC consistently outperforms the k-means methods. As expected given the nature of the data simulation, k-means using the correlation distance performs better than using Euclidean distance. 
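The synthetic benchmark and the Rand index can be reproduced along these lines (a sketch; `rand_index` is our own helper, and we read the 0.1 in N(gd\u00b7x, 0.1) as the noise variance):

```python
import numpy as np
from itertools import combinations

def rand_index(a, b):
    """Fraction of variable pairs on which two partitions agree (1 = identical)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# D = 20 dimensions in K = 5 equally sized clusters of four variables.
rng = np.random.default_rng(4)
D, K, N = 20, 5, 100
c_true = np.repeat(np.arange(K), D // K)
X = rng.normal(size=(K, N))                   # x_kn ~ N(0, 1)
g = rng.normal(size=D)                        # g_d ~ N(0, 1)
Y = g[:, None] * X[c_true] + rng.normal(0.0, np.sqrt(0.1), size=(D, N))

print(rand_index(c_true, c_true))             # 1.0: perfect recovery
```

The Rand index is label-invariant: any relabelling of the clusters scores 1.0 against the truth, so an inferred partitioning need not reuse the true cluster labels.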
DPVC VB\u2019s performance is somewhat disappointing, suggesting that even the structured variational posterior we use is a poor approximation of the true posterior. We emphasise that k-means is given a significant advantage: it is provided with the true number of clusters. In this light, the performance of DPVC MCMC is impressive, and the seemingly poor performance of DPVC VB is more forgivable (DPVC VB used a truncation level T = D = 20).\n\n5.2 Breast cancer dataset\n\nWe assess these algorithms in terms of predictive performance on the breast cancer dataset of West et al. (2007), including 226 genes across 251 individuals. The samplers were found to have converged after around 500 samples according to standard multiple chain convergence measures, so 1000 MCMC iterations were used for all models. The predictive log likelihood was calculated using every 10th sample from the final 500 samples. We ran 10 repeats holding out a different random 10% of the elements of the matrix as test data each time. The results are shown in Table 1. We see that NSFA performs the best, followed by DPVC. This is not surprising and is the price DPVC pays for a more interpretable solution. However, DPVC does outperform both the DPM and the finite (non-sparse) factor analysis models. We also ran DPVC VB on this dataset but its performance was significantly below that of the MCMC method, with a predictive log likelihood of \u22121.154 \u00b1 0.010. Performing a Gene Ontology enrichment analysis we find clusters enriched for genes involved in both cell cycle regulation and cell division, which is biologically reasonable in a cancer orientated dataset.\n\nFigure 3: Clustering of the covariance structure. Left: k-means using correlation distance. Middle: Agglomerative hierarchical clustering using average linkage and correlation distance. Right: DPVC MCMC.\n\nOn this relatively small dataset it is possible to visualise the D \u00d7 D empirical correlation matrix of the data, and investigate what structure our clustering has uncovered, as shown in Figure 3. The genes have been reordered in each plot according to three different clusterings coming from k-means, hierarchical clustering and DPVC (MCMC; note we show the clustering corresponding to the posterior sample with the highest joint probability). For both k-means and hierarchical clustering it was necessary to \u201ctweak\u201d the number of clusters to give a sensible result. Hierarchical clustering in particular appeared to have a strong bias towards putting the majority of the genes in one large cluster/clade. Note that such a visualisation is straightforward only because we have used a clustering based method rather than a factor analysis model, emphasising how partitionings can be more useful summaries of data for certain tasks than low dimensional embeddings.\n\n5.3 Yeast in varying environmental conditions\n\nWe use the data set of Gasch et al. (2000), a collection of N = 175 non-cell-cycle experiments on S. cerevisiae (yeast), including conditions such as heat shock, nitrogen depletion and amino acid starvation. Measurements are available for D = 6152 genes. Again we ran 10 repeats holding out a different random 10% of the elements of the matrix as test data each time. The results shown in Table 1 are broadly consistent with our findings for the breast cancer dataset: DPVC sits between NSFA and the less performant DPM and FA models. 
Running 1000 iterations of DPVC MCMC on this dataset takes around 1.2 hours on a standard dual core desktop running at 2.5GHz with 4GB RAM. Unfortunately we were unable to run the VB algorithm on a dataset of this size due to memory constraints.\n\n6 Discussion\n\nWe have introduced DPVC, a model for clustering variables into highly correlated subsets. While, as expected, we found the predictive performance of DPVC is somewhat worse than that of state of the art nonparametric sparse factor analysis models (e.g. NSFA), DPVC outperforms both nonparametric mixture models and Bayesian factor analysis models when applied to high dimensional data such as gene expression microarrays. For a practitioner we see interpretability as the key advantage of DPVC relative to a model such as NSFA: one can immediately see which groups of variables are correlated, and use this knowledge to guide further analysis. An example use one could envisage would be using DPVC in an analogous fashion to principal components regression: regressing a dependent variable against the inferred factors X. Regression coefficients would then correspond to the predictive ability of the clusters of variables.\n\n7 Acknowledgements\n\nThis work was supported by the Engineering and Physical Sciences Research Council (EPSRC) under Grant Numbers EP/I036575/1 and EP/H019472/1.\n\nReferences\n\nAlon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., and Levine, A. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745.\nBasak, S., Balaban, A., Grunwald, G., and Gute, B. (2000a). Topological indices: their nature and mutual relatedness. 
Journal of Chemical Information and Computer Sciences, 40(4):891\u2013898.\nBasak, S., Grunwald, G., Gute, B., Balasubramanian, K., and Opitz, D. (2000b). Use of statistical and neural net approaches in predicting toxicity of chemicals. Journal of Chemical Information and Computer Sciences, 40(4):885\u2013890.\nCarvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. (2008). High-dimensional sparse factor modeling: Applications in gene expression genomics. Journal of the American Statistical Association, 103(484):1438\u20131456.\nD\u2019haeseleer, P. et al. (2005). How does gene expression clustering work? Nature Biotechnology, 23(12):1499\u20131502.\nDuda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern Classification. Wiley-Interscience, 2nd edition.\nEisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863\u201314868.\nFevotte, C. and Godsill, S. J. (2006). A Bayesian approach for blind separation of sparse sources. IEEE Transactions on Audio, Speech, and Language Processing, 14(6):2174\u20132188.\nFokoue, E. (2004). Stochastic determination of the intrinsic structure in Bayesian factor analysis. Technical report, Statistical and Applied Mathematical Sciences Institute.\nGasch, A., Spellman, P., Kao, C., Carmel-Harel, O., Eisen, M., Storz, G., Botstein, D., and Brown, P. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell, 11(12):4241\u20134257.\nGeweke, J. (2004). Getting it right. Journal of the American Statistical Association, 99(467):799\u2013804.\nHotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417\u2013441.\nKaufman, G. M. and Press, S. J. (1973). Bayesian factor analysis. 
Technical Report 662-73, Sloan School of Management, Massachusetts Institute of Technology.\nKnowles, D. A. and Ghahramani, Z. (2007). Infinite sparse factor analysis and infinite independent components analysis. In 7th International Conference on Independent Component Analysis and Signal Separation, volume 4666, pages 381\u2013388. Springer.\nKnowles, D. A. and Ghahramani, Z. (2011). Nonparametric Bayesian sparse factor models with application to gene expression modeling. The Annals of Applied Statistics, 5(2B):1534\u20131552.\nMinka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In Conference on Uncertainty in Artificial Intelligence (UAI), volume 17.\nMinka, T. P., Winn, J. M., Guiver, J. P., and Knowles, D. A. (2010). Infer.NET 2.4.\nNeal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249\u2013265.\nNeal, R. M. (2003). Slice sampling. The Annals of Statistics, 31(3):705\u2013741.\nNiu, D., Dy, J., and Ghahramani, Z. (2012). A nonparametric Bayesian model for multiple clustering with overlapping feature views. Journal of Machine Learning Research, 22:814\u2013822.\nPearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine Series 6, 2:559\u2013572.\nPitman, J. (2002). Combinatorial stochastic processes. Technical report, Department of Statistics, University of California at Berkeley.\nRai, P. and Daum\u00e9 III, H. (2008). The infinite hierarchical factor regression model. In Advances in Neural Information Processing Systems (NIPS).\nRowe, D. B. and Press, S. J. (1998). Gibbs sampling and hill climbing in Bayesian factor analysis. Technical Report 255, Department of Statistics, University of California Riverside.\nRoweis, S. (1998). EM algorithms for PCA and SPCA. 
In Advances in Neural Information Processing Systems (NIPS), pages 626\u2013632. MIT Press.\nSanche, R. and Lonergan, K. (2006). Variable reduction for predictive modeling with clustering. In Casualty Actuarial Society Forum, pages 89\u2013100.\nShafto, P., Kemp, C., Mansinghka, V., Gordon, M., and Tenenbaum, J. (2006). Learning cross-cutting systems of categories. In Proceedings of the 28th Annual Conference of the Cognitive Science Society, pages 2146\u20132151.\nSilva, R., Scheines, R., Glymour, C., and Spirtes, P. (2006). Learning the structure of linear latent variable models. The Journal of Machine Learning Research, 7:191\u2013246.\nTipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 61(3):611\u2013622.\nVigneau, E. and Qannari, E. (2003). Clustering of variables around latent components. Communications in Statistics - Simulation and Computation, 32(4):1131\u20131150.\nWest, M., Chang, J., Lucas, J., Nevins, J. R., Wang, Q., and Carvalho, C. (2007). High-dimensional sparse factor modelling: Applications in gene expression genomics. Technical report, ISDS, Duke University.\nWinn, J. and Bishop, C. M. (2006). Variational message passing. Journal of Machine Learning Research, 6:661\u2013694.\nYoung, G. (1941). Maximum likelihood estimation and factor analysis. Psychometrika, 6(1):49\u201353.\n", "award": [], "sourceid": 1354, "authors": [{"given_name": "Konstantina", "family_name": "Palla", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "David", "family_name": "Knowles", "institution": null}]}