{"title": "Generalized Model Selection for Unsupervised Learning in High Dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 970, "page_last": 976, "abstract": null, "full_text": "Generalized Model Selection For Unsupervised Learning In High Dimensions

Shivakumar Vaithyanathan
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95136
Shiv@almaden.ibm.com

Byron Dom
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95136
dom@almaden.ibm.com

Abstract

We describe a Bayesian approach to model selection in unsupervised learning that determines both the feature set and the number of clusters. We then evaluate this scheme (based on marginal likelihood) and one based on cross-validated likelihood. For the Bayesian scheme we derive a closed-form solution of the marginal likelihood by assuming appropriate forms of the likelihood function and prior. Extensive experiments compare these approaches and all results are verified by comparison against ground truth. In these experiments the Bayesian scheme using our objective function gave better results than cross-validation.

1 Introduction

Recent efforts define the model selection problem as one of estimating the number of clusters [10, 17]. It is easy to see, particularly in applications with a large number of features, that various choices of feature subsets will reveal different structures underlying the data. It is our contention that this interplay between the feature subset and the number of clusters is essential to provide appropriate views of the data. We thus define the problem of model selection in clustering as selecting both the number of clusters and the feature subset. Towards this end we propose a unified objective function whose arguments include both the feature space and the number of clusters. We then describe two approaches to model selection using this objective function.
The first approach is based on a Bayesian scheme using the marginal likelihood for model selection. The second approach is based on a scheme using cross-validated likelihood. In section 3 we apply these approaches to document clustering by making assumptions about the document generation model. Further, for the Bayesian approach we derive a closed-form solution for the marginal likelihood using this document generation model. We also describe a heuristic for initial feature selection based on the distributional clustering of terms. Section 4 describes the experimental setup and our approach to validating the proposed models and algorithms. Section 5 reports and discusses the results of our experiments and finally section 6 concludes and provides directions for future work.

2 Model selection in clustering

Model selection approaches in clustering have primarily concentrated on determining the number of components/clusters. These attempts include Bayesian approaches [7, 10], MDL approaches [15] and cross-validation techniques [17]. As noted in [17], however, the optimal number of clusters is dependent on the feature space in which the clustering is performed. Related work has been described in [7].

2.1 A generalized model for clustering

Let D be a data-set consisting of "patterns" {d_1, ..., d_V}, which we assume to be represented in some feature space T with dimension M. The particular problem we address is that of clustering D into groups such that its likelihood, described by a probability model p(D_T | Ω), is maximized, where D_T indicates the representation of D in feature space T and Ω is the structure of the model, which consists of the number of clusters, the partitioning of the feature set (explained below) and the assignment of patterns to clusters.
This model is a weighted sum of models {p(D_T | Ω, Θ) : Θ ∈ R^m}, where Θ is the set of all parameters associated with Ω. To define our model we begin by assuming that the feature space T consists of two sets: U (useful features) and N (noise features). Our feature-selection problem will thus consist of partitioning T (into U and N) for a given number of clusters.

Assumption 1: The feature sets represented by U and N are conditionally independent:

p(D_T | Ω, Θ) = p(D_N | Ω, Θ) p(D_U | Ω, Θ)   (1)

where D_N indicates the data represented in the noise feature space and D_U indicates the data represented in the useful feature space.

Using assumption 1 and assuming that the data is independently drawn, we can rewrite equation (1) as

p(D_T | Ω, Θ) = { ∏_{j=1}^{V} p(d_j^N | Θ^N) } · { ∏_{k=1}^{K} ∏_{j ∈ D_k} p(d_j^U | Θ_k^U) }   (2)

where V is the number of patterns in D, p(d_j^U | Θ_k^U) is the probability of d_j^U given the parameter vector Θ_k^U and p(d_j^N | Θ^N) is the probability of d_j^N given the parameter vector Θ^N. Note that while the explicit dependence on Ω has been removed in this notation, it is implicit in the number of clusters K and the partition of T into N and U.

2.2 Bayesian approach to model selection

The objective function represented in equation (2) is not regularized and attempts to optimize it directly may result in the set N becoming empty, resulting in overfitting. To overcome this problem we use the marginal likelihood [2].

Assumption 2: All parameter vectors are independent:

π(Θ) = π(Θ^N) · ∏_{k=1}^{K} π(Θ_k^U)

where π(...) denotes a Bayesian prior distribution. The marginal likelihood, using assumption 2, can be written as

P(D_T | Ω) = ∫_{S_N} [ ∏_{j=1}^{V} p(d_j^N | Θ^N) ] π(Θ^N) dΘ^N · ∏_{k=1}^{K} ∫_{S_U} [ ∏_{j ∈ D_k} p(d_j^U | Θ_k^U) ] π(Θ_k^U) dΘ_k^U   (3)

where S_N, S_U are integration limits appropriate to the particular parameter spaces. These will be omitted to simplify the notation.
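To make the factorized likelihood of equation (2) concrete, here is a minimal Python sketch that evaluates it in log space for multinomial count data (the model class adopted in section 3); the documents, cluster assignment and parameter values below are invented purely for illustration:

```python
import math

def log_multinomial_pmf(counts, probs):
    """Log-probability of one pattern's count vector under a multinomial."""
    n = sum(counts)
    out = math.lgamma(n + 1)
    for c, p in zip(counts, probs):
        if c:  # skip zero counts so unused zero-probability terms are harmless
            out += c * math.log(p) - math.lgamma(c + 1)
    return out

def log_likelihood(docs_U, docs_N, assign, theta_U, theta_N):
    """Equation (2): one shared parameter vector for the noise features,
    one parameter vector per cluster for the useful features."""
    ll = sum(log_multinomial_pmf(d, theta_N) for d in docs_N)
    ll += sum(log_multinomial_pmf(d, theta_U[assign[j]])
              for j, d in enumerate(docs_U))
    return ll

# Toy example: 3 documents, 2 useful terms, 2 noise terms, K = 2 clusters.
docs_U = [[5, 0], [4, 1], [0, 6]]
docs_N = [[2, 3], [1, 1], [3, 2]]
assign = [0, 0, 1]                   # cluster index for each document
theta_U = [[0.9, 0.1], [0.1, 0.9]]   # per-cluster useful-term probabilities
theta_N = [0.5, 0.5]                 # shared noise-term probabilities
print(log_likelihood(docs_U, docs_N, assign, theta_U, theta_N))
```

Maximizing this quantity directly over the partition of T would drive N empty, which is exactly the overfitting problem the marginal likelihood of equation (3) is introduced to address.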
3.0 Document clustering

Document clustering algorithms typically start by representing the document as a "bag-of-words" in which the features can number ~10^4 to 10^5. Ad-hoc dimensionality reduction techniques such as stop-word removal, frequency-based truncations [16] and techniques such as LSI [5] are available. Once the dimensionality has been reduced, the documents are usually clustered into an arbitrary number of clusters.

3.1 Multinomial models

Several models of text generation have been studied [3]. Our choice is multinomial models using term counts as the features. This choice introduces another parameter indicating the probability of the N and U split. This is equivalent to assuming a generation model where for each document the number of noise and useful terms is determined by a probability θ_S and then the terms in a document are "drawn" with a probability (θ_n or θ_{k,u}).

3.2 Marginal likelihood / stochastic complexity

To apply our Bayesian objective function we begin by substituting multinomial models into (3) and simplifying to obtain

P(D | Ω) = ( t_N + t_U choose t_N ) ∫ (θ_S)^{t_N} (1 − θ_S)^{t_U} π(θ_S) dθ_S
  · ∏_{k=1}^{K} [ ∏_{i ∈ D_k} ( t_i^U choose {t_{i,u} | u ∈ U} ) ] ∫ [ ∏_{u ∈ U} (θ_{k,u})^{t_u^{(k)}} ] π(Θ_k^U) dΘ_k^U
  · ∏_{j=1}^{V} ( t_j^N choose {t_{j,n} | n ∈ N} ) ∫ [ ∏_{n ∈ N} (θ_n)^{t_n} ] π(Θ^N) dΘ^N   (4)

where ( · choose · ) denotes the multinomial coefficient, t_{i,u} is the number of occurrences of the feature term u in document i, t_i^U = Σ_{u ∈ U} t_{i,u} is the total number of useful features (terms) in document i, t_{j,n} and t_j^N are to be interpreted similarly but for noise features, t_n = Σ_{j=1}^{V} t_{j,n}, t_u^{(k)} = Σ_{i ∈ D_k} t_{i,u}, t_N is the total number of all noise features in all patterns and t_U is the total number of all useful features in all patterns.

To solve (4) we still need a form for the priors {π(...)}.
The Beta family is conjugate to the Binomial family [2] and we choose the Dirichlet distribution (multiple Beta) as the form for both π(Θ_k^U) and π(Θ^N), and the Beta distribution for π(θ_S). Substituting these into equation (4) and simplifying yields

P(D | Ω) = [ Γ(γ_a + γ_b) / (Γ(γ_a) Γ(γ_b)) · Γ(t_N + γ_a) Γ(t_U + γ_b) / Γ(t_N + t_U + γ_a + γ_b) ]
  · [ Γ(β) / Γ(β + t_N) · ∏_{n ∈ N} Γ(β_n + t_n) / Γ(β_n) ]
  · [ Γ(α') / Γ(α' + V) · ∏_{k=1}^{K} Γ(α'_k + |D_k|) / Γ(α'_k) ]
  · ∏_{k=1}^{K} [ Γ(α) / Γ(α + t_U^{(k)}) · ∏_{u ∈ U} Γ(α_u + t_u^{(k)}) / Γ(α_u) ]   (5)

where β_n and α_u are the hyper-parameters of the Dirichlet priors for the noise and useful features respectively, β = Σ_n β_n, α = Σ_u α_u, α' = Σ_k α'_k and Γ(·) is the gamma function. Further, γ_a and γ_b are the hyper-parameters of the Beta prior for the split probability, |D_k| is the number of documents in cluster k and t_U^{(k)} = Σ_{i ∈ D_k} t_i^U. The results reported for our evaluation will be the negative of the log of equation (5), which (following Rissanen [14]) we refer to as Stochastic Complexity (SC). In our experiments all values of the hyper-parameters β_j, α_j, α'_k, γ_a and γ_b are set equal to 1, yielding uniform priors.

3.3 Cross-validated likelihood

To compute the cross-validated likelihood using multinomial models we first substitute the multinomial functional forms, using the MLEs found on the training set. This results in the following equation

P(cv_T | Ω, Θ̂) = [ (θ̂_S)^{t_N} (1 − θ̂_S)^{t_U} ] · ∏_{j=1}^{V} p(cv_j^N | Θ̂^N) · ∏_{k=1}^{K} ∏_{j ∈ D_k} p(cv_j^U | Θ̂_k^U)   (6)

where θ̂_S, Θ̂^N and Θ̂_k^U are the MLEs of the appropriate parameter vectors. For our implementation of MCCV, following the suggestion in [17], we have used a 50% split of the training and test set.
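To sketch how the closed form in equation (5) is evaluated numerically, the Python fragment below computes one of its factors, the per-cluster Dirichlet-multinomial term Γ(α)/Γ(α + t_U^(k)) ∏_u Γ(α_u + t_u^(k))/Γ(α_u), in log space with the log-gamma function; the uniform hyper-parameter value 1 matches our experimental setting, while the counts are invented:

```python
import math

def log_dirichlet_multinomial(counts, alpha=1.0):
    """Log of Gamma(A)/Gamma(A + n) * prod_u Gamma(alpha_u + c_u)/Gamma(alpha_u)
    for symmetric hyper-parameters alpha_u = alpha, with A = alpha * |U|
    and n = sum of the counts."""
    A = alpha * len(counts)
    n = sum(counts)
    out = math.lgamma(A) - math.lgamma(A + n)
    for c in counts:
        out += math.lgamma(alpha + c) - math.lgamma(alpha)
    return out

# Pooled term counts t_u^(k) for one hypothetical cluster.
cluster_counts = [10, 3, 0, 7]
log_marginal = log_dirichlet_multinomial(cluster_counts)
print(-log_marginal)  # this cluster's contribution to the stochastic complexity
```

The other factors of equation (5), the Beta marginal for the split probability and the noise-feature term, have the same log-gamma structure, so the full SC is a sum of such contributions.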
For the vCV criterion, although a value of v = 10 was suggested therein, for computational reasons we have used a value of v = 5.

3.4 Feature subset selection algorithm for document clustering

As noted in section 2.1, for a feature-set of size M there are a total of 2^M partitions and for large M it would be computationally intractable to search through all possible partitions to find the optimal subset. In this section we propose a heuristic method to obtain a subset of tokens that are topical (indicative of underlying topics) and can be used as features in the bag-of-words model to cluster documents.

3.4.1 Distributional clustering for feature subset selection

Identifying content-bearing and topical terms is an active research area [9]. We are less concerned with modeling the exact distributions of individual terms than we are with simply identifying groups of terms that are topical. Distributional clustering (DC), apparently first proposed by Pereira et al. [13], has been used for feature selection in supervised text classification [1] and clustering images in video sequences [8]. We hypothesize that function, content-bearing and topical terms have different distributions over the documents. DC helps reduce the size of the search space for feature selection from 2^M to 2^C, where C is the number of clusters produced by the DC algorithm. Following the suggestions in [9], we compute the following histogram for each token: the first bin is the number of documents with zero occurrences of the token, the second bin is the number of documents with a single occurrence of the token and the third bin is the number of documents that contain two or more occurrences of the token. The histograms are clustered using relative entropy Δ(· || ·) as a distance measure. For two terms with probability distributions p_1(·)
and p_2(·), this is given by [4]:

Δ(p_1 || p_2) = Σ_t p_1(t) log [ p_1(t) / p_2(t) ]   (7)

We use a k-means-style algorithm in which the histograms are normalized to sum to one and the sum in equation (7) is taken over the three bins corresponding to counts of 0, 1, and ≥ 2. During the assignment-to-clusters step of k-means we compute Δ(p_w || p_{C_k}) (where p_w is the normalized histogram for term w and p_{C_k} is the centroid of cluster k) and the term w is assigned to the cluster for which this is minimum [13, 8].

4.0 Experimental setup

Our evaluation experiments compared the clustering results against human-labeled ground truth. The corpus used was the AP Reuters Newswire articles from the TREC-6 collection. A total of 8235 documents, from the routing track, existing in 25 classes were analyzed in our experiments. To simplify matters we disregarded multiple assignments and retained each document as a member of a single class.

4.1 Mutual information as an evaluation measure of clustering

We verify our models by comparing our clustering results against pre-classified text. We force all clustering algorithms to produce exactly as many clusters as there are classes in the pre-classified text and we report the mutual information [4] (MI) between the cluster labels and the pre-classified class labels.

5.0 Results and discussions

After tokenizing the documents and discarding terms that appeared in fewer than 3 documents we were left with 32,450 unique terms. We experimented with several numbers of clusters for DC but report only the best (lowest SC) for lack of space. For each of these we chose the best of 20 runs corresponding to different random starting clusters. Each of these sets includes one cluster that consists of high-frequency words which, upon examination, were found to contain primarily function words; we eliminated this cluster from further consideration.
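The MI evaluation measure described in section 4.1, normalized so that its theoretical maximum is 1.0, can be sketched as follows; normalizing by the entropy of the class labels is our assumption of one reasonable choice of normalization:

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """I(A; B) in nats, estimated from two parallel label sequences."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum((c / n) * math.log(c * n / (pa[a] * pb[b]))
               for (a, b), c in joint.items())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mi(cluster_labels, class_labels):
    """MI scaled so that a clustering reproducing the classes scores 1.0."""
    h = entropy(class_labels)
    return mutual_information(cluster_labels, class_labels) / h if h else 0.0

# A clustering that agrees perfectly with the class labels scores 1.0.
print(normalized_mi([0, 0, 1, 1], ["x", "x", "y", "y"]))
```

Since I(A; B) ≤ min(H(A), H(B)), this normalized score always lies in [0, 1].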
The remaining non-function-word clusters were used as feature sets for the clustering algorithm. Only combinations of feature sets that produced good results were used for further document clustering runs.

We initialized the EM algorithm using the k-means algorithm; other initialization schemes are discussed in [11]. The feature vectors used in this k-means initialization were generated using the pivoted normal weighting suggested in [16]. All parameter vectors Θ_k^U and Θ^N were estimated using Laplace's Rule of Succession [2]. Table 1 shows the best results of the SC criterion, the vCV and the MCCV using the feature subsets selected by the different combinations of distributional clusters. The feature subsets are coded as FSXP, where X indicates the number of clusters in the distributional clustering and P indicates the cluster number(s) used as U. For SC and MI all results reported are averages over 3 runs of the k-means+EM combination with different initializations for k-means. For clarity, the MI numbers reported are normalized such that the theoretical maximum is 1.0. We also show comparisons against no feature selection (NF) and LSI. For LSI, the principal 165 eigenvectors were retained and k-means clustering was performed in the reduced-dimensional space. While determining the number of clusters, for computational reasons we have limited our evaluation to only the feature subset that provided us with the highest MI, i.e., FS41-3.

Feature Set | Useful Features | SC ×10^7 | vCV ×10^7 | MCCV ×10^7 | MI
FS41-3      | 6,157           | 2.66     | 0.61      | 1.32       | 0.61
FS52        | 386             | 2.8      | 0.3       | 0.69       | 0.51
NF          | 32,450          | 2.96     | 1.25      | 2.8        | 0.58
LSI         | 32,450/165      | NA       | NA        | NA         | 0.57

Table 1: Comparison of results

[Figure 1: SC vs. MI]
[Figure 2: MCCV averaged log-likelihood vs. MI]

[Figure 3: vCV averaged log-likelihood vs. MI]

5.3 Discussion

The consistency between the MI and SC (Figure 1) is striking. The monotonic trend is more apparent at higher SC, indicating that bad clusterings are more easily detected by SC, while as the solution improves the differences are more subtle. Note that the best values of SC and MI coincide. Given the assumptions made in deriving equation (5), this consistency is encouraging. The interested reader is referred to [18] for more details. Figures 2 and 3 indicate that there is certainly a reasonable consistency between the cross-validated likelihood and the MI, although not as striking as for SC. Note that the MI for the feature sets picked by MCCV and vCV is significantly lower than that of the best feature-set. Figures 4, 5 and 6 show the plots of SC, MCCV and vCV as the number of clusters is increased. Using SC we see that FS41-3 reveals an optimal structure around 40 clusters. As with feature selection, both MCCV and vCV obtain models of lower complexity than SC. Both show an optimum of about 30 clusters. More experiments are required before we draw final conclusions; however, the full Bayesian approach seems a practical and useful approach for model selection in document clustering. Our choice of likelihood function and priors provides a closed-form solution that is computationally tractable and provides meaningful results.

6.0 Conclusions

In this paper we tackled the problem of model structure determination in clustering. The main contribution of the paper is a Bayesian objective function that treats optimal model selection as choosing both the number of clusters and the feature subset. An important aspect of our work is a formal notion that forms a basis for doing feature selection in unsupervised learning.
We then evaluated two approaches for model selection: one using this objective function and the other based on cross-validation. Both approaches performed reasonably well, with the Bayesian scheme outperforming the cross-validation approaches in feature selection. More experiments using different parameter settings for the cross-validation schemes and different priors for the Bayesian scheme should result in better understanding and therefore more powerful applications of these approaches.

[Figures 4, 5 and 6: SC, MCCV and vCV as the number of clusters is increased]

References

[1] Baker, D., et al., Distributional Clustering of Words for Text Classification, SIGIR, 1998.
[2] Bernardo, J. M. and Smith, A. F. M., Bayesian Theory, Wiley, 1994.
[3] Church, K. W., et al., Poisson Mixtures, Natural Language Engineering, 1(2), 1995.
[4] Cover, T. M. and Thomas, J. A., Elements of Information Theory, Wiley-Interscience, 1991.
[5] Deerwester, S., et al., Indexing by Latent Semantic Analysis, JASIS, 1990.
[6] Dempster, A., et al., Maximum Likelihood from Incomplete Data via the EM Algorithm, JRSS, 39, 1977.
[7] Hanson, R., et al., Bayesian Classification with Correlation and Inheritance, IJCAI, 1991.
[8] Iyengar, G., Clustering images using relative entropy for efficient retrieval, VLBV, 1998.
[9] Katz, S. M., Distribution of content words and phrases in text and language modeling, NLE, 2, 1996.
[10] Kontkanen, P. T.
et al., Comparing Bayesian Model Class Selection Criteria by Discrete Finite Mixtures, ISIS'96 Conference, 1996.
[11] Meila, M. and Heckerman, D., An Experimental Comparison of Several Clustering and Initialization Methods, MSR-TR-98-06.
[12] Nigam, K., et al., Learning to Classify Text from Labeled and Unlabeled Documents, AAAI, 1998.
[13] Pereira, F. C. N., et al., Distributional clustering of English words, ACL, 1993.
[14] Rissanen, J., Stochastic Complexity in Statistical Inquiry, World Scientific, 1989.
[15] Rissanen, J. and Ristad, E., Unsupervised classification with stochastic complexity, The US/Japan Conference on the Frontiers of Statistical Modeling, 1992.
[16] Singhal, A., et al., Pivoted Document Length Normalization, SIGIR, 1996.
[17] Smyth, P., Clustering using Monte Carlo cross-validation, KDD, 1996.
[18] Vaithyanathan, S. and Dom, B., Model Selection in Unsupervised Learning with Applications to Document Clustering, IBM Research Report RJ-10137 (95012), Dec. 14, 1998.
", "award": [], "sourceid": 1644, "authors": [{"given_name": "Shivakumar", "family_name": "Vaithyanathan", "institution": null}, {"given_name": "Byron", "family_name": "Dom", "institution": null}]}