{"title": "A Probabilistic Approach for Optimizing Spectral Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 571, "page_last": 578, "abstract": null, "full_text": "A Probabilistic Approach for Optimizing Spectral Clustering\n\n\n\nRong Jin , Chris Ding , Feng Kang Lawrence Berkeley National Laboratory, Berkeley, CA 94720 Michigan State University, East Lansing , MI 48824\n\nAbstract\nSpectral clustering enjoys its success in both data clustering and semisupervised learning. But, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped by the local optimum. In this paper, we present a new spectral clustering algorithm, named \"Soft Cut\". It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm.\n\n1\n\nIntroduction\n\nData clustering has been an active research area with a long history. Well-known clustering methods include the K-means methods (Hartigan & Wong., 1994), Gaussian Mixture Model (Redner & Walker, 1984), Probabilistic Latent Semantic Indexing (PLSI) (Hofmann, 1999), and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Recently, spectral clustering methods (Shi & Malik, 2000; Ng et al., 2001; Zha et al., 2002; Ding et al., 2001; Bach & Jordan, 2004)have attracted more and more attention given their promising performance in data clustering and simplicity in implementation. They treat the data clustering problem as a graph partitioning problem. 
In its simplest form, a minimum cut algorithm is used to minimize the weights (or similarities) assigned to the removed edges. To avoid unbalanced clustering results, different objectives have been proposed, including the ratio cut (Hagen & Kahng, 1991), normalized cut (Shi & Malik, 2000) and min-max cut (Ding et al., 2001). To reduce the computational complexity, most spectral clustering algorithms use a relaxation approach, which maps discrete cluster memberships into continuous real numbers. As a result, it is difficult to apply current spectral clustering algorithms directly to multi-class clustering problems. Various strategies (Shi & Malik, 2000; Ng et al., 2001; Yu & Shi, 2003) have been used to extend spectral clustering algorithms to multi-class clustering problems. One common approach is to first construct a low-dimensional space for data representation using the smallest eigenvectors of a graph Laplacian built from the pairwise similarities of the data. Then, a standard clustering algorithm, such as the K-means method, is applied to cluster the data points in this low-dimensional space.\n\nOne problem with the above approach is how to determine the appropriate number of eigenvectors. Too few eigenvectors lead to an insufficient representation of the data, while too many introduce a significant amount of noise into the representation. Both cases degrade the quality of clustering. Although it has been shown in (Ng et al., 2001) that the number of required eigenvectors is generally equal to the number of clusters, the analysis is valid only when the data points of different clusters are well separated. As will be shown later, when the data points are not well separated, the optimal number of eigenvectors can differ from the number of clusters. 
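The eigenvector-embedding strategy just described can be sketched end to end. A minimal numpy version, with plain Lloyd iterations and a deterministic farthest-point initialization standing in for a full K-means implementation (the toy graph below is an assumption of this sketch, not data from the paper):

```python
import numpy as np

def spectral_embed(W, k):
    # Rows become k-dimensional points: the k eigenvectors of the
    # graph Laplacian L = D - W with the smallest eigenvalues.
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return vecs[:, :k]

def kmeans(X, k, iters=50):
    # Lloyd iterations with a deterministic farthest-point initialization.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# two triangles joined by one weak edge: a clear two-cluster graph
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
labels = kmeans(spectral_embed(W, 2), 2)   # recovers the two triangles
```

Here k is set to the number of clusters; the point made below is precisely that this choice is not always optimal.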
Another problem with existing spectral clustering algorithms is that they are based on binary cluster memberships and therefore are unable to express the uncertainty in data clustering. Compared to hard cluster memberships, probabilistic memberships are advantageous in that they are less likely to be trapped in local minima. One example is the Bayesian clustering method (Redner & Walker, 1984), which is usually more robust than the K-means method because of its soft cluster memberships. Probabilistic memberships are also advantageous when the cluster memberships are intermediate results to be used by other processes, for example, selective sampling in active learning (Jin & Si, 2004). In this paper, we present a new spectral clustering algorithm, named \"Soft Cut\", that explicitly addresses the above two problems. It extends the normalized cut algorithm by introducing probabilistic memberships of data points. By encoding the membership of multiple clusters into a set of probabilities, the proposed clustering algorithm can be applied directly to multi-class clustering problems. Our empirical studies with a variety of datasets show that the soft cut algorithm can substantially outperform the normalized cut algorithm for multi-class clustering. The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes the soft cut algorithm. Section 4 discusses the experimental results. Section 5 concludes this study and discusses future work.\n\n2\n\nRelated Work\n\nThe key idea of spectral clustering is to convert a clustering problem into a graph partitioning problem. Let n be the number of data points to be clustered. Let W = [w_{i,j}]_{n x n} be the weight matrix, where each w_{i,j} is the similarity between two data points. For the convenience of discussion, we set w_{i,i} = 0 for all data points. 
Then, a clustering problem can be formulated as the minimum cut problem, i.e.,\n\nq* = argmin_{q in {-1,1}^n} sum_{i,j=1}^n w_{i,j} (q_i - q_j)^2 = argmin_{q in {-1,1}^n} q^T L q    (1)\n\nwhere q = (q_1, q_2, ..., q_n) is a vector of binary memberships and each q_i is either -1 or 1. L is the Laplacian matrix, defined as L = D - W, where D = [d_{i,i}]_{n x n} is a diagonal matrix with each element d_{i,i} = sum_{j=1}^n w_{i,j}. Directly solving the problem in (1) requires combinatorial optimization, which is computationally expensive. Usually, a relaxation approach (Chung, 1997) is used to replace the vector q in {-1,1}^n with a vector q' in R^n under the constraint sum_{i=1}^n (q'_i)^2 = n. As a result of the relaxation, the approximate solution to (1) is the second smallest eigenvector of the Laplacian L. One problem with the minimum cut approach is that it does not take into account the sizes of the clusters, which can lead to clusters of unbalanced sizes. To resolve this problem, several different criteria have been proposed, including the ratio cut (Hagen & Kahng, 1991), normalized cut (Shi & Malik, 2000) and min-max cut (Ding et al., 2001). For example, in the normalized cut algorithm, the following objective is used:\n\nJ_n(q) = C_{+,-}(q)/D_+(q) + C_{+,-}(q)/D_-(q)    (2)\n\nwhere C_{+,-}(q) = sum_{i,j=1}^n w_{i,j} delta(q_i, +) delta(q_j, -) and D_{+/-}(q) = sum_{i=1}^n delta(q_i, +/-) sum_{j=1}^n w_{i,j}. In the above objective, the cluster sizes D_{+/-} serve as denominators to penalize clusters of too small a size. Similar to the minimum cut approach, a relaxation approach is used to convert the problem in (2) into an eigenvector problem. For multi-class clustering, we can extend the objective in (2) to the following form:\n\nJ_norm-mc(q) = sum_{z=1}^K sum_{z' != z} C_{z,z'}(q)/D_z(q)    (3)\n\nwhere K is the number of clusters, the vector q in {1, 2, ..., K}^n, C_{z,z'}(q) = sum_{i,j=1}^n delta(q_i, z) delta(q_j, z') w_{i,j}, and D_z(q) = sum_{i=1}^n delta(q_i, z) sum_{j=1}^n w_{i,j}. However, efficiently finding the solution that minimizes (3) is rather difficult. 
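For a hard assignment q, the multi-class objective in (3) can be evaluated directly from its definition, which makes the role of the denominators D_z concrete. A small numpy sketch (the six-node toy graph is illustrative, not from the paper):

```python
import numpy as np

def norm_cut_objective(W, q, K):
    # J(q) = sum_z sum_{z2 != z} C_{z,z2}(q) / D_z(q), with
    # C_{z,z2} = sum_{i,j} delta(q_i, z) delta(q_j, z2) w_{i,j}
    # D_z     = sum_i  delta(q_i, z) sum_j w_{i,j}
    degrees = W.sum(axis=1)
    J = 0.0
    for z in range(K):
        in_z = (q == z)
        D_z = degrees[in_z].sum()
        for z2 in range(K):
            if z2 != z:
                J += W[np.ix_(in_z, q == z2)].sum() / D_z
    return J

# two triangles joined by one weak edge (weight 0.1)
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
good = norm_cut_objective(W, np.array([0, 0, 0, 1, 1, 1]), 2)  # cuts only the bridge
bad = norm_cut_objective(W, np.array([0, 0, 1, 0, 1, 1]), 2)   # cuts through a triangle
```

The natural partition cuts only the bridge edge and scores far lower than a partition that splits a triangle, which is exactly what minimizing (3) rewards.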
In particular, a simple relaxation method cannot be applied directly here. In the past, several heuristic approaches (Shi & Malik, 2000; Ng et al., 2001; Yu & Shi, 2003) have been proposed for finding approximate solutions to (3). One common strategy is to first obtain the K smallest eigenvectors of the Laplacian L (excluding the one with zero eigenvalue) and project the data points onto the low-dimensional space spanned by these K eigenvectors. Then, a standard clustering algorithm, such as the K-means method, is applied to cluster the data points in this low-dimensional space. In contrast to these approaches, the proposed spectral clustering algorithm deals with the multi-class clustering problem directly. It estimates the probabilities for each data point to be in the different clusters simultaneously. Through the probabilistic cluster memberships, the proposed algorithm is less likely to be trapped in local minima and is therefore more robust than the existing spectral clustering algorithms.\n\n3\n\nSpectral Clustering with Soft Membership\n\nIn this section, we describe a new spectral clustering algorithm, named \"Soft Cut\", which extends the normalized cut algorithm by introducing probabilistic cluster memberships. In the following, we present a formal description of the soft cut algorithm, followed by a procedure that efficiently solves the related optimization problem.\n\n3.1\n\nAlgorithm Description\n\nFirst, notice that D_z in (3) can be expanded as D_z = sum_{z'=1}^K C_{z,z'}(q). Thus, the objective function for multi-class clustering in (3) can be rewritten as:\n\nJ_norm-mc(q) = sum_{z=1}^K sum_{z' != z} C_{z,z'}(q)/D_z(q) = K - sum_{z=1}^K C_{z,z}(q)/D_z(q)    (4)\n\nLet J'_norm-mc(q) = sum_{z=1}^K C_{z,z}(q)/D_z(q). Thus, instead of minimizing J_norm-mc, we can maximize J'_norm-mc. To extend this objective function to a probabilistic framework, we introduce probabilistic cluster memberships. Let q_{z,i} denote the probability for the i-th data point to be in the z-th cluster. 
Let the matrix Q = [q_{z,i}]_{K x n} collect all probabilities q_{z,i}. Using this probabilistic notation, we can rewrite C_{z,z} and D_z as follows:\n\nC_{z,z}(Q) = sum_{i,j=1}^n q_{z,i} q_{z,j} w_{i,j},    D_z(Q) = sum_{i,j=1}^n q_{z,i} w_{i,j}    (5)\n\nSubstituting the probabilistic expressions for C_{z,z} and D_z into J'_norm-mc, we have the following optimization problem for probabilistic spectral clustering:\n\nQ* = argmax_{Q in R^{K x n}} J_prob(Q) = argmax_{Q in R^{K x n}} sum_{z=1}^K [ sum_{i,j=1}^n q_{z,i} q_{z,j} w_{i,j} ] / [ sum_{i,j=1}^n q_{z,i} w_{i,j} ]\n\ns.t. for all i in [1..n] and z in [1..K]: q_{z,i} >= 0, sum_{z=1}^K q_{z,i} = 1    (6)\n\n3.2\n\nOptimization Procedure\n\nIn this subsection, we present a bound optimization algorithm (Salakhutdinov & Roweis, 2003) for efficiently finding a solution to (6). It maximizes the objective function in (6) iteratively. In each iteration, a concave lower bound is first constructed for the objective function based on the solution obtained from the previous iteration. Then, a new solution for the current iteration is obtained by maximizing this lower bound. The procedure is repeated until the solution converges to a local maximum.\n\nLet Q' = [q'_{z,i}]_{K x n} be the probabilities obtained in the previous iteration, and let Q = [q_{z,i}]_{K x n} be the probabilities for the current iteration. Define\n\nDelta(Q, Q') = log( J_prob(Q) / J_prob(Q') )\n\nwhich is the logarithm of the ratio of the objective functions between two consecutive iterations. 
Using the concavity of the logarithm, i.e., log( sum_i p_i x_i ) >= sum_i p_i log(x_i) for a pdf {p_i}, we have Delta(Q, Q') lower bounded by the following expression:\n\nDelta(Q, Q') = log sum_{z=1}^K C_{z,z}(Q)/D_z(Q) - log sum_{z=1}^K C_{z,z}(Q')/D_z(Q') >= sum_{z=1}^K t_z [ log( C_{z,z}(Q)/C_{z,z}(Q') ) - log( D_z(Q)/D_z(Q') ) ]    (7)\n\nwhere t_z is defined as:\n\nt_z = [ C_{z,z}(Q')/D_z(Q') ] / [ sum_{z'=1}^K C_{z',z'}(Q')/D_{z'}(Q') ]    (8)\n\nNow, the first term within the bracket in (7), i.e., log( C_{z,z}(Q)/C_{z,z}(Q') ), can be further relaxed as:\n\nlog( C_{z,z}(Q)/C_{z,z}(Q') ) = log sum_{i,j=1}^n [ q'_{z,i} q'_{z,j} w_{i,j} / C_{z,z}(Q') ] [ q_{z,i} q_{z,j} / (q'_{z,i} q'_{z,j}) ] >= 2 sum_{i,j=1}^n s^z_{i,j} log(q_{z,i}) - sum_{i,j=1}^n s^z_{i,j} log(q'_{z,i} q'_{z,j})    (9)\n\nwhere s^z_{i,j} is defined as:\n\ns^z_{i,j} = q'_{z,i} q'_{z,j} w_{i,j} / C_{z,z}(Q')    (10)\n\nMeanwhile, using the inequality log x <= x - 1, we have log( D_z(Q)/D_z(Q') ) upper bounded by the following expression:\n\nlog( D_z(Q)/D_z(Q') ) <= D_z(Q)/D_z(Q') - 1 = sum_{i,j=1}^n q_{z,i} w_{i,j} / D_z(Q') - 1    (11)\n\nPutting together (7), (9), and (11), we have a concave lower bound for the objective function in (6), i.e.,\n\nlog J_prob(Q) >= log J_prob(Q') + Delta_0(Q') + 2 sum_{z=1}^K sum_{i,j=1}^n t_z s^z_{i,j} log(q_{z,i}) - sum_{z=1}^K t_z sum_{i,j=1}^n q_{z,i} w_{i,j} / D_z(Q')    (12)\n\nwhere Delta_0(Q') is defined as:\n\nDelta_0(Q') = - sum_{z=1}^K t_z sum_{i,j=1}^n s^z_{i,j} log(q'_{z,i} q'_{z,j}) + 1\n\nThe optimal solution that maximizes the lower bound in (12) can be computed by setting its derivative to zero, which leads to the following solution:\n\nq_{z,i} = 2 t_z sum_{j=1}^n s^z_{i,j} / ( t_z sum_{j=1}^n w_{i,j}/D_z(Q') + lambda_i )    (13)\n\nwhere lambda_i is a Lagrange multiplier that ensures sum_{z=1}^K q_{z,i} = 1. It can be obtained by maximizing the following objective function:\n\nl(lambda_i) = -lambda_i + sum_{z=1}^K 2 t_z ( sum_{j=1}^n s^z_{i,j} ) log( t_z sum_{j=1}^n w_{i,j}/D_z(Q') + lambda_i )    (14)\n\nSince the above objective function is concave in lambda_i, we can apply a standard numerical procedure, such as Newton's method, to find its value efficiently.\n\n4\n\nExperiment\n\nIn this section, we focus on examining the effectiveness of the proposed soft cut algorithm for multi-class clustering. 
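Before turning to the experiments, the bound-optimization update of Section 3 can be summarized in code. This is an illustrative reading of equations (8), (10) and (13), not the authors' implementation; in particular, the per-point multiplier lambda_i is found here by bisection on the constraint sum_z q_{z,i} = 1 rather than by the Newton procedure suggested for (14):

```python
import numpy as np

def soft_cut_step(W, Q):
    # One bound-optimization iteration for the soft cut objective (6).
    # W: (n, n) symmetric similarities; Q: (K, n) current memberships
    # whose columns sum to one.  Returns the updated memberships.
    K, n = Q.shape
    deg = W.sum(axis=1)
    C = np.einsum('zi,ij,zj->z', Q, W, Q)   # C_{z,z}(Q')
    D = Q @ deg                             # D_z(Q')
    t = (C / D) / (C / D).sum()             # eq. (8)
    a = Q * (Q @ W) / C[:, None]            # a[z, i] = sum_j s^z_{i,j}, from eq. (10)
    newQ = np.empty_like(Q)
    for i in range(n):
        num = 2.0 * t * a[:, i]             # numerator of eq. (13), per cluster
        den = t * deg[i] / D                # t_z * sum_j w_{i,j} / D_z(Q')
        f = lambda lam: (num / (den + lam)).sum() - 1.0
        lo, hi = -den.min() + 1e-12, 1.0
        while f(hi) > 0.0:                  # bracket the root of f
            hi *= 2.0
        for _ in range(200):                # bisection for lambda_i
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if f(mid) > 0.0 else (lo, mid)
        newQ[:, i] = num / (den + 0.5 * (lo + hi))
    return newQ

def soft_cut_objective(W, Q):
    # J_prob(Q) = sum_z C_{z,z}(Q) / D_z(Q), from eqs. (5)-(6)
    return (np.einsum('zi,ij,zj->z', Q, W, Q) / (Q @ W.sum(axis=1))).sum()

# toy graph: two triangles plus a weak bridge; mildly informative start
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
Q0 = np.array([[0.6, 0.6, 0.55, 0.45, 0.4, 0.4],
               [0.4, 0.4, 0.45, 0.55, 0.6, 0.6]])
Q1 = soft_cut_step(W, Q0)
```

Each step maximizes the concave lower bound (12), so the objective is non-decreasing across iterations; repeating soft_cut_step until the change in the objective is negligible gives the full algorithm.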
In particular, we will address the following two research questions:\n\n1. How effective is the proposed algorithm for data clustering? We compare the proposed soft cut algorithm to the normalized cut algorithm with various numbers of eigenvectors.\n\n2. How robust is the proposed algorithm for data clustering? We evaluate the robustness of the clustering algorithms by examining their variance across multiple trials.\n\n4.1\n\nExperiment Design\n\nDatasets. To examine the effectiveness of the proposed soft cut algorithm extensively, a variety of datasets are used in this experiment:\n\n- Text documents extracted from the 20 newsgroups corpus to form two five-class datasets, named \"M5\" and \"L5\". Each class contains 100 documents, for a total of 500 documents.\n- Pendigit, which comes from the UCI data repository. It contains 2000 examples belonging to 10 different classes.\n- Ribosomal sequences from the RDP project (http://rdp.cme.msu.edu/index.jsp), consisting of annotated rRNA sequences of ribosomes for roughly 2000 different bacteria belonging to different phyla (i.e., classes).\n\nTable 1 provides detailed information about each dataset.\n\nTable 1: Dataset description\n| Dataset | Description | #Class | #Instance | #Features |\n| M5 | Text documents | 5 | 500 | 1000 |\n| L5 | Text documents | 5 | 500 | 1000 |\n| Pendigit | Pen-based handwriting | 10 | 2000 | 16 |\n| Ribosome | Ribosome rDNA sequences | 8 | 1907 | 27617 |\n\nEvaluation metrics. To evaluate the performance of the different clustering algorithms, two metrics are used:\n\n- Clustering accuracy. For the datasets with no more than five classes, clustering accuracy is used as the evaluation metric. To compute clustering accuracy, each automatically generated cluster is first aligned with a true class. The classification accuracy based on this alignment is then computed, and the clustering accuracy is defined as the maximum classification accuracy over all possible alignments.\n- Normalized mutual information. 
For the datasets with more than five classes, due to the expensive computation involved in finding the optimal alignment, we use the normalized mutual information (Banerjee et al., 2003) as an alternative evaluation metric. If T_u and T_l denote the cluster labels and the true class labels assigned to the data points, the normalized mutual information \"nmi\" is defined as\n\nnmi = 2 I(T_u, T_l) / ( H(T_u) + H(T_l) )\n\nwhere I(T_u, T_l) stands for the mutual information between the cluster labels T_u and the true class labels T_l, and H(T_u) and H(T_l) are the entropies of T_u and T_l, respectively.\n\nEach experiment was run 10 times with different initializations of the parameters. The averaged results, together with their variance, are used as the final evaluation.\n\nImplementation. We follow (Ng et al., 2001) in implementing the normalized cut algorithm. Cosine similarity is used to measure the affinity between any two data points. Both the EM algorithm and the K-means method are used to cluster the data points after projecting them onto the low-dimensional space spanned by the smallest eigenvectors of the graph Laplacian.\n\n4.2\n\nExperiment (I): Effectiveness of the Soft Cut Algorithm\n\nThe clustering results of both the soft cut algorithm and the normalized cut algorithm are summarized in Table 2. In addition to the K-means algorithm, we also apply the EM clustering algorithm to the normalized cut algorithm. In this experiment, the number of eigenvectors used for the normalized cut algorithms is equal to the number of clusters. First, comparing to both normalized cut algorithms, we see that the proposed clustering algorithm substantially outperforms the normalized cut algorithms on all datasets. Second,\n\nTable 2: Clustering results for the different clustering methods. Clustering accuracy is the evaluation metric for datasets \"L5\" and \"M5\", and normalized mutual information for \"Pendigit\" and \"Ribosome\". 
|  | M5 | L5 | Pendigit | Ribosome |\n| Soft Cut | 89.2 ± 1.3 | 69.2 ± 2.7 | 56.3 ± 3.8 | 69.7 ± 2.9 |\n| Normalized Cut (K-means) | 83.2 ± 8.8 | 64.2 ± 4.9 | 46.0 ± 6.4 | 62.2 ± 9.1 |\n| Normalized Cut (EM) | 62.4 ± 5.6 | 45.1 ± 4.8 | 52.8 ± 2.0 | 63.2 ± 3.8 |\n\ncomparing to the normalized cut algorithm using the K-means method, we see that the soft cut algorithm has a smaller variance in its clustering results. This can be explained by the fact that the K-means algorithm uses binary cluster memberships and is therefore likely to be trapped in local optima. As indicated in Table 2, if we replace the K-means algorithm with the EM algorithm in the normalized cut algorithm, the variance of the clustering results is generally reduced, but at the price of a degradation in clustering performance. Based on the above observations, we conclude that the soft cut algorithm is both effective and robust for multi-class clustering.\n\n4.3 Experiment (II): Normalized Cut Using Different Numbers of Eigenvectors\n\nOne potential reason why the normalized cut algorithm performs worse than the proposed algorithm is that the number of clusters may not be the optimal number of eigenvectors. To examine this issue, we test the normalized cut algorithm with different numbers of eigenvectors. The K-means method is used for clustering in the eigenspace. The results of the normalized cut algorithm using different numbers of eigenvectors are summarized in Table 3. The best performance is highlighted in bold font.\n\nTable 3: Clustering accuracy for the normalized cut with embedding in the eigenspace spanned by the given number of eigenvectors; K-means is used.\n| #Eigenvectors | M5 | L5 | Pendigit | Ribosome |\n| K | 83.2 ± 8.8 | 64.1 ± 4.9 | 46.0 ± 6.4 | 62.2 ± 9.1 |\n| K+1 | 77.6 ± 8.6 | 69.6 ± 6.7 | 43.3 ± 9.1 | 65.9 ± 5.8 |\n| K+2 | 79.7 ± 8.5 | 64.1 ± 5.7 | 41.6 ± 9.3 | 63.4 ± 4.8 |\n| K+3 | 80.2 ± 6.6 | 61.4 ± 5.8 | 42.9 ± 9.6 | 67.2 ± 7.6 |\n| K+4 | 74.9 ± 9.2 | 59.1 ± 4.7 | 47.5 ± 3.7 | 60.7 ± 8.4 |\n| K+5 | 70.5 ± 5.7 | 66.1 ± 4.7 | 39.2 ± 9.3 | 63.9 ± 8.2 |\n| K+6 | 75.5 ± 8.6 | 61.9 ± 4.7 | 43.4 ± 8.3 | 63.5 ± 10.4 |\n| K+7 | 75.8 ± 7.5 | 59.7 ± 5.6 | 46.8 ± 7.3 | 56.6 ± 10.7 |\n| K+8 | 73.5 ± 6.6 | 61.2 ± 4.7 | 49.8 ± 8.9 | 54.3 ± 7.2 |\n\n
First, we clearly see that the best clustering results do not necessarily occur when the number of eigenvectors is exactly equal to the number of clusters. In fact, in three out of four cases the best performance is achieved when the number of eigenvectors is larger than the number of clusters. This result indicates that the choice of the number of eigenvectors can have a significant impact on clustering performance. Second, comparing the results in Table 3 to those in Table 2, we see that the soft cut algorithm still outperforms the normalized cut algorithm even with the optimal number of eigenvectors. In general, since spectral clustering was originally designed for two-class problems, it requires an extra step when extended to multi-class clustering problems, and the resulting solutions are usually suboptimal. In contrast, the soft cut algorithm directly targets multi-class clustering problems and is thus able to achieve better performance than the normalized cut algorithm.\n\n5\n\nConclusion\n\nIn this paper, we proposed a novel probabilistic algorithm for spectral clustering, called the \"soft cut\" algorithm. It introduces probabilistic memberships into the normalized cut algorithm and directly targets multi-class clustering problems. Our empirical studies with a number of datasets have shown that the proposed algorithm outperforms the normalized cut algorithm considerably. In the future, we plan to extend this work to other applications such as image segmentation.\n\nReferences\n\nBach, F. R., & Jordan, M. I. (2004). Learning spectral clustering. Advances in Neural Information Processing Systems 16.\n\nBanerjee, A., Dhillon, I., Ghosh, J., & Sra, S. (2003). Generative model-based clustering of directional data. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003).\n\nBlei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. 
Learn. Res., 3, 993-1022.\n\nChung, F. (1997). Spectral graph theory. Amer. Math. Society.\n\nDing, C., He, X., Zha, H., Gu, M., & Simon, H. (2001). A min-max cut algorithm for graph partitioning and data clustering. Proc. IEEE Int'l Conf. Data Mining.\n\nHagen, L., & Kahng, A. (1991). Fast spectral methods for ratio cut partitioning and clustering. Proceedings of the IEEE International Conference on Computer-Aided Design (pp. 10-13).\n\nHartigan, J., & Wong, M. (1994). A k-means clustering algorithm. Appl. Statist., 28, 100-108.\n\nHofmann, T. (1999). Probabilistic latent semantic indexing. Proceedings of the 22nd Annual ACM Conference on Research and Development in Information Retrieval (pp. 50-57). Berkeley, California.\n\nJin, R., & Si, L. (2004). A Bayesian approach toward active learning for collaborative filtering. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (pp. 278-285). Banff, Canada: AUAI Press.\n\nNg, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14.\n\nRedner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195-239.\n\nSalakhutdinov, R., & Roweis, S. T. (2003). Adaptive overrelaxed bound optimization methods. Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003) (pp. 664-671).\n\nShi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888-905.\n\nYu, S. X., & Shi, J. (2003). Multiclass spectral clustering. Proceedings of the Ninth IEEE International Conference on Computer Vision. Nice, France.\n\nZha, H., He, X., Ding, C., Gu, M., & Simon, H. (2002). Spectral relaxation for k-means clustering. 
Advances in Neural Information Processing Systems 14.\n\n\f\n", "award": [], "sourceid": 2952, "authors": [{"given_name": "Rong", "family_name": "Jin", "institution": null}, {"given_name": "Feng", "family_name": "Kang", "institution": null}, {"given_name": "Chris", "family_name": "Ding", "institution": null}]}