{"title": "Iterative Double Clustering for Unsupervised and Semi-Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1025, "page_last": 1032, "abstract": null, "full_text": "Iterative Double Clustering for Unsupervised and Semi-Supervised Learning\n\nRan El-Yaniv\n\nOren Souroujon\n\nComputer Science Department\n\nTechnion - Israel Institute of Technology\n\n(rani,orenso)@cs.technion.ac.il\n\nAbstract\n\nWe present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples.\n\n1 Introduction\n\nData clustering is a fundamental and challenging routine in information processing and pattern recognition. Informally, when we cluster a set of elements we attempt to partition it into subsets such that points in the same subset are more \"similar\" to each other than to points in other subsets. Typical clustering algorithms depend on a choice of a similarity measure between data points [6], and a \"correct\" clustering result depends on an appropriate choice of a similarity measure. 
However, the choice of a \"correct\" measure is an ill-defined task without a particular application at hand. For instance, consider a hypothetical data set containing articles by each of two authors such that half of the articles authored by each author discusses one topic, and the other half discusses another topic. There are two possible dichotomies of the data which could yield two different bi-partitions: one according to topic, and another according to writing style. When asked to cluster this set into two sub-clusters, one cannot successfully achieve the task without knowing the goal: are we interested in clusters that reflect writing style or semantics? Therefore, without a suitable target at hand and a principled method for choosing a similarity measure suitable for the target, it can be meaningless to interpret clustering results.\n\nThe information bottleneck (IB) method of Tishby, Pereira and Bialek [8] is a recent framework that can sometimes provide an elegant solution to this problematic \"metric selection\" aspect of data clustering (see Section 2). The original IB method generates soft clustering assignments for the data. In [10], Slonim and Tishby developed a simplified \"hard\" variant of IB clustering, where there is a hard assignment of points to their clusters. Employing this hard IB clustering, the same authors introduced an effective two-stage clustering procedure called Double Clustering (DC) [12]. An experimental study of DC on text categorization tasks [12] showed a consistent advantage of DC over other clustering methods. A striking finding in [12] is that DC sometimes even attained results close to those of supervised learning.1\n\nIn this paper we present a powerful extension of the DC procedure which we term Iterative Double Clustering (IDC). 
IDC performs iterations of DC, and whenever the first DC iteration succeeds in extracting a meaningful structure from the data, a number of the next consecutive iterations can continually improve the clustering quality. This continual improvement achieved by IDC is due to the generation of progressively less noisy data representations, which reduce variance. Using synthetically generated data, we study some properties of IDC. Not only can IDC dramatically outperform DC whenever the data is noisy; our experiments also indicate that IDC attains impressive results on text categorization tasks. In particular, we show that our unsupervised IDC procedure is competitive with an SVM (and Naive Bayes) trained over a small sized training set. We also propose a natural extension of IDC for semi-supervised transductive learning. Our preliminary empirical results indicate that our transductive IDC can yield effective text categorization.\n\n2 Information Bottleneck and Double Clustering\n\nWe consider a data set X of elements, each of which is a d-dimensional vector over a set F of features. We focus on the case where feature values are non-negative real numbers. For every element x = (f_1, ..., f_d) ∈ X we consider the empirical conditional distribution {p(f_i|x)} of features given x, where p(f_i|x) = f_i / Σ_{i=1}^d f_i. For instance, X can be a set of documents, each of which is represented as a vector of word-features where f_i is the frequency of the ith word (in some fixed word enumeration). Thus, we represent each element as a distribution over its features, and are interested in a partition of the data based on these feature conditional distributions. Given a predetermined number of clusters, a straightforward approach to cluster the data using the above \"distributional representation\" would be to choose some (dis)similarity measure for distributions (e.g. 
based on some Lp norm or some statistical measure such as the KL-divergence) and employ some \"plug-in\" clustering algorithm based on this measure (e.g. agglomerative algorithms). Perhaps due to feature noise, this simplistic approach can result in mediocre results (see e.g. [12]).\n\nSuppose that our data is given via observations of a random variable S. In the information bottleneck (IB) method of Tishby et al. [8] we attempt to extract the essence of the data S using co-occurrence observations of S together with a target variable T. The goal is to extract a compressed representation ~S of S with minimum compromise of information content with respect to T. This way, T can direct us to extract a meaningful clustering from S, where the meaning is determined by the target T. Let I(S; T) = Σ_{s∈S, t∈T} p(s,t) log [p(s,t) / (p(s)p(t))], the mutual information between S and T [3].\n\n1 Specifically, the DC method obtained in some cases accuracy close to that obtained by a naive Bayes classifier trained over a small sized sample [12].\n\nThe IB method attempts to compute p(~s|s), a \"soft\" assignment of a data point s to clusters ~s, so as to minimize I(S; ~S) − βI(~S; T), given the Markov condition T → S → ~S (i.e., T and ~S are conditionally independent given S). Here, β is a Lagrange multiplier that controls a constraint on I(~S; T) and thus the tradeoff between the desired compression level and the predictive power of ~S with respect to T. As shown in [8], this minimization yields a system of coupled equations for the clustering mapping p(~s|s) in terms of the cluster representations p(t|~s) and the cluster weights p(~s). The paper [8] also presents an algorithm similar to deterministic annealing [9] for recovering a solution for the coupled equations.\n\nSlonim and Tishby [10] proposed a simplified IB approach for the computation of \"hard\" cluster assignments. 
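Both the soft and hard IB variants are driven by the mutual information quantity I(S; T) defined above. As a minimal illustrative sketch (assuming NumPy; the function name is ours), it can be computed directly from an empirical joint distribution:

```python
import numpy as np

def mutual_information(p_st):
    """I(S;T) = sum_{s,t} p(s,t) log [p(s,t) / (p(s)p(t))], in nats."""
    p_st = np.asarray(p_st, dtype=float)
    p_st = p_st / p_st.sum()                 # normalize the empirical joint
    p_s = p_st.sum(axis=1, keepdims=True)    # marginal p(s)
    p_t = p_st.sum(axis=0, keepdims=True)    # marginal p(t)
    prod = p_s * p_t                         # independence baseline p(s)p(t)
    mask = p_st > 0                          # 0 log 0 = 0 convention
    return float((p_st[mask] * np.log(p_st[mask] / prod[mask])).sum())
```

For an independent joint the value is 0; for two perfectly coupled equiprobable symbols it is log 2.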
In this hard IB variant, each data point s, represented by {p(t|s)}_t, is associated with one centroid ~s. They also devised a greedy agglomerative clustering algorithm that starts with the trivial clustering, where each data point s is a single cluster; then, at each step, the algorithm merges the two clusters that minimize the loss of mutual information I(~S; T). The reduction in I(~S; T) due to a merge of two clusters ~s_i and ~s_j is shown to be\n\n(p(~s_i) + p(~s_j)) D_JS[p(t|~s_i), p(t|~s_j)],   (1)\n\nwhere, for any two distributions p(x) and q(x), with priors π_p and π_q, π_p + π_q = 1, D_JS[p(x), q(x)] is the Jensen-Shannon divergence (see [7, 4]),\n\nD_JS[p(x), q(x)] = π_p D_KL(p || (p+q)/2) + π_q D_KL(q || (p+q)/2).\n\nHere, (p+q)/2 denotes the distribution (p(x) + q(x))/2 and D_KL(·||·) is the Kullback-Leibler divergence [3]. This agglomerative algorithm is of course only locally optimal, since at each step it greedily merges the two most similar clusters. Another disadvantage of this algorithm is its time complexity of O(n^2) for a data set of n elements (see [12] for details).\n\nThe IB method can be viewed as a meta-clustering procedure that, given observations of the variables S and T (via their empirical co-occurrence samples p(s,t)), attempts to cluster s-elements represented as distributions over t-elements. Using the merging cost of Equation (1), one can approximate IB clustering based on other \"plug-in\" vectorial clustering routines applied within the simplex containing the s-elements' distributional representations.\n\nDC [12] is a two-stage procedure: during the first stage we IB-cluster features represented as distributions over elements, thus generating feature clusters. During the second stage we IB-cluster elements represented as distributions over the feature clusters (a more formal description follows). 
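The merge criterion of Equation (1) can be rendered in a few lines. This is an illustrative sketch following the formula as printed above (with the unweighted average (p+q)/2 as the reference distribution); taking the priors to be the clusters' normalized weights is our assumption:

```python
import numpy as np

def kl(p, q):
    """D_KL(p||q) in nats, with the 0 log 0 = 0 convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def js(p, q, pi_p, pi_q):
    """D_JS as printed above: priors weight the two KL terms; the
    reference distribution is the unweighted average (p+q)/2."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2
    return pi_p * kl(p, m) + pi_q * kl(q, m)

def merge_cost(w_i, w_j, p_i, p_j):
    """Information loss (1) for merging clusters i, j with weights
    w_i = p(~s_i), w_j = p(~s_j); priors = normalized weights (assumed)."""
    pi_i, pi_j = w_i / (w_i + w_j), w_j / (w_i + w_j)
    return (w_i + w_j) * js(p_i, p_j, pi_i, pi_j)
```

Merging two identical conditional distributions costs nothing; merging two equal-weight clusters with disjoint supports costs the total weight times log 2.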
For instance, considering a document clustering domain, in the first stage we cluster words as distributions over documents to obtain word clusters. Then in the second stage we cluster documents as distributions over word clusters, to obtain document clusters.\n\nIntuitively, the first stage in DC generates coarser pseudo-features (i.e. feature centroids), which can reduce noise and sparseness that might be exhibited in the original feature values. Then, in the second stage, elements are clustered as distributions over the \"distilled\" pseudo-features, and therefore can form more accurate element clusters. As reported in [12], this two-stage DC procedure outperforms various other clustering approaches, as well as DC variants applied with dissimilarity measures (such as the variational distance) different from the optimal JS-divergence of Equation (1). It is most striking that in some cases, the accuracy achieved by DC was close to that achieved by a supervised Naive Bayes classifier.\n\n3 Iterative Double Clustering (IDC)\n\nDenote by IB_N(T|S) the clustering result, into N clusters, of the IB hard clustering procedure when the data is S and the target variable is T (see Section 2). For instance, if T represents documents and S represents words, the application of IB_N(T = documents | S = words) will cluster the words, represented as distributions over the documents, into N clusters. Using the notation of our problem setup, with X denoting the data and F denoting the features, Figure 1 provides pseudo-code for the IDC meta-clustering algorithm, which clusters X into N~X clusters. Note that the DC procedure is simply an application of IDC with k = 1.\n\nThe code of Figure 1 requires specifying k, the number of IDC iterations to run, N~X, the number of element clusters (e.g. the desired number of document clusters), and N~F, the number of feature clusters to use during each iteration. 
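The alternation performed by IDC can be sketched in toy form. This is an illustrative stand-in under stated assumptions, not the paper's implementation: the hard-IB step is replaced here by plain k-means with deterministic farthest-point initialization (the paper uses Add-C with the JS cost of Equation (1)), and all names are ours:

```python
import numpy as np

def kmeans_rows(P, k, iters=100):
    """Plain k-means over the rows of P, farthest-point seeded;
    a simple stand-in for the hard-IB clustering step."""
    C = [P[0]]
    for _ in range(k - 1):                       # deterministic seeding
        d = np.min([((P - c) ** 2).sum(1) for c in C], axis=0)
        C.append(P[int(d.argmax())])
    C = np.array(C)
    for _ in range(iters):
        lab = ((P[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (lab == j).any():
                C[j] = P[lab == j].mean(0)
    return ((P[:, None] - C[None]) ** 2).sum(-1).argmin(1)

def idc(counts, n_x, n_f, k=5, eps=1e-12):
    """Toy Iterative Double Clustering over an |X| x |F| count matrix."""
    counts = np.asarray(counts, float)
    x_lab = None                                 # current element clustering
    for _ in range(k):
        # stage 1: cluster features, represented as distributions over the
        # raw elements (first round) or the current element clusters (later)
        T = counts if x_lab is None else np.stack(
            [counts[x_lab == c].sum(0) for c in range(n_x)])
        F = (T.T + eps) / (T.T + eps).sum(1, keepdims=True)
        f_lab = kmeans_rows(F, n_f)
        # collapse columns into feature-cluster pseudo-counts
        B = np.stack([counts[:, f_lab == j].sum(1) for j in range(n_f)], 1)
        # stage 2: cluster elements as distributions over feature clusters
        X = (B + eps) / (B + eps).sum(1, keepdims=True)
        x_lab = kmeans_rows(X, n_x)
    return x_lab
```

On a small two-block word-document count matrix this sketch recovers the document partition after a few rounds.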
In the experiments reported below we always assumed that we know the correct N~X. Our experiments show that the algorithm is not too sensitive to an overestimate of N~F. Note that the choice of these parameters is the usual model order selection problem. Perhaps the first question to ask regarding k (the number of iterations) is whether or not IDC converges to a steady state (e.g. where two consecutive iterations generate identical partitions). Unfortunately, a theoretical understanding of this convergence issue is left open in this paper. In most of our experiments IDC converged after a small number of iterations. In all the experiments reported below we used a fixed k = 15.\n\nInput: X (input data), N~X (number of element clusters), N~F (number of feature clusters to use), k (number of iterations)\nInitialize: S ← F, T ← X\nloop {k times}\n  N ← N~F\n  ~F ← IB_N(T|S)\n  N ← N~X, S ← X, T ← ~F\n  ~X ← IB_N(T|S)\n  S ← F, T ← ~X\nend loop\nOutput ~X\n\nFigure 1: Pseudo-code for IDC\n\nThe \"hard\" IB-clustering originally presented by [12] uses an agglomerative procedure as its underlying clustering algorithm (see Section 2). The \"soft\" IB [8] applies deterministic annealing clustering [9] as its underlying procedure. As already discussed, the IB method can be viewed as meta-clustering which can employ many vectorial clustering routines. We implemented IDC using several routines, including agglomerative clustering and deterministic annealing. Since both these algorithms are computationally intensive, we also implemented IDC using a simple fast algorithm called Add-C proposed by Guedalia et al. [5]. Add-C is an online greedy clustering algorithm with linear running time and can be viewed as a simple online approximation of k-means. For this reason, all the results reported below were computed using Add-C (whose description is omitted for lack of space; see [5] for details). For obtaining a better approximation to the IB method we of course used the JS-divergence of (1) as our cost measure.\n\nFollowing [12] we chose to evaluate the performance of IDC with respect to a labeled data set. Specifically, we count the number of classification errors made by IDC as obtained from labeled data.\n\nIn order to better understand the properties of IDC, we first examined it within a controlled setup of synthetically generated data points whose feature values were generated by d-dimensional Gaussian distributions (for d features) of the form N(μ, Σ), where Σ = σ^2 I, with σ constant. In order to simulate different sources, we assigned different μ values (from a given constant range) to each combination of source and feature. Specifically, for data simulating m classes and |F| features, |F| × m different distributions were selected. We introduced feature noise by distorting each entry with value v by adding a random sample from N(0, (α · v)^2), where α is the \"noise amplitude\" (resulting negative values were rounded to zero). In Figure 2(a), we plot the average accuracy of 10 runs of IDC. As can be seen, at low noise amplitudes IDC attains perfect accuracy. When the noise amplitude increases, both IDC and DC deteriorate, but the multiple rounds of IDC can better resist the extra noise. After observing the large accuracy gain between DC and IDC at a specific interval of noise amplitude within the feature noise setup, we set the noise amplitude to values in that interval and examined the behavior of the IDC run in more detail. Figure 2(b) shows a typical trace of the accuracy obtained at each of the 20 iterations of an IDC run over noisy data. 
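The synthetic generator described above (one Gaussian mean per class/feature pair, spherical covariance, then proportional feature noise of amplitude α with negatives rounded to zero) can be sketched as follows; the function name and defaults are ours:

```python
import numpy as np

def synth_data(m, d, n_per_class, sigma=1.0, mean_range=(0.0, 10.0),
               alpha=0.0, seed=0):
    """Draw one Gaussian mean per (class, feature) pair, sample points with
    covariance sigma^2 * I, then distort each entry v with proportional
    noise N(0, (alpha*v)^2), rounding negative results to zero."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(*mean_range, size=(m, d))      # m*d distinct means
    X = np.concatenate([rng.normal(mu[c], sigma, size=(n_per_class, d))
                        for c in range(m)])
    y = np.repeat(np.arange(m), n_per_class)        # class labels
    if alpha > 0:                                   # proportional feature noise
        X = X + rng.normal(0.0, np.abs(alpha * X))
    return np.clip(X, 0.0, None), y
```

For example, `synth_data(4, 500, 50, alpha=1.0)` mimics the kind of noisy 4-class sample used in the synthetic experiments.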
This learning curve shows a quick improvement in accuracy during the first few rounds, and then reaches a plateau.\n\nFollowing [12] we used the 20 Newsgroups (NG20) [1] data set to evaluate IDC on real, labeled data. We chose several subsets of NG20 with various degrees of difficulty. In the first set of experiments we used the following four newsgroups (denoted NG4), two of which deal with sports subjects: 'rec.sport.baseball', 'rec.sport.hockey', 'alt.atheism' and 'sci.med'. In these experiments we tested some basic properties of IDC. In all the experiments reported in this section we performed the following preprocessing: we lowered the case of all letters, filtered out low-frequency words which appeared up to (and including) 3 times in the entire set, and filtered out numerical and non-alphabetical characters. Of course, we also stripped off newsgroup headers, which contain the class labels.\n\nIn Figure 2(c) we display accuracy vs. the number of feature clusters (N~F). The accuracy deteriorates when N~F is too small, and we see a slight negative trend as it increases. We performed an additional experiment which tested performance using very large numbers of feature clusters. These results indicate that after a plateau in the range of 10-20 there is a minor negative trend in the accuracy level. Thus, with respect to this data set, the IDC algorithm is not too sensitive to an overestimation of the number N~F of feature clusters.\n\nOther experiments over the NG4 data set confirmed the results of [12] that the JS-divergence dissimilarity measure of Equation (1) outperforms other measures, such as the variational distance (L1 norm), the KL-divergence, the square-Euclidean distance and the 'cosine' distance. 
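The preprocessing steps just described (lowercasing, keeping purely alphabetic tokens, and dropping words with total corpus frequency of 3 or less) might look like the following sketch; the function name and tokenizing regex are our assumptions:

```python
import re
from collections import Counter

def preprocess(docs, min_count=4):
    """Lowercase, keep purely alphabetic tokens, and drop any word whose
    total corpus frequency is below min_count (i.e. appears <= 3 times)."""
    token = re.compile(r"[a-z]+")
    tokenized = [token.findall(doc.lower()) for doc in docs]
    totals = Counter(w for doc in tokenized for w in doc)
    vocab = {w for w, c in totals.items() if c >= min_count}
    return [[w for w in doc if w in vocab] for doc in tokenized]
```

Header stripping (to remove the class labels) would be an additional, corpus-specific step.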
Details of all these experiments will be presented in the full version of the paper.\n\nIn the next set of experiments we tested IDC's performance on the same newsgroup subsets used in [12]. Table 1(a) compares the accuracy achieved by DC to the last (15th) round of IDC with respect to all data sets described in [12]. Results of DC were taken from [12], where DC is implemented using the agglomerative routine.\n\nTable 1(b) displays a preliminary comparison of IDC with the results of a Naive Bayes (NB) classifier (reported in [11]) and a support vector machine (SVM). In each of the 5 experiments the supervised classifiers were trained using 25 documents per class and tested on 475 documents per class. The input for the unsupervised IDC was 500 unlabeled documents per class. As can be seen, in this setting IDC outperforms both the naive Bayes learner and the SVM.\n\n4 Learning from Labeled and Unlabeled Examples\n\nIn this section, we present a natural extension of IDC for semi-supervised transductive learning that can utilize both labeled and unlabeled data. In transductive learning, the testing is done on the unlabeled examples in the training data, while in semi-supervised\n\nNewsgroup    DC    IDC-15\nBinary_1     0.70  0.85\nBinary_2     0.68  0.83\nBinary_3     0.75  0.80\nMulti5_1     0.59  0.86\nMulti5_2     0.58  0.88\nMulti5_3     0.53  0.86\nMulti10_1    0.35  0.56\nMulti10_2    0.35  0.49\nMulti10_3    0.35  0.55\nAverage      0.54  0.74\n\nData Set      NB    SVM   IDC-15  IDC-1\nCOMP (5)      0.50  0.50  0.51    0.34\nSCIENCE (4)   0.73  0.79  0.68    0.44\nPOLITICS (3)  0.67  0.78  0.76    0.42\nRELIGION (3)  0.55  0.60  0.78    0.38\nSPORT (2)     0.75  0.89  0.78    0.76\nAverage       0.64  0.71  0.70    0.47\n\nTable 1: Left: Accuracy of DC vs. IDC on most of the data sets described in [12]. DC results are taken from [12]; Right: Accuracy of Naive Bayes (NB) and SVM classifiers vs. IDC on some of the data sets described in [11]. 
The IDC-15 column shows the final accuracy achieved at iteration 15 of IDC; the IDC-1 column shows first-iteration accuracy. The NB results are taken from [11]. The SVM results were produced using the LibSVM package [2] with its default parameters. In all cases the SVM was trained and tested using the same training/test set sizes as described in [11] (25 documents per newsgroup for training and 475 for testing; the number of unlabeled documents fed to IDC was 500 per newsgroup). The number of newsgroups in each hyper-category is specified in parentheses (e.g. COMP contains 5 newsgroups).\n\ninductive learning it is done on previously unseen data. Here we only deal with the transductive case. In the full version of the paper we will present a semi-supervised inductive learning version of IDC.\n\nFor motivating the transductive IDC, consider a data set X that has emerged from a statistical mixture which includes several sources (classes). Let C be a random variable indicating the class of a random point. During the first iteration of a standard IDC we cluster the features F so as to preserve I(F; X). Typically, X contains predictive information about the classes C. In cases where I(X; C) is sufficiently large, we expect that the feature clusters ~F will preserve some information about C as well. Having available some labeled data points, we may attempt to generate feature clusters ~F which preserve more information about class labels. This leads to the following straightforward idea. During the first IB-stage of the first IDC iteration, we cluster the features F as distributions over class labels (given by the labeled data). This phase results in feature clusters ~F. Then we continue as usual; that is, in the second IB-phase of the first IDC iteration we cluster X, represented as distributions over ~F. 
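The modified first stage just described, representing each feature as a distribution over class labels estimated from the labeled examples only, can be sketched as follows (the function name and the smoothing constant are our assumptions):

```python
import numpy as np

def feature_class_distributions(counts, labels, n_classes, eps=1e-12):
    """Represent each feature as a distribution over class labels, using
    only the labeled rows of an |X| x |F| count matrix. The resulting
    |F| x C rows are what the first IB-stage would then cluster."""
    counts = np.asarray(counts, float)
    labels = np.asarray(labels)
    per_class = np.stack([counts[labels == c].sum(0)    # per-class totals
                          for c in range(n_classes)], axis=1)
    per_class = per_class + eps                          # smooth empty rows
    return per_class / per_class.sum(axis=1, keepdims=True)
```

A feature occurring only in one class's labeled documents maps to a near-degenerate distribution on that class, so clustering these rows groups features by their class affinity.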
Subsequent IDC iterations use all the unlabeled data.\n\nIn Figure 2(d) we show the accuracy obtained by DC and IDC in categorizing 5 newsgroups as a function of the training (labeled) set size. For instance, we see that when the algorithm has 10 documents available from each class, it can categorize the entire unlabeled set, containing 90 unlabeled documents in each of the classes, with accuracy of about 80%. The benchmark accuracy of IDC with no labeled examples is about 73%.\n\nIn Figure 2(e) we see the accuracy obtained by DC and transductive IDC trained with a constant set of 50 labeled documents, on different unlabeled (test) sample sizes. The graph shows that the accuracy of DC significantly degrades, while IDC manages to sustain an almost constant high accuracy.\n\n5 Concluding Remarks\n\nOur contribution is threefold. First, we present a natural extension of the successful double clustering algorithm of [12]. Empirical evidence indicates that our new iterative DC algorithm has distinct advantages over DC, especially in noisy settings. Second, we applied the unsupervised IDC to text categorization problems, which are typically dealt with by supervised learning algorithms. Our results indicate that it is possible to achieve performance competitive with supervised classifiers that were trained over small samples. Finally, we present a natural extension of IDC that allows for transductive learning. Our preliminary empirical evaluation of this scheme over text categorization appears promising.\n\nA number of interesting questions are left for future research. First, it would be of interest to gain a better theoretical understanding of several issues: the generalization properties of DC and IDC, the convergence of IDC to a steady state, and precise conditions on attribute noise settings within which IDC is advantageous. 
Second, it would be important to test the empirical performance of IDC with respect to different problem domains. Finally, we believe it would be of great interest to better understand and characterize the performance of transductive IDC in settings having both labeled and unlabeled data.\n\nAcknowledgements\n\nWe thank Naftali Tishby and Noam Slonim for helpful discussions and for providing us with the detailed descriptions of the NG20 data sets used in their experiments. We also thank Ron Meir, Yiftach Ravid and the anonymous referees for their constructive comments. This research was supported by the Israeli Ministry of Science.\n\nReferences\n\n[1] 20 Newsgroups data set. http://www.ai.mit.edu/~jrennie/20Newsgroups/.\n[2] LibSVM. http://www.csie.ntu.edu.tw/~cjlin/libsvm.\n[3] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 1991.\n[4] R. El-Yaniv, S. Fine, and N. Tishby. Agnostic classification of Markovian sequences. In NIPS97, 1997.\n[5] I.D. Guedalia, M. London, and M. Werman. A method for on-line clustering of non-stationary data. Neural Computation, 11:521-540, 1999.\n[6] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice-Hall, New Jersey, 1988.\n[7] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, 1991.\n[8] N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In 37th Allerton Conference on Communication and Computation, 1999.\n[9] K. Rose. Deterministic annealing for clustering, compression, classification, regression and related optimization problems. Proceedings of the IEEE, 86(11):2210-2238, 1998.\n[10] N. Slonim and N. Tishby. Agglomerative information bottleneck. In NIPS99, 1999.\n[11] N. Slonim and N. Tishby. The power of word clustering for text classification. 
To appear in the European Colloquium on IR Research (ECIR), 2001.\n[12] N. Slonim and N. Tishby. Document clustering using word clusters via the information bottleneck method. In ACM SIGIR 2000, 2000.\n\nFigure 2: (a) Average accuracy over 10 trials for different amplitudes of proportional feature noise. Data set: a synthetically generated sample of 200 500-dimensional elements in 4 classes. (b) A trace of a single IDC run. The x-axis is the number of IDC iterations and the y-axis is the accuracy achieved in each iteration. Data set: synthetically generated sample of 500 400-dimensional elements in 5 classes; noise: proportional feature noise with α = 1.0. (c) Average accuracy (10 trials) for different numbers of feature clusters. Data set: NG4. (d) Average accuracy of (10 trials of) transductive categorization of 5 newsgroups. 
Sample size: 80 documents per class; the x-axis is training set size. The upper curve shows transductive IDC-15 and the lower curve transductive IDC-1. (e) Average accuracy of (10 trials of) transductive categorization of 5 newsgroups. Sample size: constant training set size of 50 documents from each class. The x-axis counts the number of unlabeled samples to be categorized. The upper curve is transductive IDC-15 and the lower curve transductive IDC-1. Each error bar (in all graphs) specifies one standard deviation.", "award": [], "sourceid": 1979, "authors": [{"given_name": "Ran", "family_name": "El-Yaniv", "institution": null}, {"given_name": "Oren", "family_name": "Souroujon", "institution": null}]}