{"title": "Semi-supervised Learning by Entropy Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 529, "page_last": 536, "abstract": null, "full_text": " Semi-supervised Learning\n by Entropy Minimization\n\n\n\n Yves Grandvalet Yoshua Bengio\n Heudiasyc, CNRS/UTC Dept. IRO, Universite de Montreal\n 60205 Compi`egne cedex, France Montreal, Qc, H3C 3J7, Canada\n grandval@utc.fr bengioy@iro.umontreal.ca\n\n\n Abstract\n\n\n We consider the semi-supervised learning problem, where a decision rule\n is to be learned from labeled and unlabeled data. In this framework, we\n motivate minimum entropy regularization, which enables to incorporate\n unlabeled data in the standard supervised learning. Our approach in-\n cludes other approaches to the semi-supervised problem as particular or\n limiting cases. A series of experiments illustrates that the proposed solu-\n tion benefits from unlabeled data. The method challenges mixture mod-\n els when the data are sampled from the distribution class spanned by the\n generative model. The performances are definitely in favor of minimum\n entropy regularization when generative models are misspecified, and the\n weighting of unlabeled data provides robustness to the violation of the\n \"cluster assumption\". Finally, we also illustrate that the method can also\n be far superior to manifold learning in high dimension spaces.\n\n\n\n1 Introduction\n\nIn the classical supervised learning classification framework, a decision rule is to be learned\nfrom a learning set Ln = {xi, yi}ni , where each example is described by a pattern\n =1 xi X\nand by the supervisor's response yi = {1, . . . , K }. We consider semi-supervised\nlearning, where the supervisor's responses are limited to a subset of Ln.\n\nIn the terminology used here, semi-supervised learning refers to learning a decision rule on\nX from labeled and unlabeled data. However, the related problem of transductive learning,\ni.e. 
of predicting labels on a set of predefined patterns, is addressed as a side issue. Semi-supervised\nproblems occur in many applications where labeling is performed by human experts. They\nhave been receiving much attention during the last few years, but some important issues are\nunresolved [10].\n\n This work was supported in part by the IST Programme of the European Community, under the\nPASCAL Network of Excellence IST-2002-506778. This publication only reflects the authors' views.\n\nIn the probabilistic framework, semi-supervised learning can be modeled as a missing data\nproblem, which can be addressed by generative models such as mixture models, thanks to the\nEM algorithm and extensions thereof [6]. Generative models apply to the joint density of\npatterns and class (X, Y). They have appealing features, but they also have major drawbacks.\nTheir estimation is much more demanding than that of discriminative models, since the model\nof P(X, Y) is exhaustive, hence necessarily more complex than the model of P(Y|X). More\nparameters are to be estimated, resulting in more uncertainty in the estimation process. The\ngenerative model, being more precise, is also more likely to be misspecified. Finally, the\nfitness measure is not discriminative, so that better models are not necessarily better\npredictors of class labels. These difficulties have led to proposals aiming at processing\nunlabeled data in the framework of supervised classification [1, 5, 11]. Here, we propose an\nestimation principle applicable to any probabilistic classifier, aiming at making the most of\nunlabeled data when they are beneficial, while controlling their contribution so as to make\nthe learning scheme robust.\n\n2 Derivation of the Criterion\n\n2.1 Likelihood\n\nWe first recall how the semi-supervised learning problem fits into standard supervised\nlearning by using the maximum (conditional) likelihood estimation principle. 
The learning\nset is denoted Ln = {xi, zi}, i = 1, . . . , n, where zi ∈ {0, 1}^K denotes the dummy variable\nrepresenting the actually available labels (while y represents the precise and complete class\ninformation): if xi is labeled k, then zik = 1 and ziℓ = 0 for ℓ ≠ k; if xi is unlabeled,\nthen ziℓ = 1 for ℓ = 1, . . . , K.\n\nWe assume that labeling is missing at random, that is, for all unlabeled examples,\nP(z|x, k) = P(z|x, ℓ) for any (k, ℓ) pair, which implies\n\n P(k|x, z) = zk P(k|x) / Σℓ=1..K zℓ P(ℓ|x) . (1)\n\nAssuming independent examples, the conditional log-likelihood of (Z|X) on the observed\nsample is then\n\n L(θ; Ln) = Σi=1..n [ log( Σk=1..K zik fk(xi; θ) ) + h(zi) ] , (2)\n\nwhere h(z), which does not depend on P(X, Y), is only affected by the missingness mechanism,\nand fk(x; θ) is the model of P(k|x) parameterized by θ.\n\nThis criterion is a concave function of the fk(xi; θ), and for simple models such as the ones\nprovided by logistic regression, it is also concave in θ, so that the global solution can be\nobtained by numerical optimization. Maximizing (2) corresponds to maximizing the complete\nlikelihood if no assumption whatsoever is made on P(X) [6].\n\nProvided the fk(xi; θ) sum to one, the likelihood is not affected by unlabeled data: unlabeled\ndata convey no information. In the maximum a posteriori (MAP) framework, Seeger remarks\nthat unlabeled data are useless regarding discrimination when the priors on P(X) and P(Y|X)\nfactorize [10]: observing x does not inform about y, unless the modeler assumes so.\nBenefiting from unlabeled data requires assumptions of some sort on the relationship between\nX and Y. In the Bayesian framework, this will be encoded by a prior distribution. 
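The bookkeeping above can be made concrete with a minimal NumPy sketch (ours, not the authors'; the function name and toy values are illustrative) of the dummy-label encoding and of the conditional log-likelihood (2), dropping the h(zi) term:

```python
import numpy as np

def log_likelihood(f, z):
    """Conditional log-likelihood (2), up to the h(z_i) term.

    f : (n, K) array of model estimates f_k(x_i; theta) of P(k | x_i)
    z : (n, K) 0/1 array; a one-hot row for a labeled example,
        an all-ones row for an unlabeled one
    """
    # For an unlabeled row, sum_k z_ik f_k(x_i) = 1, so unlabeled
    # examples contribute nothing: the likelihood ignores them.
    return np.sum(np.log(np.sum(z * f, axis=1)))

# Toy check: K = 2, one labeled and one unlabeled example.
f = np.array([[0.9, 0.1],   # model output on a labeled point
              [0.6, 0.4]])  # model output on an unlabeled point
z = np.array([[1, 0],       # labeled as class 1
              [1, 1]])      # unlabeled
print(log_likelihood(f, z)) # only the labeled row matters: log 0.9
```

This makes explicit the remark that follows: since the fk sum to one, the unlabeled rows add log 1 = 0 to the criterion.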
As there is no such thing as a universally relevant prior, we should look for an induction bias\nthat exploits unlabeled data when the latter are known to convey information.\n\n\n2.2 When Are Unlabeled Examples Informative?\n\nTheory provides little support for the numerous experimental results [5, 7, 8] showing that\nunlabeled examples can help the learning process. Learning theory is mostly developed at the\ntwo extremes of the statistical paradigm: in parametric statistics, where examples are known\nto be generated from a known class of distributions, and in the distribution-free Structural\nRisk Minimization (SRM) or Probably Approximately Correct (PAC) frameworks.\nSemi-supervised learning, in the terminology used here, does not fit the distribution-free\nframeworks: no positive statement can be made without distributional assumptions, since for\nsome distributions P(X, Y) unlabeled data are non-informative while supervised learning is an\neasy task. In this regard, generalizing from labeled and unlabeled data may differ from\ntransductive inference.\n\nIn parametric statistics, theory has shown the benefit of unlabeled examples, either for\nspecific distributions [9], or for mixtures of the form P(x) = p P(x|1) + (1 - p) P(x|2),\nwhere the estimation problem is essentially reduced to that of estimating the mixture\nparameter p [4]. These studies conclude that the (asymptotic) information content of\nunlabeled examples decreases as classes overlap.1 Thus, the assumption that classes are well\nseparated is sensible if we expect to take advantage of unlabeled examples.\n\nThe conditional entropy H(Y|X) is a measure of class overlap, which is invariant to the\nparameterization of the model. This measure is related to the usefulness of unlabeled data\nwhere labeling is indeed ambiguous. 
Hence, we will measure the conditional entropy of\nclass labels conditioned on the observed variables\n\n H(Y|X, Z) = -E_XYZ[log P(Y|X, Z)] , (3)\n\nwhere E_XYZ denotes the expectation with respect to (X, Y, Z).\n\nIn the Bayesian framework, assumptions are encoded by means of a prior on the model\nparameters. Stating that we expect a low conditional entropy does not uniquely define the\nform of the prior distribution, but the latter can be derived by resorting to the maximum\nentropy principle.2 Let (θ, ψ) denote the model parameters of P(X, Y, Z); the maximum\nentropy prior verifying E[H(Y|X, Z)] = c, where the constant c quantifies how small the\nentropy should be on average, takes the form\n\n P(θ, ψ) ∝ exp(-λ H(Y|X, Z)) , (4)\n\nwhere λ is the positive Lagrange multiplier corresponding to the constant c.\n\nComputing H(Y|X, Z) requires a model of P(X, Y, Z), whereas the choice of the diagnosis\nparadigm is motivated by the possibility of limiting modeling to conditional probabilities.\nWe circumvent the need for additional modeling by applying the plug-in principle, which\nconsists in replacing the expectation with respect to (X, Z) by the sample average. This\nsubstitution, which can be interpreted as \"modeling\" P(X, Z) by its empirical distribution,\nyields\n\n Hemp(Y|X, Z; Ln) = -(1/n) Σi=1..n Σk=1..K P(k|xi, zi) log P(k|xi, zi) . (5)\n\nThis empirical functional is plugged into (4) to define an empirical prior on parameters θ,\nthat is, a prior whose form is partly defined from data [2].\n\n\n2.3 Entropy Regularization\n\nRecalling that fk(x; θ) denotes the model of P(k|x), the model gk of P(k|x, z) from (1) is\ndefined as follows:\n\n gk(x, z; θ) = zk fk(x; θ) / Σℓ=1..K zℓ fℓ(x; θ) .\n\nFor labeled data, gk(x, z; θ) = zk, and for unlabeled data, gk(x, z; θ) = fk(x; θ).\nFrom now on, we drop the reference to the parameter θ in fk and gk to lighten notation. 
The\nMAP estimate is the maximizer of the posterior distribution, that is, the maximizer of\n\n C(θ, λ; Ln) = L(θ; Ln) - λ Hemp(Y|X, Z; Ln)\n = Σi=1..n log( Σk=1..K zik fk(xi) ) + λ Σi=1..n Σk=1..K gk(xi, zi) log gk(xi, zi) , (6)\n\nwhere the constant terms in the log-likelihood (2) and log-prior (4) have been dropped.\nWhile L(θ; Ln) is only sensitive to labeled data, Hemp(Y|X, Z; Ln) is only affected by the\nvalue of fk(x) on unlabeled data.\n\nNote that the approximation Hemp (5) of H (3) breaks down for wiggly functions fk(·) with\nabrupt changes between data points (where P(X) is bounded from below). As a result, it is\nimportant to constrain fk(·) in order to enforce the closeness of the two functionals. In the\nfollowing experimental section, we impose a smoothness constraint on fk(·) by adding to the\ncriterion C (6) a penalizer with its corresponding Lagrange multiplier.\n\n 1This statement, given explicitly by [9], is also formalized, though not stressed, by [4],\nwhere the Fisher information for unlabeled examples at the estimate p̂ is clearly a measure of\nthe overlap between class conditional densities:\n Iu(p̂) = ∫ (P(x|1) - P(x|2))^2 / (p̂ P(x|1) + (1 - p̂) P(x|2)) dx .\n 2Here, maximum entropy refers to the construction principle which enables distributions to\nbe derived from constraints, not to the content of priors regarding entropy.\n\n3 Related Work\n\nSelf-Training Self-training [7] is an iterative process, where a learner imputes the labels\nof examples which have been classified with confidence in the previous step. Amini et al.\n[1] analyzed this technique and showed that it is equivalent to a version of the\nclassification EM algorithm, which minimizes the likelihood deprived of the entropy of the\npartition. 
In\nthe context of conditional likelihood with labeled and unlabeled examples, the criterion is\n\n Σi=1..n [ log( Σk=1..K zik fk(xi) ) + Σk=1..K gk(xi) log gk(xi) ] ,\n\nwhich is recognized as an instance of the criterion (6) with λ = 1.\n\nSelf-confident logistic regression [5] is another algorithm optimizing the criterion for\nλ = 1. Using smaller values of λ is expected to have two benefits: first, the influence of\nunlabeled examples can be controlled, in the spirit of the EM-λ algorithm [8], and second,\nslowly increasing λ defines a scheme similar to deterministic annealing, which should help\nthe optimization process to avoid poor local minima of the criterion.\n\nMinimum entropy methods Minimum entropy regularizers have been used in other contexts\nto encode learnability priors (e.g. [3]). In a sense, Hemp can be seen as a poor man's way\nto generalize this approach to continuous input spaces. This empirical functional was also\nused by Zhu et al. [13, Section 6] as a criterion to learn weight function parameters in the\ncontext of transduction on manifolds.\n\nInput-Dependent Regularization Our criterion differs from input-dependent regularization\n[10, 11] in that it is expressed only in terms of P(Y|X, Z) and does not involve P(X).\nHowever, we stress that for unlabeled data, the regularizer agrees with the complete\nlikelihood provided P(X) is small near the decision surface. Indeed, whereas a generative\nmodel would maximize log P(X) on the unlabeled data, our criterion minimizes the conditional\nentropy on the same points. In addition, when the model is regularized (e.g. with weight\ndecay), the conditional entropy is prevented from being too small close to the decision\nsurface. This will favor putting the decision surface in a low density area.\n\n4 Experiments\n\n4.1 Artificial Data\n\nIn this section, we chose a simple experimental setup in order to avoid artifacts stemming\nfrom optimization problems. 
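For the binary case, criterion (6) admits a compact gradient-ascent sketch. The following is our illustration, not the authors' code: the function names and hyper-parameter values are arbitrary, and the smoothness penalizer (weight decay) used in the experiments below is omitted for brevity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_min_entropy(Xl, yl, Xu, lam=0.5, lr=0.1, steps=500):
    """Gradient ascent on criterion (6) for binary logistic regression:
    C = sum_labeled log f_y + lam * sum_unlabeled sum_k f_k log f_k.
    (No weight decay here, unlike the paper's experiments.)"""
    n = len(yl) + len(Xu)
    w, b = np.zeros(Xl.shape[1]), 0.0
    for _ in range(steps):
        pl = sigmoid(Xl @ w + b)          # P(class 1 | x) on labeled points
        gw = Xl.T @ (yl - pl)             # gradient of the log-likelihood
        gb = np.sum(yl - pl)
        au = Xu @ w + b                   # logits on unlabeled points
        pu = sigmoid(au)
        cu = lam * au * pu * (1.0 - pu)   # d(sum_k f_k log f_k) / d logit
        gw += Xu.T @ cu                   # entropy term rewards confident outputs
        gb += np.sum(cu)
        w += lr * gw / n
        b += lr * gb / n
    return w, b

# Toy run: two well-separated 2-d clusters, few labels, many unlabeled points.
rng = np.random.default_rng(0)
Xl = np.vstack([rng.normal(-1, 1, (5, 2)), rng.normal(1, 1, (5, 2))])
yl = np.array([0] * 5 + [1] * 5)
Xu = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
w, b = fit_min_entropy(Xl, yl, Xu)
```

Note that at w = 0 the entropy gradient vanishes (all logits are zero), so the labeled examples alone determine the initial direction; the unlabeled term then sharpens the decision boundary.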
Our goal is to check to what extent supervised learning can be improved by unlabeled\nexamples, and whether minimum entropy can compete with generative models, which are usually\nadvocated in this framework.\n\nThe minimum entropy regularizer is applied to the logistic regression model. It is compared\nto logistic regression fitted by maximum likelihood (ignoring unlabeled data) and logistic\nregression with all labels known. The former shows what has been gained by handling\nunlabeled data, and the latter provides the \"crystal ball\" performance obtained by guessing\ncorrectly all labels. All hyper-parameters (weight decay for all logistic regression models,\nplus the parameter λ (6) for minimum entropy) are tuned by ten-fold cross-validation.\n\nMinimum entropy logistic regression is also compared to the classic EM algorithm for\nGaussian mixture models (two means and one common covariance matrix estimated by maximum\nlikelihood on labeled and unlabeled examples, see e.g. [6]). Bad local maxima of the\nlikelihood function are avoided by initializing EM with the parameters of the true\ndistribution when the latter is a Gaussian mixture, or with maximum likelihood parameters on\nthe (fully labeled) test sample when the distribution departs from the model. This\ninitialization favors EM, since it is guaranteed to pick, among all local maxima of the\nlikelihood, the one which is in the basin of attraction of the optimal value. Furthermore,\nthis initialization prevents interferences that may result from the \"pseudo-labels\" given to\nunlabeled examples at the first E-step. In particular, \"label switching\" (i.e. badly labeled\nclusters) is avoided at this stage.\n\n\nCorrect joint density model In the first series of experiments, we consider two-class\nproblems in a 50-dimensional input space. Each class is generated with equal probability\nfrom a normal distribution. Class 1 is normal with mean (a, a, . . . , a) and unit covariance\nmatrix. 
Class 2 is normal with mean -(a, a, . . . , a) and unit covariance matrix. The parameter a\ntunes the Bayes error, which varies from 1 % to 20 % (1 %, 2.5 %, 5 %, 10 %, 20 %).\nThe learning sets comprise nl labeled examples (nl = 50, 100, 200) and nu unlabeled\nexamples (nu/nl = 1, 3, 10, 30, 100). Overall, 75 different setups are evaluated, and for\neach one, 10 different training samples are generated. Generalization performances are\nestimated on a test set of size 10 000.\n\nThis benchmark provides a comparison for the algorithms in a situation where unlabeled data\nare known to convey information. Besides the favorable initialization of the EM algorithm\nto the optimal parameters, EM benefits from the correctness of the model: data were\ngenerated according to the model, that is, two Gaussian subpopulations with identical\ncovariances. The logistic regression model is only compatible with the joint distribution,\nwhich is a weaker fulfillment than correctness.\n\nAs there is no modeling bias, differences in error rates are only due to differences in\nestimation efficiency. The overall error rates (averaged over all settings) are in favor of\nminimum entropy logistic regression (14.1 ± 0.3 %). EM (15.6 ± 0.3 %) does worse on\naverage than logistic regression (14.9 ± 0.3 %). For reference, the average Bayes error\nrate is 7.7 % and logistic regression reaches 10.4 ± 0.1 % when all examples are labeled.\n\nFigure 1 provides more informative summaries than these raw numbers. The plots represent\nthe error rates (averaged over nl) versus the Bayes error rate and the nu/nl ratio. The\nfirst plot shows that, as asymptotic theory suggests [4, 9], unlabeled examples are mostly\ninformative when the Bayes error is low. This observation validates the relevance of the\nminimum entropy assumption. This graph also illustrates the consequence of the demanding\nparametrization of generative models. 
Mixture models are outperformed by the simple\nlogistic regression model when the sample size is low, since their number of parameters\ngrows quadratically (vs. linearly) with the number of input features.\n\nThe second plot shows that the minimum entropy model quickly takes advantage of unlabeled\ndata when classes are well separated. With nu = 3 nl, the model considerably improves upon\nthe one discarding unlabeled data. At this stage, the generative models do not perform\nwell, as the number of available examples is low compared to the number of parameters in\nthe model. However, for very large sample sizes, with 100 times more unlabeled examples\nthan labeled examples, the generative approach eventually becomes more accurate than the\ndiagnosis approach.\n\nFigure 1: Left: test error vs. Bayes error rate for nu/nl = 10; right: test error vs. nu/nl\nratio for 5 % Bayes error (a = 0.23). Test errors of minimum entropy logistic regression (o)\nand mixture models (+). The errors of logistic regression (dashed), and logistic regression\nwith all labels known (dash-dotted) are shown for reference.\n\n\nMisspecified joint density model In a second series of experiments, the setup is slightly\nmodified by letting the class-conditional densities be corrupted by outliers. For each\nclass, the examples are generated from a mixture of two Gaussians centered on the same\nmean: a unit variance component gathers 98 % of examples, while the remaining 2 % are\ngenerated from a large variance component, where each variable has a standard deviation of\n10. The mixture model used by EM is slightly misspecified since it is a simple Gaussian\nmixture. The results, displayed in the left-hand side of Figure 2, should be compared with\nthe right-hand side of Figure 1. 
The generative model dramatically suffers from the misspecification\nand behaves worse than logistic regression for all sample sizes. The unlabeled examples\nfirst have a beneficial effect on test error, then have a detrimental effect when they\noverwhelm the number of labeled examples. On the other hand, the diagnosis models behave\nsmoothly as in the previous case, and the minimum entropy criterion performance improves.\n\nFigure 2: Test error vs. nu/nl ratio for a = 0.23. Average test errors for minimum entropy\nlogistic regression (o) and mixture models (+). The test error rates of logistic regression\n(dotted), and logistic regression with all labels known (dash-dotted) are shown for\nreference. Left: experiment with outliers; right: experiment with uninformative unlabeled\ndata.\n\nThe last series of experiments illustrates the robustness with respect to the cluster\nassumption, by testing it on distributions where unlabeled examples are not informative, and\nwhere a low density P(X) does not indicate a boundary region. The data is drawn from two\nGaussian clusters as in the first series of experiments, but the label is now independent of\nthe clustering: an example x belongs to class 1 if x2 > x1 and belongs to class 2 otherwise;\nthe Bayes decision boundary now separates each cluster in its middle. The mixture model is\nunchanged. It is now far from the model used to generate data. 
The right-hand-side plot\nof Figure 2 shows that the favorable initialization of EM does not prevent the model from\nbeing fooled by unlabeled data: its test error steadily increases with the amount of\nunlabeled data. On the other hand, the diagnosis models behave well, and the minimum\nentropy algorithm is not distracted by the two clusters; its performance is nearly identical\nto the one of training with labeled data only (cross-validation provides λ values close to\nzero), which can be regarded as the ultimate performance in this situation.\n\n\nComparison with manifold transduction Although our primary goal is to infer a decision\nfunction, we also provide comparisons with a transduction algorithm of the \"manifold\nfamily\". We chose the consistency method of Zhou et al. [12] for its simplicity. As\nsuggested by the authors, we set α = 0.99 and the scale parameter σ2 was optimized on test\nresults [12]. The results are reported in Table 1. The experiments are limited due to the\nmemory requirements of the consistency method in our naive MATLAB implementation.\n\n\nTable 1: Error rates (%) of minimum entropy (ME) vs. consistency method (CM), for\na = 0.23, nl = 50, and a) pure Gaussian clusters, b) Gaussian clusters corrupted by\noutliers, c) class boundary separating one Gaussian cluster.\n\n nu 50 150 500 1500\n a) ME 10.8 ± 1.5 9.8 ± 1.9 8.8 ± 2.0 8.3 ± 2.6\n a) CM 21.4 ± 7.2 25.5 ± 8.1 29.6 ± 9.0 26.8 ± 7.2\n b) ME 8.5 ± 0.9 8.3 ± 1.5 7.5 ± 1.5 6.6 ± 1.5\n b) CM 22.0 ± 6.7 25.6 ± 7.4 29.8 ± 9.7 27.7 ± 6.8\n c) ME 8.7 ± 0.8 8.3 ± 1.1 7.2 ± 1.0 7.2 ± 1.7\n c) CM 51.6 ± 7.9 50.5 ± 4.0 49.3 ± 2.6 50.2 ± 2.2\n\n\nThe results are extremely poor for the consistency method, whose error is way above minimum\nentropy, and which does not show any sign of improvement as the sample of unlabeled data\ngrows. Furthermore, when classes do not correspond to clusters, the consistency method\nperforms random class assignments. 
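For reference, the three artificial setups of Table 1 can be reproduced along the following lines. This is a sketch based on our reading of the text, not the authors' code; the function name and sampling details are ours.

```python
import numpy as np

def make_setup(kind, n, a=0.23, d=50, rng=None):
    """Sample n points from the artificial benchmarks of Table 1:
    "a": two Gaussian classes with means +/-(a, ..., a), unit covariance;
    "b": same, with 2 % outliers of standard deviation 10;
    "c": same clusters, but the label only depends on whether x2 > x1."""
    rng = rng or np.random.default_rng(0)
    c = rng.integers(0, 2, n)                          # cluster index, equiprobable
    mu = np.where(c[:, None] == 1, a, -a) * np.ones(d)
    X = mu + rng.normal(size=(n, d))
    if kind == "b":
        out = rng.random(n) < 0.02                     # 2 % large-variance component
        X[out] = mu[out] + 10.0 * rng.normal(size=(int(out.sum()), d))
    # In setup "c" the label is independent of the clustering.
    y = (X[:, 1] > X[:, 0]).astype(int) if kind == "c" else c
    return X, y

X, y = make_setup("c", 1500)
```

In setup "c", the Bayes boundary x2 = x1 cuts each cluster through its middle, so a low-density region of P(X) carries no information about the class boundary.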
In fact, our setup, which was designed for\nthe comparison of global classifiers, is extremely unfavorable to manifold methods, since\nthe data is truly 50-dimensional. In this situation, local methods suffer from the \"curse\nof dimensionality\", and many more unlabeled examples would be required to get sensible\nresults. Hence, these results mainly illustrate that manifold learning is not the best\nchoice in semi-supervised learning for truly high-dimensional data.\n\n4.2 Facial Expression Recognition\n\nWe now consider an image recognition problem, consisting in recognizing seven (balanced)\nclasses corresponding to the universal emotions (anger, fear, disgust, joy, sadness,\nsurprise and neutral). The patterns are gray level images of frontal faces, with\nstandardized positions. The data set comprises 375 such pictures made of 140 × 100 pixels.\n\nWe tested kernelized logistic regression (Gaussian kernel), its minimum entropy version,\nnearest neighbor and the consistency method. We repeatedly (10 times) sampled 1/10 of the\ndataset for providing the labeled part, and the remainder for testing. Although (λ, σ2)\nwere chosen to minimize the test error, the consistency method performed poorly, with\n63.8 ± 1.3 % test error (compared to 86 % error for random assignments). Nearest neighbor\ngets similar results with 63.1 ± 1.3 % test error, and kernelized logistic regression\n(ignoring unlabeled examples) improves to reach 53.6 ± 1.3 %. Minimum entropy kernelized\nlogistic regression achieves 52.0 ± 1.9 % error (compared to about 20 % error for humans on\nthis database). The scale parameter chosen for kernelized logistic regression (by ten-fold\ncross-validation) amounts to using a global classifier. Again, the local methods fail.\nThis may be explained by the fact that the database contains several pictures of each\nperson, with different facial expressions. 
Hence, local methods are likely to pick up the same\nidentity instead of the same expression, while global methods are able to learn the relevant\ndirections.\n\n5 Discussion\n\nWe propose to tackle the semi-supervised learning problem in the supervised learning\nframework by using the minimum entropy regularizer. This regularizer is motivated by\ntheory, which shows that unlabeled examples are mostly beneficial when classes have small\noverlap. The MAP framework provides a means to control the weight of unlabeled examples,\nand thus to depart from optimism when unlabeled data tend to harm classification.\n\nOur proposal encompasses self-learning as a particular case, as minimizing entropy\nincreases the confidence of the classifier output. It also approaches the solution of\ntransductive large margin classifiers in another limiting case, as minimizing entropy is a\nmeans to drive the decision boundary away from learning examples.\n\nThe minimum entropy regularizer can be applied to both local and global classifiers. As a\nresult, it can improve over manifold learning when the dimensionality of data is effectively\nhigh, that is, when data do not lie on a low-dimensional manifold. Also, our experiments\nsuggest that minimum entropy regularization may be a serious contender to generative\nmodels. It compares favorably to these mixture models in three situations: for small\nsample sizes, where the generative model cannot completely benefit from the knowledge of\nthe correct joint model; when the joint distribution is (even slightly) misspecified; and\nwhen the unlabeled examples turn out to be non-informative regarding class probabilities.\n\nReferences\n\n [1] M. R. Amini and P. Gallinari. Semi-supervised logistic regression. In 15th European\n Conference on Artificial Intelligence, pages 390-394. IOS Press, 2002.\n\n [2] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, New York,\n 2nd edition, 1985.\n\n [3] M. Brand. 
Structure learning in conditional probability models via an entropic prior and\n parameter extinction. Neural Computation, 11(5):1155-1182, 1999.\n\n [4] V. Castelli and T. M. Cover. The relative value of labeled and unlabeled samples in\n pattern recognition with an unknown mixing parameter. IEEE Trans. on Information Theory,\n 42(6):2102-2117, 1996.\n\n [5] Y. Grandvalet. Logistic regression for partial labels. In 9th Information Processing\n and Management of Uncertainty in Knowledge-based Systems (IPMU'02), pages 1935-1941, 2002.\n\n [6] G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley,\n 1992.\n\n [7] K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training.\n In Ninth International Conference on Information and Knowledge Management, pages 86-93,\n 2000.\n\n [8] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled\n and unlabeled documents using EM. Machine Learning, 39(2/3):135-167, 2000.\n\n [9] T. J. O'Neill. Normal discrimination with unclassified observations. Journal of the\n American Statistical Association, 73(364):821-826, 1978.\n\n[10] M. Seeger. Learning with labeled and unlabeled data. Technical report, Institute for\n Adaptive and Neural Computation, University of Edinburgh, 2002.\n\n[11] M. Szummer and T. S. Jaakkola. Information regularization with partially labeled data.\n In Advances in Neural Information Processing Systems 15. MIT Press, 2003.\n\n[12] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Schölkopf. Learning with local\n and global consistency. In Advances in Neural Information Processing Systems 16, 2004.\n\n[13] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields\n and harmonic functions. In 20th Int. Conf. 
on Machine Learning, pages 912-919, 2003.\n", "award": [], "sourceid": 2740, "authors": [{"given_name": "Yves", "family_name": "Grandvalet", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}]}