{"title": "The Value of Labeled and Unlabeled Examples when the Model is Imperfect", "book": "Advances in Neural Information Processing Systems", "page_first": 1361, "page_last": 1368, "abstract": null, "full_text": "The Value of Labeled and Unlabeled Examples when\n\nthe Model is Imperfect\n\nKaushik Sinha\n\nOhio State University\nColumbus, OH 43210\n\nMikahil Belkin\n\nOhio State University\nColumbus, OH 43210\n\nDept. of Computer Science and Engineering\n\nDept. of Computer Science and Engineering\n\nsinhak@cse.ohio-state.edu\n\nmbelkin@cse.ohio-state.edu\n\nAbstract\n\nSemi-supervised learning, i.e. learning from both labeled and unlabeled data has\nreceived signi(cid:2)cant attention in the machine learning literature in recent years.\nStill our understanding of the theoretical foundations of the usefulness of unla-\nbeled data remains somewhat limited. The simplest and the best understood sit-\nuation is when the data is described by an identi(cid:2)able mixture model, and where\neach class comes from a pure component. This natural setup and its implications\nware analyzed in [11, 5]. One important result was that in certain regimes, labeled\ndata becomes exponentially more valuable than unlabeled data.\nHowever, in most realistic situations, one would not expect that the data comes\nfrom a parametric mixture distribution with identi(cid:2)able components. There have\nbeen recent efforts to analyze the non-parametric situation, for example, (cid:147)cluster(cid:148)\nand (cid:147)manifold(cid:148) assumptions have been suggested as a basis for analysis. Still,\na satisfactory and fairly complete theoretical understanding of the nonparametric\nproblem, similar to that in [11, 5] has not yet been developed.\nIn this paper we investigate an intermediate situation, when the data comes from a\nprobability distribution, which can be modeled, but not perfectly, by an identi(cid:2)able\nmixture distribution. 
This seems applicable to many situations, for example when a mixture of Gaussians is used to model the data. The contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.

1 Introduction

In recent years semi-supervised learning, i.e. learning from labeled and unlabeled data, has drawn significant attention. The ubiquity and easy availability of unlabeled data, together with the increased computational power of modern computers, make the paradigm attractive in various applications, while connections to natural learning also make it conceptually intriguing. See [15] for a survey of semi-supervised learning.
From the theoretical point of view, semi-supervised learning is simple to describe. Suppose the data is sampled from the joint distribution p(x, y), where x is a feature and y is the label. The unlabeled data comes from the marginal distribution p(x). Thus the usefulness of unlabeled data is tied to how much information about the joint distribution can be extracted from the marginal distribution. Therefore, in order to make unlabeled data useful, an assumption on the connection between these distributions needs to be made.
In the non-parametric setting several such assumptions have recently been proposed, including the cluster assumption and its refinement, the low-density separation assumption [7, 6], and the manifold assumption [3]. These assumptions relate the shape of the marginal probability distribution to class labels. The low-density separation assumption states that the class boundary passes through the low-density regions, while the manifold assumption proposes that the proximity of points should be measured along the data manifold.
However, while these assumptions have motivated several algorithms and have been shown to hold empirically, few theoretical results on the value of unlabeled data in the non-parametric setting are available so far. We note the work of Balcan and Blum ([2]), which attempts to unify several frameworks by introducing a notion of compatibility between labeled and unlabeled data. In a slightly different setting, some theoretical results are also available for co-training ([4, 8]).
Far more complete results are available in the parametric setting. There one assumes that the distribution p(x, y) is a mixture of two parametric distributions p1 and p2, each corresponding to a different class. Such a mixture is called identifiable if the parameters of each component can be uniquely determined from the marginal distribution p(x). The study of the usefulness of unlabeled data under this assumption was undertaken by Castelli and Cover ([5]) and Ratsaby and Venkatesh ([11]). Among several important conclusions of their study was the fact that, under a certain range of conditions, labeled data is exponentially more important than unlabeled data for approximating the Bayes optimal classifier. Roughly speaking, unlabeled data may be used to identify the parameters of each mixture component, after which the class attribution can be established exponentially fast using only a few labeled examples.
While explicit mixture modeling is of great theoretical and practical importance, in many applications there is no reason to believe that the model provides a precise description of the phenomenon. Often it is more reasonable to think that our models provide a rough approximation to the underlying probability distribution, but do not necessarily represent it exactly.
In this paper we investigate the limits of usefulness of unlabeled data as a function of how far the best fitting model strays from the underlying probability distribution.
The rest of the paper is structured as follows: we start with an overview of the results available for identifiable mixture models, together with some extensions of these results. We then describe how the relative value of labeled and unlabeled data changes when the true distribution is a perturbation of a parametric model. Finally, we discuss various regimes of usability for labeled and unlabeled data and summarize our findings in Fig. 1.

2 Relative Value of Labeled and Unlabeled Examples
Our analysis is conducted in the standard classification framework and studies the behavior of P_error − P_Bayes, where P_error is the probability of misclassification for a given classifier and P_Bayes is the classification error of the optimal classifier. The quantity P_error − P_Bayes is often referred to as the excess probability of error and expresses how far our classifier is from the best possible.
In what follows, we review some theoretical results that describe the behavior of the excess error probability as a function of the number of labeled and unlabeled examples. We denote the number of labeled examples by l and the number of unlabeled examples by u. We omit certain minor technical details to simplify the exposition. The classifier for which P_error is computed is based on the underlying model.
Theorem 2.1. (Ratsaby and Venkatesh [11]) In a two-class identifiable mixture model, let the equiprobable class densities p1(x), p2(x) be d-dimensional Gaussians with unit covariance matrices.
Then for sufficiently small ε > 0 and arbitrary δ > 0, given l = O(log(1/δ)) labeled and u = O((d²/(ε³δ))(d log(1/ε) + log(1/δ))) unlabeled examples respectively, with confidence at least 1 − δ, the probability of error satisfies P_error ≤ P_Bayes(1 + cε) for some positive constant c.
Since the mixture is identifiable, its parameters can be estimated from unlabeled examples alone; labeled examples are not required for this purpose. Therefore, unlabeled examples are used to estimate the mixture and hence the two decision regions. Once the decision regions are established, labeled examples are used to label them. An equivalent form of the above result in terms of labeled and unlabeled examples is P_error − P_Bayes = O(d/u^{1/3}) + O(exp(−l)). For a fixed dimension d, this indicates that labeled examples are exponentially more valuable than unlabeled examples in reducing the excess probability of error; however, when d is not fixed, higher dimensions slow these rates.
Independently, Cover and Castelli provided similar results in a different setting, under a Bayesian framework.
Theorem 2.2. (Cover and Castelli [5]) In a two-class mixture model, let p1(x), p2(x) be the parametric class densities and let h(η) be the prior over the unknown mixing parameter η. Then

P_error − P_Bayes = O(1/u) + exp{−Dl + o(l)}

where D = −log{ 2√(η(1 − η)) ∫ √(p1(x) p2(x)) dx }.
In their framework, Cover and Castelli [5] assumed that the parameters of the individual class densities are known, but that the associated class labels and the mixing parameter are unknown. Under this assumption, their result shows that the above rate is obtained when l^{3+ε} u^{−1} → 0 as l + u → ∞. In particular this implies that, if u·e^{−Dl} →
0 and l = o(u), the excess error is essentially determined by the number of unlabeled examples. On the other hand, if u grows faster than e^{Dl}, then the excess error is determined by the number of labeled examples. For a detailed explanation of these statements see p. 2103 of [5]. The effect of dimensionality is not captured in their result.
Both results indicate that if the parametric model assumptions are satisfied, labeled examples are exponentially more valuable than unlabeled examples in reducing the excess probability of error.
In this paper we investigate the situation when the parametric model assumptions are only satisfied to a certain degree of precision, which seems to be a natural premise in a variety of practical settings. It is interesting to note that uncertainty can appear for different reasons. One source of uncertainty is a lack of examples, which we call Type-A. Imperfection of the model is another source of uncertainty, which we refer to as Type-B.

• Type-A uncertainty, for a perfect model with imperfect information: the individual class densities follow the assumed parametric model; uncertainty results from the finiteness of examples. The perturbation size specifies how well the parameters of the individual class densities can be estimated from finite data.
• Type-B uncertainty, for an imperfect model: the individual class densities do not follow the assumed parametric model. The perturbation size specifies how well the best fitting model can approximate the underlying density.

Before proceeding further, we describe our model and notation. We take the instance space X ⊂ R^d with labels {−1, 1}. The true class densities are always denoted p1(x) and p2(x) respectively. In the case of Type-A uncertainty they are simply p1(x|θ1) and p2(x|θ2). In the case of Type-B uncertainty, p1(x), p2(x) are perturbations of two d-dimensional densities from a parametric family F.
We denote the mixing parameter by t and the individual parametric class densities by f1(x|θ1), f2(x|θ2) respectively, so the resulting mixture density is t·f1(x|θ1) + (1 − t)·f2(x|θ2). We show some specific results when F consists of spherical Gaussian distributions with unit covariance matrix and t = 1/2. In this case θ1, θ2 ∈ R^d are the means of the corresponding densities and the mixture density is indexed by a 2d-dimensional vector θ = [θ1, θ2]. The class of such mixtures is identifiable, and hence, using unlabeled examples alone, θ can be estimated by some θ̂ ∈ R^{2d}. By ||·|| we denote the standard Euclidean norm in R^d and by ||·||_{d/2,2} the Sobolev norm. Note that for some ε > 0, ||·||_{d/2,2} < ε implies ||·||_∞ < ε and ||·||_1 < ε. We will frequently use the term

L(a, t, e) = log(a/δ) / [(t − Ae)(1 − 2√((P_Bayes + Be)(1 − P_Bayes − Be)))]

to denote the optimal number of labeled examples for correctly classifying the estimated decision regions with high probability (as will become clear in the next section); here t is the mixing parameter, e is the perturbation size, a is an integer variable, and A, B are constants.

2.1 Type-A Uncertainty: Perfect Model, Imperfect Information
Due to the finiteness of the unlabeled sample, the density parameters cannot be estimated arbitrarily close to the true parameters in the Euclidean norm. Clearly, how closely they can be estimated depends on the number of unlabeled examples u, the dimension d, and the confidence probability δ. Thus Type-A uncertainty inherently gives rise to a perturbation size ε1(u, d, δ); a fixed u defines a perturbation size ε1(d, δ).
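As a numerical illustration (our own, not part of the paper), the term L(a, t, e) defined above can be evaluated in code; the constants A, B are unspecified in the text, so the values below are placeholder assumptions.

```python
import math

def labeled_examples_needed(a, t, e, p_bayes, delta, A=1.0, B=1.0):
    # Sketch of L(a, t, e) = log(a / delta) /
    # [(t - A*e) * (1 - 2*sqrt((P_Bayes + B*e) * (1 - P_Bayes - B*e)))].
    # A and B are unknown constants; 1.0 is a placeholder for illustration.
    denom = (t - A * e) * (
        1 - 2 * math.sqrt((p_bayes + B * e) * (1 - p_bayes - B * e)))
    return math.log(a / delta) / denom

# A larger perturbation size e shrinks the denominator, so more labeled
# examples are needed to label the estimated decision regions correctly.
l_small = labeled_examples_needed(24, 0.5, 0.01, p_bayes=0.1, delta=0.05)
l_large = labeled_examples_needed(24, 0.5, 0.05, p_bayes=0.1, delta=0.05)
assert l_large > l_small > 0
```

With these placeholder constants the requirement grows as e approaches t/A, matching the intuition that a heavily perturbed model needs many more labels.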
Because of this perturbation, the estimated decision regions differ from the true decision regions. From [11] it is clear that only very few labeled examples suffice to label these two estimated decision regions reasonably well with high probability. Let such a number of labeled examples be l*. But what happens if the number of labeled examples available is greater than l*? Since the individual densities follow the parametric model exactly, these extra labeled examples can be used to estimate the density parameters and hence the decision regions. Using a simple union bound it can be shown ([10]) that the asymptotic rate of convergence of such an estimation procedure is O(√((d/l) log(d/δ))). Thus, provided we have u unlabeled examples, if we represent the rate at which the excess probability of error reduces as a function of the number of labeled examples, it is clear that the error initially reduces exponentially fast in the number of labeled examples (following [11]), but then reduces only at a rate O(√((d/l) log(d/δ))). Provided we use the following strategy, this extends the result of [11] as given in the theorem below.
We adopt the following strategy to utilize labeled and unlabeled examples in order to learn a classification rule.
Strategy 1:
1. Given u unlabeled examples and a confidence probability δ > 0, use the maximum likelihood estimation method to learn the parameters of the mixture model, such that the estimates θ̂1, θ̂2 are only ε1(u, d, δ) = O*(d/u^{1/3}) away from the actual parameters with probability at least 1 − δ/4.
2. Use l* labeled examples to label the estimated decision regions, with probability of incorrect labeling no greater than δ/4.
3.
If l > l* examples are available, use them to estimate the individual density parameters with probability at least 1 − δ/2.

Theorem 2.3. Let the model be a mixture of two equiprobable d-dimensional spherical Gaussians p1(x|θ1), p2(x|θ2) with unit covariance matrices and means θ1, θ2 ∈ R^d. For any 1 > δ > 0, if Strategy 1 is used with u unlabeled examples, then there exist a perturbation size ε1(u, d, δ) > 0 and positive constants A, B such that, using l ≤ l* = L(24, 0.5, ε1) labeled examples, P_error − P_Bayes reduces exponentially fast in the number of labeled examples with probability at least (1 − δ/2). If more labeled examples l > l* are provided, then with probability at least (1 − δ/2), P_error − P_Bayes asymptotically converges to zero at a rate O(√((d/l) log(d/δ))) as l → ∞. If we denote the reduction rate of this excess error (P_error − P_Bayes) as a function of the number of labeled examples by Ree(l), then this can be compactly represented as

Ree(l) = O(exp(−l)) if l ≤ l*,  and  Ree(l) = O(√((d/l) log(d/δ))) if l > l*.

After using l* labeled examples, P_error = P_Bayes + O(ε1).
2.2 Type-B Uncertainty: Imperfect Model
In this section we address the main question raised in this paper. Here the individual class densities do not follow the assumed parametric model exactly, but are a perturbed version of the assumed model. The uncertainty in this case is specified by the perturbation size ε2, which roughly indicates the extent to which the true class densities differ from those of the best fitting parametric model.
For any mixing parameter t ∈ (0, 1), consider a two-class mixture model with individual class densities p1(x), p2(x) respectively. Suppose the best knowledge available about this mixture model is that the individual class densities approximately follow some parametric form from a class F. We assume that the best approximations of p1, p2 within F are f1(x|θ1), f2(x|θ2) respectively, such that for i ∈ {1, 2}, (fi − pi) lies in the Sobolev class H^{d/2}, and there exists a perturbation size ε2 > 0 such that ||p1 − f1||_{d/2,2} ≤ ε2 and ||p2 − f2||_{d/2,2} ≤ ε2. Here the Sobolev norm is used as a smoothness condition and implies that the true densities are smooth and not “too different” from the best fitting parametric model densities; in particular, if ||fi − pi||_{d/2,2} ≤ ε2 then ||fi − pi||_∞ ≤ ε2 and ||fi − pi||_1 ≤ ε2.
We first show that, due to the presence of this perturbation, even complete knowledge of the best fitting model parameters does not help in learning the optimal classification rule, in the following sense. In the absence of any perturbation, complete knowledge of the model parameters implies that the decision boundary, and hence the two decision regions, are explicitly known but not their labels. Thus, using only a very small number of labeled examples, P_error reduces exponentially fast to P_Bayes as the number of labeled examples increases. However, in the presence of a perturbation of size ε2, P_error reduces exponentially fast in the number of labeled examples only up to P_Bayes + O(ε2). Beyond this point the parametric model assumptions no longer hold, and some nonparametric technique must be used to estimate the actual decision boundary.
For any such nonparametric technique, P_error now reduces at a much slower rate. This trend is roughly what the following theorem says. Here f1, f2 are general parametric densities, not necessarily Gaussians. In what follows we assume that p1, p2 ∈ C^∞, and hence the convergence rate for nonparametric classification (see [14]) is O(1/√l); a slower rate results if the infinite differentiability condition is not satisfied.
Theorem 2.4. In a two-class mixture model with individual class densities p1(x), p2(x) and mixing parameter t ∈ (0, 1), let the mixture density of the best fitting parametric model be t·f1(x|θ1) + (1 − t)·f2(x|θ2), where f1, f2 belong to some parametric class F and the true densities p1, p2 are perturbed versions of f1, f2. For a perturbation size ε2 > 0, if ||f1 − p1||_{d/2,2} ≤ ε2, ||f2 − p2||_{d/2,2} ≤ ε2 and θ1, θ2 are known, then for any 0 < δ < 1 there exist positive constants A, B such that for l ≤ l* = L(6, t, ε2) labeled examples, P_error − P_Bayes reduces exponentially fast in the number of labeled examples with probability at least (1 − δ). If more labeled examples l > l* are provided, P_error − P_Bayes asymptotically converges to zero at a rate O(1/√l) as l → ∞.
After using l* labeled examples, P_error = P_Bayes + O(ε2). Thus, by the above theorem, as labeled examples are added the excess error initially reduces at a very fast rate (exponential in the number of labeled examples) until P_error − P_Bayes = O(ε2); after that the excess error reduces only polynomially fast in the number of labeled examples. In proving the above theorem we used a first-order Taylor series approximation to get a crude upper bound on the movement of the decision boundary. However, for a specific class of parametric densities such a crude approximation may not be necessary.
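To make the error floor of Theorem 2.4 concrete, here is a toy one-dimensional Monte Carlo sketch (our own illustration, not from the paper): the best fitting model is an equiprobable mixture of N(−1, 1) and N(+1, 1), while the true class means are both shifted by a perturbation of size 0.3, so classifying with the best fitting model's boundary leaves an error floor above the Bayes error.

```python
import random

random.seed(0)
eps = 0.3        # toy perturbation: both true class means shifted by eps
n = 200_000
errors_model = errors_bayes = 0
for _ in range(n):
    cls2 = random.random() < 0.5             # equiprobable classes
    mean = (1.0 if cls2 else -1.0) + eps     # true (perturbed) class mean
    x = random.gauss(mean, 1.0)
    errors_model += (x > 0.0) != cls2        # boundary of best fitting model
    errors_bayes += (x > eps) != cls2        # true Bayes boundary
# The model boundary's error stays strictly above the Bayes error,
# no matter how many labeled examples are used to label the regions.
assert errors_model > errors_bayes
```

With eps = 0.3 the gap is on the order of a percentage point, and it shrinks with eps, matching the P_Bayes + O(ε2) floor; closing the remaining gap requires estimating the true boundary nonparametrically.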
In particular, as we show next, if the best fitting model is a mixture of spherical Gaussians, where the boundary is a linear hyperplane, an explicit upper bound on the boundary movement can be found. In the following, we assume the class F to be a class of d-dimensional spherical Gaussians with identity covariance matrices. However, the true model is an equiprobable mixture of perturbed versions of these individual class densities. As before, given u unlabeled examples and l labeled examples, we want a strategy to learn a classification rule, and we analyze the effect of these examples, and of the perturbation size ε2, in reducing the excess probability of error.
One option is to use the unlabeled examples to estimate the true mixture density (1/2)p1 + (1/2)p2; however, the number of unlabeled examples required to estimate the mixture density using nonparametric kernel density estimation is exponential in the number of dimensions [10]. Thus, for high-dimensional data this is not an attractive option, and such an estimate also does not provide any clue as to where the decision boundary is. A better option is to use the unlabeled examples to estimate the best fitting Gaussians within F. The number of unlabeled examples needed to estimate such a mixture of Gaussians is only polynomial in the number of dimensions [10], and it is easy to show that the distance between the Bayes decision function and the decision function due to the Gaussian approximation is at most ε2 in the ||·||_{d/2,2} norm sense.
Now suppose we use the following strategy for the labeled and unlabeled examples.
Strategy 2:
1. Assume the examples are distributed according to a mixture of equiprobable Gaussians with unit covariance matrices and apply the maximum likelihood estimation method to find the best Gaussian approximation of the densities.
2.
Use a small number of labeled examples l* to label the two approximate decision regions correctly with high probability.
3. If more (l > l*) labeled examples are available, use them to learn a better decision function using some nonparametric technique.

Theorem 2.5. In a two-class mixture model with equiprobable class densities p1(x), p2(x), let the mixture density of the best fitting parametric model be (1/2)f1(x|θ1) + (1/2)f2(x|θ2), where f1, f2 are d-dimensional spherical Gaussians with means θ1, θ2 ∈ R^d, and p1, p2 are perturbed versions of f1, f2 such that, for a perturbation size ε2 > 0, ||f1 − p1||_{d/2,2} ≤ ε2 and ||f2 − p2||_{d/2,2} ≤ ε2. For any ε > 0 and 0 < δ < 1, there exist positive constants A, B such that if Strategy 2 is used with u = O((d²/(ε³δ))(d log(1/ε) + log(1/δ))) unlabeled and l* = L(12, 0.5, (ε + ε2)) labeled examples, then for l ≤ l*, P_error − P_Bayes reduces exponentially fast in the number of labeled examples with probability at least (1 − δ). If more labeled examples l > l* are provided, P_error − P_Bayes asymptotically converges to zero at most at a rate O(1/√l) as l → ∞. If we denote the reduction rate of this excess error (P_error − P_Bayes) as a function of the number of labeled examples by Ree(l), then this can be compactly represented as

Ree(l) = O(exp(−l)) if l ≤ l*,  and  Ree(l) = O(1/√l) if l > l*.

After using l* labeled examples, P_error = P_Bayes + O(ε + ε2). Note that when the number of unlabeled examples is infinite, the parameters of the best fitting model can be estimated arbitrarily well, i.e. ε →
0, and hence P_error − P_Bayes reduces exponentially fast in the number of labeled examples until P_error − P_Bayes = O(ε2). On the other hand, if ε = O(ε2), P_error − P_Bayes still reduces exponentially fast in the number of labeled examples until P_error − P_Bayes = O(ε2). This implies that an O(ε2)-close estimate of the parameters of the best fitting model is “good” enough: a more precise estimate of the parameters of the best fitting model, using more unlabeled examples, does not help reduce P_error − P_Bayes at the same exponential rate beyond P_error − P_Bayes = O(ε2). The following corollary states this important fact.
Corollary 2.6. For a perturbation size ε2 > 0, let the best fitting model for a mixture of equiprobable densities be a mixture of equiprobable d-dimensional spherical Gaussians with unit covariance matrices. If, using u* = O((d²/(ε2³δ))(d log(1/ε2) + log(1/δ))) unlabeled examples, the parameters of the best fitting model can be estimated O(ε2)-close in the Euclidean norm sense, then any additional unlabeled examples u > u* do not help in reducing the excess error.

3 Discussion on different rates of convergence
In this section we discuss the effect of the perturbation size ε2 on the behavior of P_error − P_Bayes and its effect on controlling the value of labeled and unlabeled examples. Different combinations of the number of labeled and unlabeled examples give rise to four different regions where P_error − P_Bayes behaves differently, as shown in Figure 1, where the x axis corresponds to the number of unlabeled examples and the y axis corresponds to the number of labeled examples.
Let u* be the number of unlabeled examples required to estimate the parameters of the best fitting model O(ε2)-close in the Euclidean norm sense.
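The unlabeled-sample requirement of Corollary 2.6 can be sketched numerically; the hidden constant in the O(·) is unknown, so we take it to be 1, and all values below are illustrative assumptions.

```python
import math

def unlabeled_needed(d, eps2, delta):
    # Sketch of u* = O((d**2 / (eps2**3 * delta)) *
    # (d * log(1 / eps2) + log(1 / delta))) with the hidden constant set to 1.
    return (d ** 2 / (eps2 ** 3 * delta)) * (
        d * math.log(1 / eps2) + math.log(1 / delta))

# Hiding log factors, u* scales like d**3 / eps2**3: doubling the
# dimension multiplies the requirement by roughly 2**3 = 8.
u_d10 = unlabeled_needed(d=10, eps2=0.1, delta=0.05)
u_d20 = unlabeled_needed(d=20, eps2=0.1, delta=0.05)
assert 7 < u_d20 / u_d10 < 9
```

Beyond u* the corollary says extra unlabeled examples are wasted, so in practice the budget is better spent on labeled examples once this threshold is reached.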
Using O* notation to hide the log factors, according to Theorem 2.5, u* = O*(d³/ε2³). When u > u*, unlabeled examples have no role to play in reducing P_error − P_Bayes, as shown in region II and part of region III in Figure 1. For u ≤ u*, unlabeled examples are useful only in regions I and IV. When u* unlabeled examples are available to estimate the parameters of the best fitting model O(ε2)-close, let l* be the number of labeled examples required to label the estimated decision regions so that P_error − P_Bayes = O(ε2). The figure is simply a graphical representation of the different regions in which P_error − P_Bayes reduces at different rates.

Figure 1: The big picture: behavior of P_error − P_Bayes for different numbers of labeled (l, vertical axis) and unlabeled (u, horizontal axis) examples. Region I: O(exp(−l)) + O*(d/u^{1/3}). Region II: O(exp(−l)). Region III: O*(√(d/l)) + O*(d/u^{1/3}). Region IV: O(1/√l), nonparametric methods. The thresholds on the axes are l*, l*_1 and u* = O*(d³/ε2³).

3.1 Behavior of P_error − P_Bayes in Region I
In this region, u ≤ u* unlabeled examples estimate the decision regions, and l*_u labeled examples, which depend on u, are required to correctly label these estimated regions. P_error − P_Bayes reduces at a rate O(exp(−l)) + O(d/u^{1/3}) for u < u* and l < l*_u. This rate can be interpreted as the rate at which unlabeled examples estimate the parameters of the best fitting model together with the rate at which labeled examples correctly label the estimated decision regions. However, for small u the estimate of the decision regions will be poor, and the corresponding l*_u > l*.
Instead of using such a large number of labeled examples to label poorly estimated decision regions, they can instead be used to estimate the parameters of the best fitting model, and, as will be seen next, this is precisely what happens in region III. Thus in region I, l is restricted to l < l*, and P_error − P_Bayes reduces at a rate exp(−O(l)) + O(d/u^{1/3}).
3.2 Behavior of P_error − P_Bayes in Region II
In this region l ≤ l* and u > u*. As shown in Corollary 2.6, using u* unlabeled examples the parameters of the best fitting model can be estimated O(ε2)-close in the Euclidean norm sense, and a more precise estimate of the best fitting model parameters using more unlabeled examples u > u* does not help in reducing P_error − P_Bayes. Thus unlabeled examples have no role to play in this region, and for a small number of labeled examples l ≤ l*, P_error − P_Bayes reduces at a rate O(exp(−l)).
3.3 Behavior of P_error − P_Bayes in Region III
In this region u ≤ u*, and hence the model parameters have not been estimated O(ε2)-close to the parameters of the best fitting model. Thus, in some sense, the model assumptions are still valid and there is scope for better estimation of the parameters. The number of labeled examples available in this region is greater than what is required merely to label the decision regions estimated from u unlabeled examples, and hence the excess labeled examples can be used to estimate the model parameters. Note that once the parameters have been estimated O(ε2)-close to the parameters of the best fitting model using labeled examples, the parametric model assumptions are no longer valid. If l*_1 is the number of such labeled examples, then in this region l* < l ≤ l*_1.
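The four regions of Figure 1 can be summarized by a small lookup function (a sketch of our own; the thresholds l*, l*_1 and u* are taken as abstract inputs, since their values depend on d, ε2 and δ):

```python
def regime(l, u, l_star, l_star1, u_star):
    # Map the numbers of labeled (l) and unlabeled (u) examples to the
    # four regions of Figure 1.
    if u > u_star:
        # Parameters already estimated O(eps2)-close from unlabeled data.
        return 'II' if l <= l_star else 'IV'
    if l <= l_star:
        return 'I'      # rate O(exp(-l)) + O*(d / u**(1/3))
    if l <= l_star1:
        return 'III'    # excess labeled examples refine the parameters
    return 'IV'         # nonparametric phase, rate O(1 / sqrt(l))

assert regime(10, 10**6, l_star=50, l_star1=500, u_star=10**4) == 'II'
assert regime(100, 100, l_star=50, l_star1=500, u_star=10**4) == 'III'
```

The branch structure mirrors Sections 3.1 to 3.4: region IV is reached either with abundant unlabeled data and l > l*, or with scarce unlabeled data once l exceeds l*_1.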
Also note that, depending on the number of unlabeled examples u ≤ u*, l* and l*_1 are not fixed numbers but depend on u. In the presence of labeled examples alone, by Theorem 2.3, P_error − P_Bayes reduces at a rate O*(√(d/l)). Since the parameters are estimated using both labeled and unlabeled examples, the effective rate at which P_error − P_Bayes reduces in this region can be thought of as the mean of the two.
3.4 Behavior of P_error − P_Bayes in Region IV
In this region either u > u* and l > l*, or u ≤ u* and l > l*_1. In either case, since the parameters of the best fitting model have been estimated O(ε2)-close, the parametric model assumptions are no longer valid, and the excess labeled examples must be used in a nonparametric way. For nonparametric classification, either of the two basic families of classifiers, plug-in classifiers or empirical risk minimization (ERM) classifiers, can be used [13, 9]. A nice discussion of the rates, and fast rates, of convergence of both types of classifiers can be found in [1, 12]. The general convergence rate, i.e. the rate at which the expected value of (P_error − P_Bayes) reduces, is of the order O(l^{−β}) as l → ∞, where β > 0 is some exponent, typically β ≤ 0.5. It was also shown in [14] that under general conditions this bound cannot be improved in a minimax sense. In particular, it was shown that if the true densities belong to the C^∞ class, then this rate is O(1/√l); if the infinite differentiability condition is not satisfied, the rate is much slower.

Acknowledgements. This work was supported by NSF Grant No. 0643916.

References
[1] J. Y. Audibert and A. Tsybakov. Fast convergence rate for plug-in estimators under margin conditions.
Unpublished manuscript, 2005.
[2] M.-F. Balcan and A. Blum. A PAC-style model for learning from labeled and unlabeled data. In 18th Annual Conference on Learning Theory, 2005.
[3] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56 (Invited, Special Issue on Clustering):209–239, 2004.
[4] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In 11th Annual Conference on Learning Theory, 1998.
[5] V. Castelli and T. M. Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. Information Theory, 42(6):2102–2117, 1996.
[6] O. Chapelle, J. Weston, and B. Scholkopf. Cluster kernels for semi-supervised learning. NIPS, 15, 2002.
[7] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In 10th International Workshop on Artificial Intelligence and Statistics, 2005.
[8] S. Dasgupta, M. L. Littman, and D. McAllester. PAC generalization bounds for co-training. NIPS, 14, 2001.
[9] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
[10] J. Ratsaby. The complexity of learning from a mixture of labeled and unlabeled examples. PhD thesis, 1994.
[11] J. Ratsaby and S. S. Venkatesh. Learning from a mixture of labeled and unlabeled examples with parametric side information. In 8th Annual Conference on Learning Theory, 1995.
[12] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32(1):135–166, 2004.
[13] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[14] Y. Yang. Minimax nonparametric classification. Part I: Rates of convergence; Part II: Model selection for adaptation. IEEE Trans. Inf. Theory, 45:2271–2292, 1999.
[15] X. Zhu.
Semi-supervised learning literature survey. Technical Report 1530, Department of Computer Science, University of Wisconsin–Madison, December 2006.", "award": [], "sourceid": 1003, "authors": [{"given_name": "Kaushik", "family_name": "Sinha", "institution": null}, {"given_name": "Mikhail", "family_name": "Belkin", "institution": null}]}