{"title": "Unlabeled data: Now it helps, now it doesn't", "book": "Advances in Neural Information Processing Systems", "page_first": 1513, "page_last": 1520, "abstract": "Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundancy of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing the situations in which unlabeled data can help have met with little success, and sometimes appear to conflict with each other and intuition. In this paper, we attempt to bridge the gap between practice and theory of semi-supervised learning. We develop a rigorous framework for analyzing the situations in which unlabeled data can help and quantify the improvement possible using finite sample error bounds. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.", "full_text": "Unlabeled data: Now it helps, now it doesn\u2019t\n\nAarti Singh, Robert D. Nowak\u2217\n\nXiaojin Zhu\u2020\n\nDepartment of Electrical and Computer Engineering Department of Computer Sciences\nUniversity of Wisconsin - Madison\n\nUniversity of Wisconsin - Madison\n\nMadison, WI 53706\n\nMadison, WI 53706\n\n{singh@cae,nowak@engr}.wisc.edu\n\njerryzhu@cs.wisc.edu\n\nAbstract\n\nEmpirical evidence shows that in favorable situations semi-supervised learning\n(SSL) algorithms can capitalize on the abundance of unlabeled training data to\nimprove the performance of a learning task, in the sense that fewer labeled train-\ning data are needed to achieve a target error bound. However, in other situations\nunlabeled data do not seem to help. 
Recent attempts at theoretically characterizing SSL gains provide only partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.

1 Introduction

Labeled data can be expensive, time-consuming and difficult to obtain in many applications. Semi-supervised learning (SSL) aims to capitalize on the abundance of unlabeled data to improve learning performance. Empirical evidence suggests that in certain favorable situations unlabeled data can help, while in other situations it does not. As a result, there have been several recent attempts [1, 2, 3, 4, 5, 6] at developing a theoretical understanding of semi-supervised learning. It is well-accepted that unlabeled data can help only if there exists a link between the marginal data distribution and the target function to be learnt.
Two common types of links considered are the cluster assumption [7, 3, 4], which states that the target function is locally smooth over subsets of the feature space delineated by some property of the marginal density (but may not be globally smooth), and the manifold assumption [4, 6], which assumes that the target function lies on a low-dimensional manifold. Knowledge of these sets, which can be gleaned from unlabeled data, simplifies the learning task. However, recent attempts at characterizing the amount of improvement possible under these links provide only partial and sometimes apparently conflicting (for example, [4] vs. [6]) explanations of whether or not, and to what extent, semi-supervised learning helps. In this paper, we bridge the gap between these seemingly conflicting views and develop a minimax framework based on finite sample bounds to identify situations in which unlabeled data help to improve learning. Our results quantify both the amount of improvement possible using SSL as well as the relative value of unlabeled data.

We focus on learning under a cluster assumption that is formalized in the next section, and establish that there exist nonparametric classes of distributions, denoted PXY, for which the decision sets (over which the target function is smooth) are discernable from unlabeled data.
Moreover, we show that there exist clairvoyant supervised learners that, given perfect knowledge of the decision sets denoted by D, can significantly outperform any generic supervised learner f_n in these classes. That is, if R denotes a risk of interest, n denotes the labeled data sample size, f̂_{D,n} denotes the clairvoyant supervised learner, and E denotes expectation with respect to training data, then

sup_{PXY} E[R(f̂_{D,n})] < inf_{f_n} sup_{PXY} E[R(f_n)].

Based on this, we establish that there also exist semi-supervised learners, denoted f̂_{m,n}, that use m unlabeled examples in addition to the n labeled examples in order to estimate the decision sets, and which perform as well as f̂_{D,n}, provided that m grows appropriately relative to n. Specifically, if the error bound for f̂_{D,n} decays polynomially (exponentially) in n, then the number of unlabeled data m needs to grow polynomially (exponentially) with the number of labeled data n. We provide general results for a broad range of learning problems using finite sample error bounds. Then we examine a concrete instantiation of these general results in the regression setting by deriving minimax lower bounds on the performance of any supervised learner and comparing them to upper bounds on the errors of f̂_{D,n} and f̂_{m,n}.

∗Supported in part by the NSF grants CCF-0353079, CCF-0350213, and CNS-0519824.
†Supported in part by the Wisconsin Alumni Research Foundation.

Figure 1: (a) Two separated high density sets with different labels that (b) cannot be discerned if the sample size is too small, but (c) can be estimated if the sample density is high enough.

In their seminal papers, Castelli and Cover [8, 9] suggested that in the classification setting the marginal distribution can be viewed as a mixture of class conditional distributions.
If this mixture is identifiable, then the classification problem may reduce to a simple hypothesis testing problem for which the error converges exponentially fast in the number of labeled examples. The ideas in this paper are similar, except that we do not require identifiability of the mixture component densities, and show that it suffices to only approximately learn the decision sets over which the label is smooth.

More recent attempts at theoretically characterizing SSL have been relatively pessimistic. Rigollet [3] establishes that for a fixed collection of distributions satisfying a cluster assumption, unlabeled data do not provide an improvement in convergence rate. A similar argument was made by Lafferty and Wasserman [4], based on the work of Bickel and Li [10], for the manifold case. However, in a recent paper, Niyogi [6] gives a constructive example of a class of distributions supported on a manifold whose complexity increases with the number of labeled examples, and he shows that the error of any supervised learner is bounded from below by a constant, whereas there exists a semi-supervised learner that can provide an error bound of O(n^{-1/2}), assuming infinite unlabeled data.

In this paper, we bridge the gap between these seemingly conflicting views. Our arguments can be understood by the simple example shown in Fig. 1, where the distribution is supported on two component sets separated by a margin γ and the target function is smooth over each component. Given a finite sample of data, these decision sets may or may not be discernable depending on the sampling density (see Fig. 1(b), (c)). If γ is fixed (this is similar to fixing the class of cluster-based distributions in [3] or the manifold in [4, 10]), then given enough labeled data a supervised learner can achieve optimal performance (since, eventually, it operates in regime (c) of Fig. 1).
Thus, in this example, there is no improvement due to unlabeled data in terms of the rate of error convergence for a fixed collection of distributions. However, since the true separation between the component sets is unknown, given a finite sample of data, there always exists a distribution for which these sets are indiscernible (e.g., γ → 0). This perspective is similar in spirit to the argument in [6]. We claim that meaningful characterizations of SSL performance and quantifications of the value of unlabeled data require finite sample error bounds, and that rates of convergence and asymptotic analysis may not capture the distinctions between SSL and supervised learning. Simply stated, if the component density sets are discernable from a finite sample size m of unlabeled data, but not from a finite sample size n < m of labeled data, then SSL can provide better performance than supervised learning. We also show that there are certain plausible situations in which SSL yields rates of convergence that cannot be achieved by any supervised learner.

Figure 2: Margin γ measures the minimum width of a decision set or separation between the support sets of the component marginal mixture densities. The margin is positive if the component support sets are disjoint, and negative otherwise.

2 Characterization of model distributions under the cluster assumption

Based on the cluster assumption [7, 3, 4], we define the following collection of joint distributions PXY(γ) = PX × PY|X indexed by a margin parameter γ.
Let X, Y be bounded random variables with marginal distribution PX ∈ PX and conditional label distribution PY|X ∈ PY|X, supported on the domain X = [0,1]^d.

The marginal density p(x) = Σ_{k=1}^K a_k p_k(x) is the mixture of a finite, but unknown, number of component densities {p_k}_{k=1}^K, where K < ∞. The unknown mixing proportions satisfy a_k ≥ a > 0 and Σ_{k=1}^K a_k = 1. In addition, we place the following assumptions on the mixture component densities:

1. p_k is supported on a unique compact, connected set C_k ⊆ X with Lipschitz boundaries. Specifically, we assume the following form for the component support sets (see Fig. 2 for a d = 2 illustration):

C_k = {x ≡ (x_1, ..., x_d) ∈ X : g_k^{(1)}(x_1, ..., x_{d−1}) ≤ x_d ≤ g_k^{(2)}(x_1, ..., x_{d−1})},

where g_k^{(1)}(·), g_k^{(2)}(·) are d − 1 dimensional Lipschitz functions with Lipschitz constant L.¹

2. p_k is bounded from above and below, 0 < b ≤ p_k ≤ B.

3. p_k is Hölder-α smooth on C_k with Hölder constant K_1 [12, 13].

Let the conditional label density on C_k be denoted by p_k(Y|X = x). Thus, a labeled training point (X, Y) is obtained as follows. With probability a_k, X is drawn from p_k and Y is drawn from p_k(Y|X = x). In the supervised setting, we assume access to n labeled data L = {X_i, Y_i}_{i=1}^n drawn i.i.d. according to PXY ∈ PXY(γ), and in the semi-supervised setting, we assume access to m additional unlabeled data U = {X_i}_{i=1}^m drawn i.i.d. according to PX ∈ PX.

Let D denote the collection of all non-empty sets obtained as intersections of {C_k}_{k=1}^K or their complements {C_k^c}_{k=1}^K, excluding the set ∩_{k=1}^K C_k^c that does not lie in the support of the marginal density. Observe that |D| ≤ 2^K, and in practical situations the cardinality of D is much smaller, as only a few of the sets are non-empty. The cluster assumption is that the target function will be smooth on each set D ∈ D; hence the sets in D are called decision sets.
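The collection D can be made concrete with a small sketch. Assuming, purely for illustration, that each component support C_k is represented by a boolean indicator list over a discretized domain (the paper's definition is over continuous sets), the nonempty decision sets are exactly the groups of points sharing a membership pattern across the K supports:

```python
def decision_sets(supports):
    # Enumerate nonempty decision sets D: intersections of the component
    # supports C_k and their complements, excluding the cell that lies
    # outside every support (illustrative discretized version).
    K = len(supports)
    n_pts = len(supports[0])
    cells = {}
    for i in range(n_pts):
        # membership pattern of grid point i: which C_k contain it
        pattern = tuple(supports[k][i] for k in range(K))
        if any(pattern):  # exclude the all-complements cell
            cells.setdefault(pattern, []).append(i)
    return cells

# Two overlapping 1-D supports on a grid of 10 points:
C1 = [p < 6 for p in range(10)]   # C_1 covers points 0..5
C2 = [p >= 4 for p in range(10)]  # C_2 covers points 4..9
D = decision_sets([C1, C2])
# Three nonempty decision sets arise: C1 minus C2, the overlap C1 with C2,
# and C2 minus C1 -- consistent with the bound |D| <= 2^K.
```

The dictionary keys play the role of the set labels; with K components there are at most 2^K patterns, matching the cardinality bound above.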
At this point, we do not consider a specific target function.

The collection PXY is indexed by a margin parameter γ, which denotes the minimum width of a decision set or separation between the component support sets C_k. The margin γ is assigned a positive sign if there is no overlap between components, otherwise it is assigned a negative sign, as illustrated in Figure 2. Formally, for j, k ∈ {1, ..., K}, let

d_{jk} := min_{p,q ∈ {1,2}} ||g_j^{(p)} − g_k^{(q)}||_∞ for j ≠ k,   d_{kk} := ||g_k^{(1)} − g_k^{(2)}||_∞.

Then the margin is defined as

γ = σ · min_{j,k ∈ {1,...,K}} d_{jk},   where σ = 1 if C_j ∩ C_k = ∅ for all j ≠ k, and σ = −1 otherwise.

¹This form is a slight generalization of the boundary fragment class of sets which is used as a common tool for analysis of learning problems [11]. Boundary fragment sets capture the salient characteristics of more general decision sets since, locally, the boundaries of general sets are like fragments in a certain orientation.

3 Learning Decision Sets

Ideally, we would like to break a given learning task into separate subproblems on each D ∈ D since the target function is smooth on each decision set. Note that the marginal density p is also smooth within each decision set, but exhibits jumps at the boundaries since the component densities are bounded away from zero. Hence, the collection D can be learnt from unlabeled data as follows:

1) Marginal density estimation — The procedure is based on the sup-norm kernel density estimator proposed in [14]. Consider a uniform square grid over the domain X = [0,1]^d with spacing 2h_m, where h_m = κ_0((log m)^2/m)^{1/d} and κ_0 > 0 is a constant. For any point x ∈ X, let [x] denote the closest point on the grid.
Let G denote the kernel and H_m = h_m I; then the estimator of p(x) is

p̂(x) = (1/(m h_m^d)) Σ_{i=1}^m G(H_m^{−1}(X_i − [x])).

2) Decision set estimation — Two points x1, x2 ∈ X are said to be connected, denoted by x1 ↔ x2, if there exists a sequence of points x1 = z_1, z_2, ..., z_{l−1}, z_l = x2 such that z_2, ..., z_{l−1} ∈ U, ||z_j − z_{j+1}|| ≤ 2√d h_m, and for all points that satisfy ||z_i − z_j|| ≤ h_m log m, |p̂(z_i) − p̂(z_j)| ≤ δ_m := (log m)^{−1/3}. That is, there exists a sequence of 2√d h_m-dense unlabeled data points between x1 and x2 such that the marginal density varies smoothly along the sequence. All points that are pairwise connected specify an empirical decision set. This decision set estimation procedure is similar in spirit to the semi-supervised learning algorithm proposed in [15]. In practice, these sequences only need to be evaluated for the test and labeled training points.

The following lemma shows that if the margin is large relative to the average spacing m^{−1/d} between unlabeled data points, then with high probability, two points are connected if and only if they lie in the same decision set D ∈ D, provided the points are not too close to the decision boundaries. The proof sketch of the lemma and all other results are deferred to Section 7.

Lemma 1.
Let ∂D denote the boundary of D and define the set of boundary points as

B = {x : inf_{z ∈ ∪_{D∈D} ∂D} ||x − z|| ≤ 2√d h_m}.

If |γ| > C_o(m/(log m)^2)^{−1/d}, where C_o = 6√d κ_0, then for all p ∈ PX, all pairs of points x1, x2 ∈ supp(p) \ B and all D ∈ D, with probability > 1 − 1/m,

x1, x2 ∈ D if and only if x1 ↔ x2,

for large enough m ≥ m_0, where m_0 depends only on the fixed parameters of the class PXY(γ).

4 SSL Performance and the Value of Unlabeled Data

We now state our main result, which characterizes the performance of SSL relative to supervised learning and follows as a corollary to the lemma stated above. Let R denote a risk of interest and E(f̂) = R(f̂) − R*, where R* is the infimum risk over all possible learners.

Corollary 1. Assume that the excess risk E is bounded. Suppose there exists a clairvoyant supervised learner f̂_{D,n}, with perfect knowledge of the decision sets D, for which the following finite sample upper bound holds:

sup_{PXY(γ)} E[E(f̂_{D,n})] ≤ ε2(n).

Then there exists a semi-supervised learner f̂_{m,n} such that if |γ| > C_o(m/(log m)^2)^{−1/d},

sup_{PXY(γ)} E[E(f̂_{m,n})] ≤ ε2(n) + O(1/m + n(m/(log m)^2)^{−1/d}).

This result captures the essence of the relative characterization of semi-supervised and supervised learning for the margin based model distributions.
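The two-step procedure of Section 3 that underlies Lemma 1 can be sketched in a simplified 1-D form. This is not the paper's estimator verbatim: for illustration we use a boxcar kernel on the raw points rather than the gridded sup-norm estimator, and we link neighboring points directly rather than testing all pairs along a chain; the function names and the constants h and delta are illustrative:

```python
def kde(x, pts, h):
    # Boxcar kernel density estimate at x (simplified stand-in for the
    # gridded sup-norm estimator of Section 3).
    return sum(abs(p - x) <= h for p in pts) / (len(pts) * 2 * h)

def empirical_decision_sets(unlabeled, h, delta):
    # Group unlabeled points into empirical decision sets: neighbors are
    # connected when they are 2h-close and the estimated density varies
    # by at most delta between them; components of this relation are the
    # empirical decision sets.
    pts = sorted(unlabeled)
    dens = [kde(p, pts, h) for p in pts]
    sets, current = [], [pts[0]]
    for i in range(1, len(pts)):
        close = pts[i] - pts[i - 1] <= 2 * h
        smooth = abs(dens[i] - dens[i - 1]) <= delta
        if close and smooth:
            current.append(pts[i])
        else:
            sets.append(current)
            current = [pts[i]]
    sets.append(current)
    return sets

# Two clusters separated by a gap much wider than 2h, as in Fig. 1(c):
U = [i / 100 for i in range(0, 30)] + [i / 100 for i in range(60, 90)]
groups = empirical_decision_sets(U, h=0.02, delta=1.0)
```

With a margin (0.3 here) far exceeding the spacing of the unlabeled sample, the two support sets are recovered as separate groups, which is the discernability condition |γ| > C_o(m/(log m)^2)^{−1/d} in miniature.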
It suggests that if the sets D are discernable using unlabeled data (the margin is large enough compared to the average spacing between unlabeled data points), then there exists a semi-supervised learner that can perform as well as a supervised learner with clairvoyant knowledge of the decision sets, provided m ≫ n so that (n/ε2(n))^d = O(m/(log m)^2), implying that the additional term in the performance bound for f̂_{m,n} is negligible compared to ε2(n). This indicates that if ε2(n) decays polynomially (exponentially) in n, then m needs to grow polynomially (exponentially) in n.

Further, suppose that the following finite sample lower bound holds for any supervised learner:

inf_{f_n} sup_{PXY(γ)} E[E(f_n)] ≥ ε1(n).

If ε2(n) < ε1(n), then there exists a clairvoyant supervised learner with perfect knowledge of the decision sets that outperforms any supervised learner that does not have this knowledge. Hence, Corollary 1 implies that SSL can provide better performance than any supervised learner provided (i) m ≫ n so that (n/ε2(n))^d = O(m/(log m)^2), and (ii) knowledge of the decision sets simplifies the supervised learning task, so that ε2(n) < ε1(n). In the next section, we provide a concrete application of this result in the regression setting. As a simple example in the binary classification setting, if p(x) is supported on two disjoint sets and if P(Y = 1|X = x) is strictly greater than 1/2 on one set and strictly less than 1/2 on the other, then perfect knowledge of the decision sets reduces the problem to a hypothesis testing problem for which ε2(n) = O(e^{−ζn}), for some constant ζ > 0. However, if γ is small relative to the average spacing n^{−1/d} between labeled data points, then ε1(n) = cn^{−1/d}, where c > 0 is a constant.
This lower bound follows from the minimax lower bound proofs for regression in the next section. Thus, an exponential improvement is possible using semi-supervised learning provided m grows exponentially in n.

5 Density-adaptive Regression

Let Y denote a continuous and bounded random variable. Under squared error loss, the target function is f(x) = E[Y|X = x], and E(f̂) = E[(f̂(X) − f(X))^2]. Recall that p_k(Y|X = x) is the conditional density on the k-th component and let E_k denote expectation with respect to the corresponding conditional distribution. The regression function on each component is f_k(x) = E_k[Y|X = x] and we assume that for k = 1, ..., K:

1. f_k is uniformly bounded, |f_k| ≤ M.

2. f_k is Hölder-α smooth on C_k with Hölder constant K_2.

This implies that the overall regression function f(x) is piecewise Hölder-α smooth; i.e., it is Hölder-α smooth on each D ∈ D, except possibly at the component boundaries.² Since a Hölder-α smooth function can be locally well-approximated by a Taylor polynomial, we propose the following semi-supervised learner that performs local polynomial fits within each empirical decision set, that is, using training data that are connected as per the definition in Section 3. While a spatially uniform estimator suffices when the decision sets are discernable, we use the following spatially adaptive estimator proposed in Section 4.1 of [12].
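As a simplified stand-in for the penalized, spatially adaptive estimator of [12], the idea of fitting separately within each empirical decision set can be sketched with an ordinary least-squares line fit (degree-1 local polynomial) restricted to the labeled points connected to the query point; the `connected` oracle and all names here are illustrative assumptions, not the paper's construction:

```python
def polyfit1(xs, ys):
    # Closed-form least-squares line fit: a degree-1 stand-in for the
    # piecewise polynomial fits of degree [alpha].
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = 0.0 if sxx == 0 else num / sxx
    return slope, my - slope * mx

def ssl_regress(x, labeled, connected):
    # Predict f(x) using only labeled points in the same empirical
    # decision set as x; `connected` plays the role of the x <-> X_i test.
    pairs = [(xi, yi) for xi, yi in labeled if connected(x, xi)]
    a, b = polyfit1([p[0] for p in pairs], [p[1] for p in pairs])
    return a * x + b

# Piecewise-smooth target: f = 2x on [0, 0.3], f = 1 - x on [0.6, 1].
labeled = [(x / 10, 2 * (x / 10)) for x in range(4)] + \
          [(x / 10, 1 - x / 10) for x in range(6, 11)]

def same_set(a, b):  # decision-set oracle gleaned from unlabeled data
    return (a <= 0.3) == (b <= 0.3)
```

Because the fit never averages across the jump at the boundary, the prediction inside each set matches the local linear piece, which is the benefit clairvoyant (or unlabeled-data) knowledge of D provides.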
This ensures that when the decision sets are indiscernible using unlabeled data, the semi-supervised learner still achieves an error bound that is, up to logarithmic factors, no worse than the minimax lower bound for supervised learners. The estimator is

f̂_{m,n,x}(·) = arg min_{f′ ∈ Γ} Σ_{i=1}^n (Y_i − f′(X_i))^2 1_{x↔X_i} + pen(f′),   and   f̂_{m,n}(x) ≡ f̂_{m,n,x}(x).

Here 1_{x↔X_i} is the indicator of x ↔ X_i, and Γ denotes a collection of piecewise polynomials of degree [α] (the maximal integer < α) defined over recursive dyadic partitions of the domain X = [0,1]^d with cells of sidelength between 2^{−⌈log(n/log n)/(2α+d)⌉} and 2^{−⌈log(n/log n)/d⌉}. The penalty term pen(f′) is proportional to log(Σ_{i=1}^n 1_{x↔X_i}) · #f′, where #f′ denotes the number of cells in the recursive dyadic partition on which f′ is defined. It is shown in [12] that this estimator yields a finite sample error bound of n^{−2α/(2α+d)} for Hölder-α smooth functions, and max{n^{−2α/(2α+d)}, n^{−1/d}} for piecewise Hölder-α functions, ignoring logarithmic factors.

Using these results from [12] and Corollary 1, we now state finite sample upper bounds on the semi-supervised learner (SSL) described above. Also, we derive finite sample minimax lower bounds on the performance of any supervised learner (SL). Our main results are summarized in the following table, for model distributions characterized by various values of the margin parameter γ. A sketch of the derivations of the results is provided in Section 7.3. Here we assume that the dimension d ≥ 2α/(2α − 1). If d < 2α/(2α − 1), then the supervised learning error due to not resolving the decision sets (which behaves like n^{−1/d}) is smaller than the error incurred in estimating the target function itself (which behaves like n^{−2α/(2α+d)}). Thus, when d < 2α/(2α − 1), the supervised regression error is dominated by the error in smooth regions and there appears to be no benefit to using a semi-supervised learner. In the table, we suppress constants and log factors in the bounds, and also assume that m ≫ n^{2d} so that (n/ε2(n))^d = O(m/(log m)^2). The constants c_o and C_o only depend on the fixed parameters of the class PXY(γ) and do not depend on γ.

Margin range γ                                           SSL upper bound ε2(n)   SL lower bound ε1(n)   SSL helps
γ ≥ γ0                                                   n^{−2α/(2α+d)}          n^{−2α/(2α+d)}         No
γ0 > γ ≥ c_o n^{−1/d}                                    n^{−2α/(2α+d)}          n^{−2α/(2α+d)}         No
c_o n^{−1/d} > γ ≥ C_o(m/(log m)^2)^{−1/d}               n^{−2α/(2α+d)}          n^{−1/d}               Yes
C_o(m/(log m)^2)^{−1/d} > γ ≥ −C_o(m/(log m)^2)^{−1/d}   n^{−1/d}                n^{−1/d}               No
−C_o(m/(log m)^2)^{−1/d} > γ ≥ −γ0                       n^{−2α/(2α+d)}          n^{−1/d}               Yes
−γ0 > γ                                                  n^{−2α/(2α+d)}          n^{−1/d}               Yes

²If the component marginal densities and regression functions have different smoothnesses, say α and β, the same analysis holds except that f(x) is Hölder-min(α, β) smooth on each D ∈ D.

If γ is large relative to the average spacing between labeled data points n^{−1/d}, then a supervised learner can discern the decision sets accurately and SSL provides no gain. However, if γ > 0 is small relative to n^{−1/d} but large with respect to the spacing between unlabeled data points m^{−1/d}, then the proposed semi-supervised learner provides improved error bounds compared to any supervised learner.
If |γ| is smaller than m^{−1/d}, the decision sets are not discernable with unlabeled data and SSL provides no gain. However, notice that the performance of the semi-supervised learner is no worse than the minimax lower bound for supervised learners. In the γ < 0 case, if −γ is larger than m^{−1/d}, then the semi-supervised learner can discern the decision sets and achieves smaller error bounds, whereas these sets cannot be as accurately discerned by any supervised learner. For the overlap case (γ < 0), supervised learners are always limited by the error incurred due to averaging across decision sets (n^{−1/d}). In particular, for the collection of distributions with γ < −γ0, a faster rate of error convergence is attained by SSL compared to SL, provided m ≫ n^{2d}.

6 Conclusions

In this paper, we develop a framework for evaluating the performance gains possible with semi-supervised learning under a cluster assumption using finite sample error bounds. The theoretical characterization we present explains why in certain situations unlabeled data can help to improve learning, while in other situations they may not. We demonstrate that there exist general situations under which semi-supervised learning can be significantly superior to supervised learning in terms of achieving smaller finite sample error bounds than any supervised learner, and sometimes in terms of a better rate of error convergence. Moreover, our results also provide a quantification of the relative value of unlabeled to labeled data. While we focus on the cluster assumption in this paper, we conjecture that similar techniques can be applied to quantify the performance of semi-supervised learning under the manifold assumption as well.
In particular, we believe that the use of minimax lower bounding techniques is essential because many of the interesting distinctions between supervised and semi-supervised learning occur only in finite sample regimes, and rates of convergence and asymptotic analyses may not capture the complete picture.

7 Proofs

We sketch the main ideas behind the proofs here; please refer to [13] for details. Since the component densities are bounded from below and above, define p_min := b min_k a_k ≤ p(x) ≤ B =: p_max.

7.1 Proof of Lemma 1

First, we state two relatively straightforward results about the proposed kernel density estimator.

Theorem 1 (Sup-norm density estimation of non-boundary points). Consider the kernel density estimator p̂(x) proposed in Section 3. If the kernel G satisfies supp(G) = [−1,1]^d, 0 < G ≤ G_max < ∞, ∫_{[−1,1]^d} G(u) du = 1 and ∫_{[−1,1]^d} u^j G(u) du = 0 for 1 ≤ j ≤ [α], then for all p ∈ PX, with probability at least 1 − 1/m,

sup_{x ∈ supp(p)\B} |p(x) − p̂(x)| = O(h_m^{min(1,α)} + √(log m/(m h_m^d))) =: ε_m.

Notice that ε_m decreases with increasing m. A detailed proof can be found in [13].

Corollary 2 (Empirical density of unlabeled data). Under the conditions of Theorem 1, for all p ∈ PX and large enough m, with probability > 1 − 1/m, for all x ∈ supp(p) \ B, there exists X_i ∈ U such that ||X_i − x|| ≤ √d h_m.

Proof. From Theorem 1, for all x ∈ supp(p) \ B, p̂(x) ≥ p(x) − ε_m ≥ p_min − ε_m > 0, for m sufficiently large. This implies Σ_{i=1}^m G(H_m^{−1}(X_i − x)) > 0, and there exists X_i ∈ U within √d h_m of x.

Using the density estimation results, we now show that if |γ| > 6√d h_m, then for all p ∈ PX, all pairs of points x1, x2 ∈ supp(p) \ B and all D ∈ D, for large enough m, with probability > 1 − 1/m, we have x1, x2 ∈ D if and only if x1 ↔ x2. We establish this in two steps:

1. x1 ∈ D, x2 ∉ D ⇒ x1 ↮ x2: Since x1 and x2 belong to different decision sets, all sequences connecting x1 and x2 through unlabeled data points pass through a region where either (i) the density is zero, and since the region is at least |γ| > 6√d h_m wide, there cannot exist a sequence as defined in Section 3 such that ||z_j − z_{j+1}|| ≤ 2√d h_m; or (ii) the density is positive. In the latter case, the marginal density p(x) jumps by at least p_min one or more times along all sequences connecting x1 and x2. Suppose the first jump occurs where decision set D ends and another decision set D′ ≠ D begins (in the sequence). Then, since D′ is at least |γ| > 6√d h_m wide, by Corollary 2, for all sequences connecting x1 and x2 through unlabeled data points there exist points z_i, z_j in the sequence that lie in D \ B and D′ \ B, respectively, with ||z_i − z_j|| ≤ h_m log m. Since the density on each decision set is Hölder-α smooth, we have |p(z_i) − p(z_j)| ≥ p_min − O((h_m log m)^{min(1,α)}). Since z_i, z_j ∉ B, using Theorem 1, |p̂(z_i) − p̂(z_j)| ≥ |p(z_i) − p(z_j)| − 2ε_m > δ_m for large enough m. Thus, x1 ↮ x2.

2.
x1, x2 ∈ D ⇒ x1 ↔ x2: Since D has width at least |γ| > 6√d h_m, there exists a region of width > 2√d h_m contained in D \ B, and Corollary 2 implies that with probability > 1 − 1/m, there exist sequence(s) contained in D \ B connecting x1 and x2 through 2√d h_m-dense unlabeled data points. Since the sequence is contained in D and the density on D is Hölder-α smooth, we have for all points z_i, z_j in the sequence that satisfy ||z_i − z_j|| ≤ h_m log m, |p(z_i) − p(z_j)| ≤ O((h_m log m)^{min(1,α)}). Since z_i, z_j ∉ B, using Theorem 1, |p̂(z_i) − p̂(z_j)| ≤ |p(z_i) − p(z_j)| + 2ε_m ≤ δ_m for large enough m. Thus, x1 ↔ x2.

7.2 Proof of Corollary 1

Let Ω1 denote the event under which Lemma 1 holds. Then P(Ω1^c) ≤ 1/m. Let Ω2 denote the event that the test point X and training data X_1, ..., X_n ∈ L don't lie in B. Then P(Ω2^c) ≤ (n + 1)P(B) ≤ (n + 1) p_max vol(B) = O(n h_m). The last step follows from the definition of the set B and since the boundaries of the support sets are Lipschitz, K is finite, and hence vol(B) = O(h_m).

Now observe that f̂_{D,n} essentially uses the clairvoyant knowledge of the decision sets D to discern which labeled points X_1, ..., X_n are in the same decision set as X. Conditioning on Ω1, Ω2, Lemma 1 implies that X, X_i ∈ D iff X ↔ X_i.
Thus, we can define a semi-supervised learner f̂_{m,n} to be the same as f̂_{D,n} except that instead of using clairvoyant knowledge of whether X, X_i ∈ D, f̂_{m,n} is based on whether X ↔ X_i. It follows that sup_{PXY(γ)} E[E(f̂_{m,n})|Ω1, Ω2] = sup_{PXY(γ)} E[E(f̂_{D,n})], and since the excess risk is bounded:

sup_{PXY(γ)} E[E(f̂_{m,n})] ≤ sup_{PXY(γ)} E[E(f̂_{m,n})|Ω1, Ω2] + O(1/m + n h_m).

7.3 Density-adaptive Regression results

1) Semi-Supervised Learning Upper Bound: The clairvoyant counterpart of f̂_{m,n}(x) is given as f̂_{D,n}(x) ≡ f̂_{D,n,x}(x), where

f̂_{D,n,x}(·) = arg min_{f′ ∈ Γ} Σ_{i=1}^n (Y_i − f′(X_i))^2 1_{x,X_i ∈ D} + pen(f′),

a standard supervised learner that performs a piecewise polynomial fit on each decision set, where the regression function is Hölder-α smooth. Let n_D = Σ_{i=1}^n 1_{X_i ∈ D}. It follows [12] that

E[(f(X) − f̂_{D,n}(X))^2 1_{X∈D} | n_D] ≤ C (n_D/log n_D)^{−2α/(d+2α)}.

Since E[(f(X) − f̂_{D,n}(X))^2] = Σ_{D∈D} E[(f(X) − f̂_{D,n}(X))^2 | X ∈ D] P(D), taking expectation over n_D ∼ Binomial(n, P(D)) and summing over all decision sets, recalling that |D| is a finite constant, the overall error of f̂_{D,n} scales as n^{−2α/(2α+d)}, ignoring logarithmic factors. If |γ| > C_o(m/(log m)^2)^{−1/d}, using Corollary 1, the same performance bound holds for f̂_{m,n} provided m ≫ n^{2d}. See [13] for further details. If |γ| < C_o(m/(log m)^2)^{−1/d}, the decision sets are not discernable using unlabeled data.
Since the regression function is piecewise Hölder-$\alpha$ smooth on each empirical decision set, using Theorem 9 in [12] and similar analysis, an upper bound of $\max\{n^{-2\alpha/(2\alpha+d)}, n^{-1/d}\}$ follows, which scales as $n^{-1/d}$ when $d \geq 2\alpha/(2\alpha - 1)$.

2) Supervised Learning Lower Bound: The formal minimax proof requires construction of a finite subset of distributions in $P_{XY}(\gamma)$ that are the hardest cases to distinguish based on a finite number of labeled data $n$, and relies on a Hellinger version of Assouad's Lemma (Theorem 2.10 (iii) in [16]). Complete details are given in [13]. Here we present the simple intuition behind the minimax lower bound of $n^{-1/d}$ when $\gamma < c_o n^{-1/d}$. In this case the decision boundaries can only be localized to an accuracy of $n^{-1/d}$, the average spacing between labeled data points. Since the boundaries are Lipschitz, the expected volume that is incorrectly assigned to any decision set is $> c_1 n^{-1/d}$, where $c_1 > 0$ is a constant. Thus, if the expected excess risk at a point that is incorrectly assigned to a decision set can be greater than a constant $c_2 > 0$, then the overall expected excess risk is $> c_1 c_2 n^{-1/d}$. This is the case for both regression and binary classification. If $\gamma > c_o n^{-1/d}$, the decision sets can be accurately discerned from the labeled data alone. In this case, it follows that the minimax lower bound is equal to the minimax lower bound for Hölder-$\alpha$ smooth regression functions, which is $c n^{-2\alpha/(d+2\alpha)}$, where $c > 0$ is a constant [17].

References

[1] Balcan, M.F., Blum, A.: A PAC-style model for learning from labeled and unlabeled data. In: 18th Annual Conference on Learning Theory, COLT. (2005)

[2] Kääriäinen, M.: Generalization error bounds using unlabeled data. In: 18th Annual Conference on Learning Theory, COLT. (2005)

[3] Rigollet, P.: Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research 8 (2007) 1369–1392

[4] Lafferty, J., Wasserman, L.: Statistical analysis of semi-supervised regression. In: Advances in Neural Information Processing Systems 21, NIPS. (2007) 801–808

[5] Ben-David, S., Lu, T., Pal, D.: Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In: 21st Annual Conference on Learning Theory, COLT. (2008)

[6] Niyogi, P.: Manifold regularization and semi-supervised learning: Some theoretical analyses. Technical Report TR-2008-01, Computer Science Department, University of Chicago. URL http://people.cs.uchicago.edu/∼niyogi/papersps/ssminimax2.pdf (2008)

[7] Seeger, M.: Learning with labeled and unlabeled data. Technical report, Institute for ANC, Edinburgh, UK. URL http://www.dai.ed.ac.uk/∼seeger/papers.html (2000)

[8] Castelli, V., Cover, T.M.: On the exponential value of labeled samples. Pattern Recognition Letters 16(1) (1995) 105–111

[9] Castelli, V., Cover, T.M.: The relative value of labeled and unlabeled samples in pattern recognition. IEEE Transactions on Information Theory 42(6) (1996) 2102–2117

[10] Bickel, P.J., Li, B.: Local polynomial regression on unknown manifolds. In: IMS Lecture Notes–Monograph Series, Complex Datasets and Inverse Problems: Tomography, Networks and Beyond. Volume 54. (2007) 177–186

[11] Korostelev, A.P., Tsybakov, A.B.: Minimax Theory of Image Reconstruction. Springer, NY (1993)

[12] Castro, R., Willett, R., Nowak, R.: Faster rates in regression via active learning. Technical Report ECE-05-03, ECE Department, University of Wisconsin - Madison. URL http://www.ece.wisc.edu/∼nowak/ECE-05-03.pdf (2005)

[13] Singh, A., Nowak, R., Zhu, X.: Finite sample analysis of semi-supervised learning. Technical Report ECE-08-03, ECE Department, University of Wisconsin - Madison. URL http://www.ece.wisc.edu/∼nowak/SSL_TR.pdf (2008)

[14] Korostelev, A., Nussbaum, M.: The asymptotic minimax constant for sup-norm loss in nonparametric density estimation. Bernoulli 5(6) (1999) 1099–1118

[15] Chapelle, O., Zien, A.: Semi-supervised classification by low density separation. In: Tenth International Workshop on Artificial Intelligence and Statistics. (2005) 57–64

[16] Tsybakov, A.B.: Introduction à l'estimation non-paramétrique. Springer, Berlin Heidelberg (2004)

[17] Stone, C.J.: Optimal rates of convergence for nonparametric estimators. The Annals of Statistics 8(6) (1980) 1348–1360