{"title": "Sufficient Conditions for Agnostic Active Learnable", "book": "Advances in Neural Information Processing Systems", "page_first": 1999, "page_last": 2007, "abstract": "We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous works have shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have advantage. In this paper, we propose intuitively reasonable sufficient conditions under which agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the classification boundary and the underlying distribution are smooth to a finite order, active learning achieves polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.", "full_text": "Suf\ufb01cient Conditions for Agnostic Active Learnable\n\nKey Laboratory of Machine Perception, MOE,\n\nSchool of Electronics Engineering and Computer Science,\n\nLiwei Wang\n\nPeking University,\n\nwanglw@cis.pku.edu.cn\n\nAbstract\n\nWe study pool-based active learning in the presence of noise, i.e. the agnostic set-\nting. Previous works have shown that the effectiveness of agnostic active learning\ndepends on the learning problem and the hypothesis space. Although there are\nmany cases on which active learning is very useful, it is also easy to construct\nexamples that no active learning algorithm can have advantage. In this paper, we\npropose intuitively reasonable suf\ufb01cient conditions under which agnostic active\nlearning algorithm is strictly superior to passive supervised learning. 
We show that under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, active learning achieves a polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.\n\n1 Introduction\n\nActive learning addresses the problem in which the algorithm is given a pool of unlabeled data drawn i.i.d. from some underlying distribution. The algorithm can then pay for the label of any example in the pool. The goal is to learn an accurate classifier by requesting as few labels as possible. This is in contrast with standard passive supervised learning, where the labeled examples are chosen randomly.\n\nThe simplest example that demonstrates the potential of active learning is learning the optimal threshold on an interval. If there exists a perfect threshold separating the two classes (i.e. there is no noise), then binary search needs only O(ln(1/ε)) labels to learn an ε-accurate classifier, while passive learning requires O(1/ε) labels. Another encouraging example is learning a homogeneous linear separator for data uniformly distributed on the unit sphere of R^d. In this case active learning can still give exponential savings in the label complexity [Das05].\n\nHowever, there are also very simple problems on which active learning does not help at all. Suppose the instances are uniformly distributed on [0, 1], and the positive class can be any interval on [0, 1]. Any active learning algorithm needs O(1/ε) label requests to learn an ε-accurate classifier [Han07]. There is no improvement over passive learning. All the above are noise-free (realizable) problems. Of more interest, and more realistic, is the agnostic setting, where the class labels can be noisy, so that the best classifier in the hypothesis space has a non-zero error ν.
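The binary-search example above is easy to make concrete. The sketch below is my own illustration with hypothetical values (pool size, threshold location), not code from the paper: on a pool of 10^4 points it locates the threshold with about log2(10^4) label requests, whereas a passive learner would need on the order of 1/ε random labels for comparable accuracy.

```python
import random

def active_threshold(pool, oracle):
    # Binary-search a sorted unlabeled pool for the decision threshold.
    # oracle(x) returns +1 or -1 with no noise (the realizable case).
    lo, hi, queries = 0, len(pool) - 1, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(pool[mid]) == -1:  # mid is still left of the threshold
            lo = mid
        else:
            hi = mid
    return pool[hi], queries

random.seed(0)
t_star = 0.64                        # hypothetical true threshold
pool = sorted(random.random() for _ in range(10 ** 4))
t_hat, queries = active_threshold(pool, lambda x: 1 if x >= t_star else -1)
print(queries)                       # about log2(10^4), i.e. roughly 14 labels
print(abs(t_hat - t_star))           # error comparable to the pool spacing
```

The query count grows logarithmically in the pool size, matching the O(ln(1/ε)) label complexity quoted above for the noise-free threshold problem.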
For agnostic active learning, there is no active learning algorithm that can always reduce label requests, due to the Ω(ν²/ε²) lower bound on the label complexity [Kaa06].\n\nIt is known that whether active learning helps or not depends on the distribution of the instance-label pairs and the hypothesis space. A natural question is therefore: under what conditions is active learning guaranteed to require fewer labels than passive learning?\n\nIn this paper we propose intuitively reasonable sufficient conditions under which active learning achieves lower label complexity than passive learning. Specifically, we focus on the A2 algorithm [BAL06], which works in the agnostic setting. Earlier work discovered that the label complexity of A2 can be upper bounded by a parameter of the hypothesis space and the data distribution called the disagreement coefficient [Han07]. This parameter often characterizes the intrinsic difficulty of the learning problem. By an analysis of the disagreement coefficient we show that, under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, then A2 gives polynomial savings in the label complexity; if the boundary and the distribution are infinitely smooth, A2 gives exponential savings.\n\n1.1 Related Works\n\nOur work is closely related to [CN07], in which the authors proved sample complexity bounds for problems with smooth classification boundaries under Tsybakov's noise condition [Tsy04]. They also assumed that the distribution of the instances is bounded from above and below.
The main difference from our work is that their analysis is for the membership-query setting [Ang88], in which the learning algorithm may query the label of any point in the instance space; the pool-based model analyzed here assumes the algorithm can only request labels of the instances it observes.\n\nAnother related work is due to Friedman [Fri09]. He introduced a different notion of smoothness and showed that it guarantees exponential improvement for active learning. But his work focuses on the realizable case and does not apply to the agnostic setting studied here.\n\nSoon after A2, Dasgupta, Hsu and Monteleoni [DHM07] proposed an elegant agnostic active learning algorithm. It reduces active learning to a series of supervised learning problems. If the hypothesis space has a finite VC dimension, it has a better label complexity than A2. However, this algorithm relies on a normalized uniform convergence bound for the VC class. It is not known whether such a bound holds for more general hypothesis spaces such as the smooth boundary classes analyzed in this paper. (For recent advances on this topic, see [GKW03].) Whether our results apply to this algorithm via a refined analysis of the normalized bounds is left as an open problem.\n\n2 Preliminaries\n\nLet X be an instance space and D a distribution over X × {−1, 1}. Let H be the hypothesis space, a set of classifiers from X to {±1}. Denote by DX the marginal of D over X. In our active learning model, the algorithm has access to a pool of unlabeled examples drawn from DX. For any unlabeled point x, the algorithm can ask for its label y, which is generated from the conditional distribution at x. The error of a hypothesis h with respect to D is erD(h) = Pr_{(x,y)∼D}(h(x) ≠ y). The empirical error on a finite sample S is erS(h) = (1/|S|) Σ_{(x,y)∈S} I[h(x) ≠ y], where I is the indicator function. We use h∗ to denote the best classifier in H.
That is, h∗ = arg min_{h∈H} erD(h). Let ν = erD(h∗). Our goal is to learn an ĥ ∈ H with error rate at most ν + ε, where ε is a predefined parameter.\n\nA2 is the first rigorous agnostic active learning algorithm. A description is given in Algorithm 1. It was shown that A2 is never much worse than passive learning in terms of label complexity. The key observation behind A2's potential superiority to passive learning is that, since our goal is to choose an ĥ such that erD(ĥ) ≤ erD(h∗) + ε, we only need to compare the errors of hypotheses. Therefore we can request labels only for those x on which the hypotheses under consideration disagree.\n\nTo do this, the algorithm keeps track of two spaces. One is the current version space Vi, consisting of hypotheses that, with statistical confidence, are not too bad compared to h∗. To achieve such a statistical guarantee, the algorithm must be provided with a uniform convergence bound over the hypothesis space. That is, with probability at least 1 − δ over the draw of a sample S according to D,\n\nLB(S, h, δ) ≤ erD(h) ≤ UB(S, h, δ)\n\nholds simultaneously for all h ∈ H, where the lower bound LB(S, h, δ) and upper bound UB(S, h, δ) can be computed from the empirical error erS(h). The other space is the region of disagreement DIS(Vi), which is the set of all x ∈ X on which some hypotheses in Vi disagree. Formally, for any V ⊂ H,\n\nDIS(V) = {x ∈ X : ∃h, h′ ∈ V, h(x) ≠ h′(x)}.\n\nInput: concept space H, accuracy parameter ε ∈ (0, 1), confidence parameter δ ∈ (0, 1);\nOutput: classifier ĥ ∈ H;\nLet n̂ = 2(2 log₂(2/ε) + ln(1/δ)) log₂(2/ε)/λ (λ depends on H and the problem, see Theorem 5);\nLet δ′ = δ/n̂;\nV0 ← H, S0 ← ∅, i ← 0, j1 ← 0, k ← 1;\nwhile Δ(Vi)(min_{h∈Vi} UB(Si, h, δ′) − min_{h∈Vi} LB(Si, h, δ′)) > ε do\n  Vi+1 ← {h ∈ Vi : LB(Si, h, δ′) ≤ min_{h′∈Vi} UB(Si, h′, δ′)};\n  i ← i + 1;\n  if Δ(Vi) < (1/2)Δ(Vjk) then\n    k ← k + 1; jk ← i;\n  end\n  S′i ← Rejection sample 2^{i−jk} samples x from D satisfying x ∈ DIS(Vi);\n  Si ← {(x, y = label(x)) : x ∈ S′i};\nend\nReturn ĥ = arg min_{h∈Vi} UB(Si, h, δ′).\n\nAlgorithm 1: The A2 algorithm (this is the version in [Han07])\n\nThe volume of DIS(V) is denoted by Δ(V) = Pr_{X∼DX}(X ∈ DIS(V)). Requesting labels only for instances from DIS(Vi) allows A2 to require fewer labels than passive learning. Hence the key issue is how fast Δ(Vi) shrinks. This process, and in turn the label complexity of A2, are nicely characterized by the disagreement coefficient θ introduced in [Han07].\n\nDefinition 1 Let ρ(·,·) be the pseudo-metric on a hypothesis space H induced by DX. That is, for h, h′ ∈ H, ρ(h, h′) = Pr_{X∼DX}(h(X) ≠ h′(X)). Let B(h, r) = {h′ ∈ H : ρ(h, h′) ≤ r}.
The disagreement coefficient θ(ε) is\n\nθ(ε) = sup_{r≥ε} Pr_{X∼DX}(X ∈ DIS(B(h∗, r)))/r,   (1)\n\nwhere h∗ = arg min_{h∈H} erD(h).\n\nNote that θ depends on H and D, and 1 ≤ θ(ε) ≤ 1/ε.\n\n3 Main Results\n\nAs mentioned earlier, whether active learning helps or not depends on the distribution and the hypothesis space. There are simple examples, such as learning intervals, for which active learning has no advantage. However, these negative examples are more or less “artificial”. It is important to understand whether problems of practical interest are actively learnable or not. In this section we provide intuitively reasonable conditions under which the A2 algorithm is strictly superior to passive learning. Our main results (Theorem 11 and Theorem 12) show that if the learning problem has a smooth Bayes classification boundary, and the distribution DX has a density bounded by a smooth function, then under some noise condition A2 saves label requests. The improvement is polynomial for finite smoothness, and exponential for infinite smoothness.\n\nIn Section 3.1 we formally define smoothness and introduce the hypothesis space, which contains smooth classifiers. We show a uniform convergence bound of order O(n^{−1/2}) for this hypothesis space. This bound determines UB(S, h, δ) and LB(S, h, δ) in A2. Section 3.2 is the main technical part, where we give upper bounds for the disagreement coefficient of smooth problems. In Section 3.3 we show that under some noise condition, there is a sharper bound for the label complexity in terms of the disagreement coefficient. These lead to our main results.\n\n3.1 Smoothness\n\nLet f be a function defined on Ω ⊂ R^d. For any vector k = (k1, ..., kd) of d nonnegative integers, let |k| = Σ_{i=1}^d ki.
Define the K-norm as\n\n‖f‖_K := max_{|k|≤K−1} sup_{x∈Ω} |D^k f(x)| + max_{|k|=K−1} sup_{x,x′∈Ω} |D^k f(x) − D^k f(x′)| / ‖x − x′‖,   (2)\n\nwhere\n\nD^k = ∂^{|k|} / (∂^{k1}x1 ··· ∂^{kd}xd)\n\nis the differential operator.\n\nDefinition 2 (Finite Smooth Functions) A function f is said to be Kth order smooth with respect to a constant C if ‖f‖_K ≤ C. The set of Kth order smooth functions is defined as\n\nF^K_C := {f : ‖f‖_K ≤ C}.   (3)\n\nThus Kth order smooth functions have uniformly bounded partial derivatives up to order K − 1, and the (K − 1)th order partial derivatives are Lipschitz.\n\nDefinition 3 (Infinitely Smooth Functions) A function f is said to be infinitely smooth with respect to a constant C if ‖f‖_K ≤ C for all nonnegative integers K. The set of infinitely smooth functions is denoted by F^∞_C.\n\nWith these definitions of smoothness, we introduce the hypothesis space we use in the A2 algorithm.\n\nDefinition 4 (Hypotheses with Smooth Boundaries) A set of hypotheses H^K_C defined on [0, 1]^{d+1} is said to have Kth order smooth boundaries if for every h ∈ H^K_C the classification boundary is a Kth order smooth function on [0, 1]^d. To be precise, let x = (x1, x2, ..., x_{d+1}) ∈ [0, 1]^{d+1}. The classification boundary is the graph of a function x_{d+1} = f(x1, ..., xd), where f ∈ F^K_C. Similarly, a hypothesis space H^∞_C is said to have infinitely smooth boundaries if for every h ∈ H^∞_C the classification boundary is the graph of an infinitely smooth function on [0, 1]^d.\n\nPrevious results on the label complexity of A2 assume that the hypothesis space has finite VC dimension.
The goal is to ensure an O(n^{−1/2}) uniform convergence bound so that UB(S, h, δ) − LB(S, h, δ) = O(n^{−1/2}). The hypothesis spaces H^K_C and H^∞_C do not have finite VC dimension. Compared with VC classes, H^K_C and H^∞_C are exponentially larger in terms of covering numbers [vdVW96]. But a uniform convergence bound still holds for H^K_C and H^∞_C under a broad class of distributions. The following theorem is a consequence of some known results on empirical processes.\n\nTheorem 5 For any distribution D over [0, 1]^{d+1} × {−1, 1} whose marginal distribution DX on [0, 1]^{d+1} has a density upper bounded by a constant M, and any 0 < δ ≤ δ0 (δ0 is a constant), with probability at least 1 − δ over the draw of a training set S of n examples,\n\n|erD(h) − erS(h)| ≤ λ √(log(1/δ)/n),   (4)\n\nholds simultaneously for all h ∈ H^K_C provided K > d (and for all h ∈ H^∞_C with K = ∞). Here λ is a constant depending only on d, K, C and M.\n\nProof It can be seen from Corollary 2.7.3 in [vdVW96] that the bracketing numbers N_[ ] of H^K_C satisfy log N_[ ](ε, H^K_C, L2(DX)) = O((1/ε)^{2d/K}). Since K > d, there exist constants c1, c2 such that\n\nP_D(sup_{h∈H^K_C} |erD(h) − erS(h)| ≥ t) ≤ c1 exp(−nt²/c2)\n\nfor all nt² ≥ t0, where t0 is some constant (see Theorem 5.11 and Lemma 5.10 of [vdG00]).
Setting δ = c1 exp(−nt²/c2) and solving for t, the theorem follows.\n\nNow we can determine UB(S, h, δ) and LB(S, h, δ) for A2 by simply letting UB(S, h, δ) = erS(h) + λ √(ln(1/δ)/n) and LB(S, h, δ) = erS(h) − λ √(ln(1/δ)/n), where S is of size n.\n\n3.2 Disagreement Coefficient\n\nThe disagreement coefficient θ plays an important role in the label complexity of active learning algorithms. In fact, the previous negative examples on which active learning does not work are all the result of a large θ. For instance, in the interval learning problem θ(ε) = 1/ε, which leads to the same label complexity as passive learning. In the following two theorems we show that the disagreement coefficient θ(ε) for smooth problems is small.\n\nTheorem 6 Let the hypothesis space be H^K_C. If the distribution DX has a density p(x1, ..., x_{d+1}) such that there exist a Kth order smooth function g(x1, ..., x_{d+1}) and two constants 0 < α ≤ β such that αg(x1, ..., x_{d+1}) ≤ p(x1, ..., x_{d+1}) ≤ βg(x1, ..., x_{d+1}) for all (x1, ..., x_{d+1}) ∈ [0, 1]^{d+1}, then θ(ε) = O((1/ε)^{d/(K+d)}).\n\nTheorem 7 Let the hypothesis space be H^∞_C. If the distribution DX has a density p(x1, ..., x_{d+1}) such that there exist an infinitely smooth function g(x1, ..., x_{d+1}) and two constants 0 < α ≤ β such that αg(x1, ..., x_{d+1}) ≤ p(x1, ..., x_{d+1}) ≤ βg(x1, ..., x_{d+1}) for all (x1, ..., x_{d+1}) ∈ [0, 1]^{d+1}, then θ(ε) = O(log^d(1/ε)).\n\nThe key points in the theorems are: the classification boundaries are smooth, and the density is bounded from above and below by constants times a smooth function.
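To build intuition for these statements, the disagreement coefficient of Definition 1 can be evaluated in closed form for two standard one-dimensional examples; the short check below is my own illustration, not a computation from the theorems. For thresholds on [0, 1] under the uniform distribution, DIS(B(h∗, r)) is an interval of width 2r around the optimal threshold, so θ(ε) = 2; for the interval class with an empty optimal interval, every point lies in DIS(B(h∗, r)), so θ(ε) = 1/ε.

```python
import numpy as np

def theta_threshold(eps, grid=10 ** 5):
    # Thresholds under uniform D_X: Pr(DIS(B(h*, r))) = min(2r, 1).
    rs = np.linspace(eps, 1.0, grid)
    return float(np.max(np.minimum(2.0 * rs, 1.0) / rs))

def theta_interval(eps, grid=10 ** 5):
    # Intervals with an empty optimal interval: Pr(DIS(B(h*, r))) = 1,
    # since every point lies in some interval of probability mass <= r.
    rs = np.linspace(eps, 1.0, grid)
    return float(np.max(1.0 / rs))

print(theta_threshold(1e-3))  # 2.0: bounded, so active learning helps
print(theta_interval(1e-3))   # ~ 1000.0 = 1/eps: no asymptotic improvement
```

The bounded coefficient in the first case and the 1/ε growth in the second match the two regimes discussed above.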
These two conditions cover a large class of learning problems. Note that the density itself is not necessarily smooth; we only require that the density not change too rapidly.\n\nThe intuition behind the two theorems is as follows. Let fh∗(x) and fh(x) be the classification boundaries of h∗ and h, and suppose ρ(h, h∗) is small, where ρ(h, h∗) = Pr_{x∼DX}(h(x) ≠ h∗(x)) is the pseudo-metric. If the classification boundaries and the density are all smooth, then the two boundaries have to be close to each other everywhere. That is, |fh(x) − fh∗(x)| is uniformly small for all x. Hence only the points close to the classification boundary of h∗ can be in DIS(B(h∗, ε)), which leads to a small disagreement coefficient.\n\nThe proofs of Theorem 6 and Theorem 7 rely on the following two lemmas.\n\nLemma 8 Let Φ be a function defined on [0, 1]^d with ∫_{[0,1]^d} |Φ(x)|dx ≤ r. If there exist a Kth order smooth function ˜Φ and 0 < α ≤ β such that α|˜Φ(x)| ≤ |Φ(x)| ≤ β|˜Φ(x)| for all x ∈ [0, 1]^d, then ‖Φ‖∞ = O(r^{K/(K+d)}) = O(r · (1/r)^{d/(K+d)}), where ‖Φ‖∞ = sup_{x∈[0,1]^d} |Φ(x)|.\n\nLemma 9 Let Φ be a function defined on [0, 1]^d with ∫_{[0,1]^d} |Φ(x)|dx ≤ r. If there exist an infinitely smooth function ˜Φ and 0 < α ≤ β such that α|˜Φ(x)| ≤ |Φ(x)| ≤ β|˜Φ(x)| for all x ∈ [0, 1]^d, then ‖Φ‖∞ = O(r · log^d(1/r)).\n\nWe briefly describe the ideas of the proofs of these two lemmas in the Appendix.
The formal proofs are given in the supplementary file.\n\nProof of Theorem 6 First, since we focus on binary classification, DIS(B(h∗, r)) can be written equivalently as\n\nDIS(B(h∗, r)) = {x ∈ X : ∃h ∈ B(h∗, r) s.t. h(x) ≠ h∗(x)}.\n\nConsider any h ∈ B(h∗, r). Let fh, fh∗ ∈ F^K_C be the corresponding classification boundaries of h and h∗ respectively. If r is sufficiently small, we must have\n\nρ(h, h∗) = Pr_{X∼DX}(h(X) ≠ h∗(X)) = ∫_{[0,1]^d} | ∫_{fh∗(x1,...,xd)}^{fh(x1,...,xd)} p(x1, ..., x_{d+1}) dx_{d+1} | dx1 ... dxd.\n\nDenote\n\nΦh(x1, ..., xd) = ∫_{fh∗(x1,...,xd)}^{fh(x1,...,xd)} p(x1, ..., x_{d+1}) dx_{d+1}.\n\nWe assert that there exist a Kth order smooth function ˜Φh(x1, ..., xd) and two constants 0 < u ≤ v such that u|˜Φh| ≤ |Φh| ≤ v|˜Φh|. To see this, remember that fh and fh∗ are Kth order smooth functions, and that the density p is upper and lower bounded by constants times a Kth order smooth function g(x1, ..., x_{d+1}); note that ˜Φh(x1, ..., xd) = ∫_{fh∗(x1,...,xd)}^{fh(x1,...,xd)} g(x1, ..., x_{d+1}) dx_{d+1} is a Kth order smooth function, which is easy to check by taking derivatives. By Lemma 8, we have ‖Φh‖∞ = O(r · (1/r)^{d/(K+d)}), since ∫|Φh| = ρ(h, h∗) ≤ r. Because this holds for all h ∈ B(h∗, r), we have sup_{h∈B(h∗,r)} ‖Φh‖∞ = O(r · (1/r)^{d/(K+d)}).\n\nNow consider the region of disagreement of B(h∗, r). Clearly DIS(B(h∗, r)) = ∪_{h∈B(h∗,r)} {x : h(x) ≠ h∗(x)}. Hence\n\nPr_{X∼DX}(X ∈ DIS(B(h∗, r))) = Pr_{X∼DX}(X ∈ ∪_{h∈B(h∗,r)} {x : h(x) ≠ h∗(x)}) ≤ 2 ∫_{[0,1]^d} sup_{h∈B(h∗,r)} ‖Φh‖∞ dx1 ... dxd = O(r · (1/r)^{d/(K+d)}).\n\nThe theorem follows from the definition of θ(ε). Theorem 7 can be proved similarly by using Lemma 9.\n\n3.3 Label Complexity\n\nIt was shown in [Han07] that the label complexity of A2 is\n\nO(θ² (ν²/ε² + 1) polylog(1/ε) ln(1/δ)),   (5)\n\nwhere ν = min_{h∈H} erD(h). When ε ≥ ν, our previous results on the disagreement coefficient already imply polynomial or exponential improvements for A2. However, when ε < ν, the label complexity becomes O(1/ε²), the same as passive learning, whatever θ is. In fact, without any assumption on the noise, the O(1/ε²) result is inevitable, due to the Ω(ν²/ε²) lower bound for agnostic active learning [Kaa06].\n\nRecently there has been considerable interest in how noise affects the learning rate. A remarkable notion is due to Tsybakov [Tsy04], and was first introduced for passive learning. Let η(x) = P(Y = 1|X = x). Tsybakov's noise condition assumes that for some c > 0 and 0 < α ≤ ∞,\n\nPr_{X∼DX}(|η(X) − 1/2| ≤ t) ≤ c t^α,   (6)\n\nfor all 0 < t ≤ t0, where t0 is some constant.
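As a toy numerical check of condition (6) (my own construction, not an example from the paper): take η(x) = 1/2 + (1/2) sign(x − 1/2)|x − 1/2|^β under the uniform distribution on [0, 1]. Then Pr(|η(X) − 1/2| ≤ t) = 2(2t)^{1/β}, so (6) holds with α = 1/β. A Monte Carlo estimate recovers the predicted t^{1/β} scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10 ** 6)
beta = 2.0                    # hypothetical smoothness of eta at the boundary
eta = 0.5 + 0.5 * np.sign(x - 0.5) * np.abs(x - 0.5) ** beta

# Empirical Pr(|eta - 1/2| <= t); theory predicts 2 * (2t)^(1/beta),
# i.e. alpha = 1/beta = 0.5, so shrinking t by 100x shrinks the mass by ~10x.
ps = {t: float(np.mean(np.abs(eta - 0.5) <= t)) for t in (1e-2, 1e-3, 1e-4)}
print(ps)
```

For beta = 2 the theoretical values are about 0.283, 0.089 and 0.028, and the simulated frequencies agree up to Monte Carlo error.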
(6) implies a connection between the pseudo-distance ρ(h, h∗) and the excess risk erD(h) − erD(h∗):\n\nρ(h, h∗) ≤ c′ (erD(h) − erD(h∗))^{1/κ},   (7)\n\nwhere h∗ is the Bayes classifier and c′ is some finite constant. Here κ = (1 + α)/α ≥ 1 is called the noise exponent. κ = 1 is the optimal case, where the problem has bounded noise; κ > 1 corresponds to unbounded noise.\n\nCastro and Nowak [CN07] noticed that Tsybakov's noise condition is also important in active learning. They proved label complexity bounds in terms of κ for the membership-query setting. A notable fact is that Õ((1/ε)^{(2κ−2)/κ}) (for κ > 1) is both an upper and a lower bound for membership query in the minimax sense. It is important to point out that the lower bound automatically applies to the pool-based model, since the pool-based model makes weaker assumptions than membership query. Hence for large κ, active learning gives very limited improvement over passive learning, whatever the other factors are.\n\nRecently, Hanneke [Han09] obtained a similar label complexity for the pool-based model. He showed that the number of labels requested by A2 is O(θ² ln(1/ε) ln(1/δ)) in the bounded noise case, i.e. κ = 1. Here we slightly generalize Hanneke's result to unbounded noise by introducing the following noise condition. We assume there exist c1, c2 > 0 and T0 > 0 such that\n\nPr_{X∼DX}(|η(X) − 1/2| ≤ 1/T) ≤ c1 e^{−c2 T},   (8)\n\nfor all T ≥ T0. It is not difficult to show that (8) implies\n\nρ(h, h∗) = O((er(h) − er(h∗)) ln(1/(er(h) − er(h∗)))).   (9)\n\nThis condition assumes unbounded noise.
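Condition (8) is satisfiable with genuinely unbounded noise. One toy construction (my own, not from the paper) is η(x) = 1/2 + sign(x − 1/2)/(2 ln(e + 1/|x − 1/2|)) under the uniform distribution on [0, 1]; here η → 1/2 at the boundary, yet Pr(|η(X) − 1/2| ≤ 1/T) = 2/(e^{T/2} − e), an exponential tail as (8) requires:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(10 ** 6)
d = np.abs(x - 0.5) + 1e-12   # tiny offset guards against division by zero
eta = 0.5 + np.sign(x - 0.5) / (2.0 * np.log(np.e + 1.0 / d))

# Pr(|eta - 1/2| <= 1/T) should decay like 2 / (e^(T/2) - e), i.e. an
# exponential tail with c2 = 1/2, even though the noise is unbounded.
tail = {T: float(np.mean(np.abs(eta - 0.5) <= 1.0 / T)) for T in (6, 10, 14)}
print(tail)
```

Each increase of T by 4 shrinks the empirical tail mass by roughly a factor of e² ≈ 7.4, as the c2 = 1/2 rate predicts.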
Under this noise condition, A2 has a better label complexity.\n\nTheorem 10 Assume that the learning problem satisfies the noise condition (8) and that DX has a density upper bounded by a constant M. For any hypothesis space H that has an O(n^{−1/2}) uniform convergence bound, if the Bayes classifier h∗ is in H, then with probability at least 1 − δ, A2 outputs ĥ ∈ H with erD(ĥ) ≤ erD(h∗) + ε, and the number of labels requested by the algorithm is at most O(θ²(ε) · ln(1/δ) · polylog(1/ε)).\n\nProof As in the proof in [Han07], one can show that with probability 1 − δ we never remove h∗ from Vi, and that for any h, h′ ∈ Vi we must have Δ(Vi)(er_i(h) − er_i(h′)) = erD(h) − erD(h′), where er_i(h) is the error rate of h conditioned on DIS(Vi). These guarantee erD(ĥ) ≤ erD(h∗) + ε. If Δ(Vi) ≤ 2εθ(ε), then due to the O(n^{−1/2}) uniform convergence bound, O(θ²(ε) ln(1/δ)) labels suffice to make Δ(Vi)(UB(Si, h, δ′) − LB(Si, h, δ′)) ≤ ε for all h ∈ Vi, and the algorithm stops. Hence we next consider Δ(Vi) > 2εθ(ε). Note that the event Δ(Vi) < (1/2)Δ(Vjk) occurs at most O(ln(1/ε)) times. So below we bound the number of labels needed to make Δ(Vi) < (1/2)Δ(Vjk) occur. By the definition of θ(ε), if ρ(h, h∗) ≤ Δ(Vjk)/(2θ(ε)) for all h ∈ Vi, then Δ(Vi) < (1/2)Δ(Vjk). Let γ(h) = erD(h) − erD(h∗). By the noise assumption (9) we have that if\n\nγ(h) ln(1/γ(h)) ≤ c Δ(Vjk)/(2θ(ε)),   (10)\n\nthen Δ(Vi) < (1/2)Δ(Vjk). Here and below, c is an appropriate constant that may differ from line to line. Note that (10) holds if γ(h) ≤ c Δ(Vjk)/(θ(ε) ln(θ(ε)/Δ(Vjk))), and in turn if γ(h) ≤ c Δ(Vjk)/(θ(ε) ln(1/ε)), since Δ(Vjk) ≥ Δ(Vi) > 2εθ(ε). But to achieve the last inequality, the algorithm only needs to label O(θ²(ε) ln²(1/ε) ln(1/δ)) instances from DIS(Vi). So the total number of labels requested by A2 is O(θ²(ε) ln³(1/ε) ln(1/δ)).\n\nNow we give our main label complexity bounds for agnostic active learning.\n\nTheorem 11 Let the instance space be [0, 1]^{d+1} and the hypothesis space be H^K_C, where K > d. Assume that the Bayes classifier h∗ of the learning problem is in H^K_C; that the noise condition (8) holds; and that DX has a density bounded by a Kth order smooth function as in Theorem 6. Then the A2 algorithm outputs ĥ with error rate erD(ĥ) ≤ erD(h∗) + ε, and the number of labels requested is at most Õ((1/ε)^{2d/(K+d)} ln(1/δ)), where Õ hides the polylog(1/ε) term.\n\nProof That the density of DX is upper bounded by a smooth function implies that it is also upper bounded by a constant M. Combining Theorems 5, 6 and 10, the theorem follows.\n\nCombining Theorems 5, 7 and 10 we can show the following theorem.\n\nTheorem 12 Let the instance space be [0, 1]^{d+1} and the hypothesis space be H^∞_C. Assume that the Bayes classifier h∗ of the learning problem is in H^∞_C; that the noise condition (8) holds; and that DX has a density bounded by an infinitely smooth function as in Theorem 7.
Then the A2 algorithm outputs ĥ with error rate erD(ĥ) ≤ erD(h∗) + ε, and the number of labels requested is at most O(polylog(1/ε) ln(1/δ)).\n\n4 Conclusion\n\nWe show that if the Bayes classification boundary is smooth and the distribution is bounded by a smooth function, then under some noise condition active learning achieves a polynomial or exponential improvement in label complexity over passive supervised learning, according to whether the smoothness is of finite or infinite order.\n\nAlthough we assume that the classification boundary is the graph of a function, our results can be generalized to the case where the boundary consists of a finite number of functions. To be precise, consider N functions f1(x) ≤ ··· ≤ fN(x) for all x ∈ [0, 1]^d. Let f0(x) ≡ 0 and f_{N+1}(x) ≡ 1. The positive (or negative) set defined by these functions is {(x, x_{d+1}) : f_{2i}(x) ≤ x_{d+1} ≤ f_{2i+1}(x), i = 0, 1, ..., N/2}. Our theorems still hold in this case. In addition, by techniques in [Dud99] (page 259), our results may generalize to problems with intrinsically smooth boundaries (not only graphs of functions).\n\nAppendix\n\nIn this appendix we describe very briefly the ideas needed to prove Lemma 8 and Lemma 9. The formal proofs can be found in the supplementary file.\n\nIdeas to Prove Lemma 8 First consider the d = 1 case. Note that if f ∈ F^K_C, then |f^{(K−1)}(x) − f^{(K−1)}(x′)| ≤ C|x − x′| for all x, x′ ∈ [0, 1]. It is not difficult to see that we only need to show that for any f satisfying |f^{(K−1)}(x) − f^{(K−1)}(x′)| ≤ C|x − x′|, if ∫_0^1 |f(x)|dx = r, then ‖f‖∞ = O(r^{K/(K+1)}). To show this, note that in order for ‖f‖∞ to achieve its maximum while ∫|f| = r, the derivatives of f must be as large as possible.
Indeed, it can be shown that (one of) the optimal f is of the form\n\nf(x) = (C/K!)|x − ξ|^K for 0 ≤ x ≤ ξ, and f(x) = 0 for ξ < x ≤ 1.   (11)\n\nThat is, |f^{(K−1)}(x) − f^{(K−1)}(x′)| = C|x − x′| for all x, x′ ∈ [0, ξ] (i.e. the (K − 1)th order derivative attains the upper bound of the Lipschitz constant), where ξ is determined by ∫_0^1 f(x)dx = r. It is then easy to check that ‖f‖∞ = O(r^{K/(K+1)}).\n\nFor the general d > 1 case, we relax the constraint. Note that if all (K − 1)th order partial derivatives are Lipschitz, then all (K − 1)th order directional derivatives are Lipschitz too. Under the latter constraint, (one of) the optimal f has the form\n\nf(x) = (C/K!)|‖x‖ − ξ|^K for 0 ≤ ‖x‖ ≤ ξ, and f(x) = 0 for ξ < ‖x‖,\n\nwhere ξ is determined by ∫_{[0,1]^d} |f(x)|dx = r. This implies ‖f‖∞ = O(r^{K/(K+d)}).\n\nIdeas to Prove Lemma 9 Similarly to the proof of Lemma 8, we only need to show that for any f ∈ F^∞_C, if ∫_{[0,1]^d} |f(x)|dx = r, then ‖f‖∞ = O(r · log^d(1/r)).\n\nSince f is infinitely smooth, we can choose K large and depending on r. For the d = 1 case, let K + 1 = log(1/r)/log log(1/r). We know that the optimal f is of the form of Eq. (11). (This choice of K is approximately the largest K such that Eq. (11) is still the optimal form; if K were larger, ξ would fall outside [0, 1].) Since ∫_0^1 |f(x)|dx = r, we have ξ^{K+1} = ((K+1)!/C) r. Now, ‖f‖∞ = (C/K!) ξ^K. Note that (1/r)^{1/(K+1)} = (1/r)^{log log(1/r)/log(1/r)} = log(1/r). By Stirling's formula we can then show ‖f‖∞ = O(r · log(1/r)).\n\nFor the d > 1 case, let K + d = log(1/r)/log log(1/r). By similar arguments we can show ‖f‖∞ = O(r · log^d(1/r)).\n\nAcknowledgement\n\nThis work was supported by NSFC (60775005).\n\nReferences\n\n[Ang88] D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1988.\n[BAL06] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In 23rd International Conference on Machine Learning, 2006.\n[CN07] R. Castro and R. Nowak. Minimax bounds for active learning. In 20th Annual Conference on Learning Theory, 2007.\n[Das05] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, 2005.\n[DHM07] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, 2007.\n[Dud99] R.M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999.\n[Fri09] E. Friedman. Active learning for smooth problems. In 22nd Annual Conference on Learning Theory, 2009.\n[GKW03] E. Giné, V.I. Koltchinskii, and J. Wellner. Ratio limit theorems for empirical processes. Stochastic Inequalities and Applications, 56:249-278, 2003.\n[Han07] S. Hanneke. A bound on the label complexity of agnostic active learning. In 24th International Conference on Machine Learning, 2007.\n[Han09] S. Hanneke. Adaptive rates of convergence in active learning. In 22nd Annual Conference on Learning Theory, 2009.\n[Kaa06] M. Kääriäinen. Active learning in the non-realizable case. In 17th International Conference on Algorithmic Learning Theory, 2006.\n[Tsy04] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32:135-166, 2004.\n[vdG00] S. van de Geer. Applications of Empirical Process Theory. Cambridge University Press, 2000.\n[vdVW96] A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes with Applications to Statistics. Springer Verlag, 1996.\n", "award": [], "sourceid": 209, "authors": [{"given_name": "Liwei", "family_name": "Wang", "institution": null}]}