{"title": "Multi-View Active Learning in the Non-Realizable Case", "book": "Advances in Neural Information Processing Systems", "page_first": 2388, "page_last": 2396, "abstract": "The sample complexity of active learning under the realizability assumption has been well-studied. The realizability assumption, however, rarely holds in practice. In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under multi-view setting. We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be $\\widetilde{O}(\\log \\frac{1}{\\epsilon})$, contrasting to single-view setting where the polynomial improvement is the best possible achievement. We also prove that in general multi-view setting the sample complexity of active learning with unbounded Tsybakov noise is $\\widetilde{O}(\\frac{1}{\\epsilon})$, where the order of $1/\\epsilon$ is independent of the parameter in Tsybakov noise, contrasting to previous polynomial bounds where the order of $1/\\epsilon$ is related to the parameter in Tsybakov noise.", "full_text": "Multi-View Active Learning in\n\nthe Non-Realizable Case\n\nWei Wang and Zhi-Hua Zhou\n\nNational Key Laboratory for Novel Software Technology\n\nNanjing University, Nanjing 210093, China\n{wangw,zhouzh}@lamda.nju.edu.cn\n\nAbstract\n\nThe sample complexity of active learning under the realizability assumption has\nbeen well-studied. The realizability assumption, however, rarely holds in prac-\ntice. In this paper, we theoretically characterize the sample complexity of active\nlearning in the non-realizable case under multi-view setting. We prove that, with\nunbounded Tsybakov noise, the sample complexity of multi-view active learning\n\u01eb ), contrasting to single-view setting where the polynomial improve-\nment is the best possible achievement. 
We also prove that in general multi-view\nsetting the sample complexity of active learning with unbounded Tsybakov noise\n\u01eb ), where the order of 1/\u01eb is independent of the parameter in Tsybakov noise,\ncontrasting to previous polynomial bounds where the order of 1/\u01eb is related to the\nparameter in Tsybakov noise.\n\ncan be eO(log 1\nis eO( 1\n\n1\n\nIntroduction\n\nIn active learning [10, 13, 16], the learner draws unlabeled data from the unknown distribution\nde\ufb01ned on the learning task and actively queries some labels from an oracle. In this way, the active\nlearner can achieve good performance with much fewer labels than passive learning. The number\nof these queried labels, which is necessary and suf\ufb01cient for obtaining a good leaner, is well-known\nas the sample complexity of active learning.\n\nMany theoretical bounds on the sample complexity of active learning have been derived based on the\nrealizability assumption (i.e., there exists a hypothesis perfectly separating the data in the hypothesis\nclass) [4, 5, 11, 12, 14, 16]. The realizability assumption, however, rarely holds in practice. Recently,\nthe sample complexity of active learning in the non-realizable case (i.e., the data cannot be perfectly\nseparated by any hypothesis in the hypothesis class because of the noise) has been studied [2, 13, 17].\nIt is worth noting that these bounds obtained in the non-realizable case match the lower bound \u2126( \u03b72\n\u01eb2 )\n[19], in the same order as the upper bound O( 1\n\u01eb2 ) of passive learning (\u03b7 denotes the generalization\nerror rate of the optimal classi\ufb01er in the hypothesis class and \u01eb bounds how close to the optimal\nclassi\ufb01er in the hypothesis class the active learner has to get). This suggests that perhaps active\nlearning in the non-realizable case is not as ef\ufb01cient as that in the realizable case. 
To improve the sample complexity of active learning in the non-realizable case remarkably, a model of the noise, or some assumptions on the hypothesis class and the data distribution, must be considered. The Tsybakov noise model [21] has become increasingly popular in theoretical analyses of the sample complexity of active learning. However, an existing result [8] shows that obtaining exponential improvement in the sample complexity of active learning with unbounded Tsybakov noise is hard.

Inspired by [23], which proved that the multi-view setting [6] can remarkably improve the sample complexity of active learning in the realizable case, we have the insight that the multi-view setting will also help active learning in the non-realizable case. In this paper, we present the first analysis of the sample complexity of active learning in the non-realizable case under the multi-view setting, where the non-realizability is caused by Tsybakov noise. Specifically:

- We define $\alpha$-expansion, which extends the definitions in [3] and [23] to the non-realizable case, and $\beta$-condition for the multi-view setting.

- We prove that the sample complexity of active learning with Tsybakov noise under the multi-view setting can be improved to $\widetilde{O}(\log \frac{1}{\epsilon})$ when the learner satisfies the non-degradation condition.¹ This exponential improvement holds no matter whether Tsybakov noise is bounded or not, contrasting to the single-view setting where polynomial improvement is the best possible achievement for active learning with unbounded Tsybakov noise.

- We also prove that, when the non-degradation condition does not hold, the sample complexity of active learning with unbounded Tsybakov noise under the multi-view setting is $\widetilde{O}(\frac{1}{\epsilon})$, where the order of $1/\epsilon$ is independent of the parameter in Tsybakov noise, i.e., the sample complexity is always $\widetilde{O}(\frac{1}{\epsilon})$ no matter how large the unbounded Tsybakov noise is.
While in previous polynomial bounds, the order of $1/\epsilon$ is related to the parameter in Tsybakov noise and is larger than 1 when the unbounded Tsybakov noise is larger than some degree (see Section 2). This discloses that, when the non-degradation condition does not hold, the multi-view setting is still able to lead to a faster convergence rate, and our polynomial improvement in the sample complexity is better than previous polynomial bounds when the unbounded Tsybakov noise is large.

The rest of this paper is organized as follows. After introducing related work in Section 2 and preliminaries in Section 3, we define $\alpha$-expansion in the non-realizable case in Section 4. We analyze the sample complexity of active learning with Tsybakov noise under the multi-view setting with and without the non-degradation condition in Section 5 and Section 6, respectively. Finally, we conclude the paper in Section 7.

2 Related Work

Generally, the non-realizability of a learning task is caused by the presence of noise. For learning tasks with arbitrary forms of noise, Balcan et al. [2] proposed the agnostic active learning algorithm $A^2$ and proved that its sample complexity is $\widehat{O}(\frac{\eta^2}{\epsilon^2})$.² Hoping to get a tighter bound on the sample complexity of the algorithm $A^2$, Hanneke [17] defined the disagreement coefficient $\theta$, which depends on the hypothesis class and the data distribution, and proved that the sample complexity of the algorithm $A^2$ is $\widehat{O}(\theta^2 \frac{\eta^2}{\epsilon^2})$. Later, Dasgupta et al. [13] developed a general agnostic active learning algorithm which extends the scheme in [10] and proved that its sample complexity is $\widehat{O}(\theta \frac{\eta^2}{\epsilon^2})$.

Recently, the popular Tsybakov noise model [21] was considered in theoretical analyses of active learning, and some bounds on the sample complexity have been obtained.
For some simple cases, where Tsybakov noise is bounded, it has been proved that exponential improvement in the sample complexity is possible [4, 7, 18]. As for the situation where Tsybakov noise is unbounded, only polynomial improvement in the sample complexity has been obtained. Balcan et al. [4] assumed that the samples are drawn uniformly from the unit ball in $\mathbb{R}^d$ and proved that the sample complexity of active learning with unbounded Tsybakov noise is $O(\epsilon^{-\frac{2}{1+\lambda}})$ ($\lambda > 0$ depends on Tsybakov noise). This uniform distribution assumption, however, rarely holds in practice. Castro and Nowak [8] showed that the sample complexity of active learning with unbounded Tsybakov noise is $\widehat{O}(\epsilon^{-\frac{2\mu\omega+d-2\omega-1}{\mu\omega}})$ ($\mu > 1$ depends on another form of Tsybakov noise, $\omega \ge 1$ depends on the Hölder smoothness, and $d$ is the dimension of the data). This result is also based on the strong uniform distribution assumption. Cavallanti et al. [9] assumed that the labels of examples are generated according to a simple linear noise model and indicated that the sample complexity of active learning with unbounded Tsybakov noise is $O(\epsilon^{-\frac{2(3+\lambda)}{(1+\lambda)(2+\lambda)}})$. Hanneke [18] proved that the algorithms in [2] and [13], or variants thereof, can achieve the polynomial sample complexity $\widehat{O}(\epsilon^{-\frac{2}{1+\lambda}})$ for active learning with unbounded Tsybakov noise. For active learning with unbounded Tsybakov noise, Castro and Nowak [8] also proved that at least $\Omega(\epsilon^{-\rho})$ labels are requested to learn an $\epsilon$-approximation of the optimal classifier ($\rho \in (0, 2)$ depends on Tsybakov noise).

¹ The $\widetilde{O}$ notation is used to hide the factor $\log\log(\frac{1}{\epsilon})$.
² The $\widehat{O}$ notation is used to hide the factor $\mathrm{polylog}(\frac{1}{\epsilon})$.
This result shows that polynomial improvement is the best possible achievement for active learning with unbounded Tsybakov noise in the single-view setting. Wang [22] introduced a smoothness assumption to active learning with approximate Tsybakov noise and proved that if the classification boundary and the underlying distribution are smooth to the $\xi$-th order and $\xi > d$, the sample complexity of active learning is $\widehat{O}(\epsilon^{-\frac{2d}{\xi+d}})$; if the boundary and the distribution are infinitely smooth, the sample complexity of active learning is $O(\mathrm{polylog}(\frac{1}{\epsilon}))$. Nevertheless, this result is for approximate Tsybakov noise, and the assumption of a large (or infinite) smoothness order rarely holds for data with high dimension $d$ in practice.

3 Preliminaries

In the multi-view setting, the instances are described with several different disjoint sets of features. For the sake of simplicity, we only consider the two-view setting in this paper. Suppose that $X = X_1 \times X_2$ is the instance space, $X_1$ and $X_2$ are the two views, $Y = \{0, 1\}$ is the label space, and $D$ is the distribution over $X \times Y$. Suppose that $c = (c_1, c_2)$ is the optimal Bayes classifier, where $c_1$ and $c_2$ are the optimal Bayes classifiers in the two views, respectively. Let $H_1$ and $H_2$ be the hypothesis classes in the two views and suppose that $c_1 \in H_1$ and $c_2 \in H_2$. For any instance $x = (x_1, x_2)$, the hypothesis $h_v \in H_v$ ($v = 1, 2$) predicts $h_v(x_v) = 1$ if $x_v \in S_v$ and $h_v(x_v) = 0$ otherwise, where $S_v$ is a subset of $X_v$. In this way, any hypothesis $h_v \in H_v$ corresponds to a subset $S_v$ of $X_v$ (as for how to combine the hypotheses in the two views, see Section 5). Considering that $x_1$ and $x_2$ denote the same instance $x$ in different views, we overload $S_v$ to denote the instance set $\{x = (x_1, x_2) : x_v \in S_v\}$ without confusion. Let $S_v^*$ correspond to the optimal Bayes classifier $c_v$.
It is well-known [15] that $S_v^* = \{x_v : \varphi_v(x_v) \ge \frac{1}{2}\}$, where $\varphi_v(x_v) = P(y = 1 \mid x_v)$. Here, we also overload $S_v^*$ to denote the instance set $\{x = (x_1, x_2) : x_v \in S_v^*\}$. The error rate of a hypothesis $S_v$ under the distribution $D$ is $R(h_v) = R(S_v) = Pr_{(x_1, x_2, y) \in D}(y \ne I(x_v \in S_v))$. In general, $R(S_v^*) \ne 0$, and the excess error of $S_v$ can be denoted as follows, where $S_v \Delta S_v^* = (S_v - S_v^*) \cup (S_v^* - S_v)$ and $d(S_v, S_v^*)$ is a pseudo-distance between the sets $S_v$ and $S_v^*$:

$R(S_v) - R(S_v^*) = \int_{S_v \Delta S_v^*} |2\varphi_v(x_v) - 1| \, p_{x_v} \, dx_v \triangleq d(S_v, S_v^*)$   (1)

Let $\eta_v$ denote the error rate of the optimal Bayes classifier $c_v$, which is also called the noise rate in the non-realizable case. In general, $\eta_v$ is less than $\frac{1}{2}$. In order to model the noise, we assume that the data distribution and the Bayes decision boundary in each view satisfy the popular Tsybakov noise condition [21] that $Pr_{x_v \in X_v}(|\varphi_v(x_v) - 1/2| \le t) \le C_0 t^\lambda$ for some finite $C_0 > 0$, $\lambda > 0$ and all $0 < t \le 1/2$, where $\lambda = \infty$ corresponds to the best learning situation and the noise is called bounded [8], while $\lambda = 0$ corresponds to the worst situation. When $\lambda < \infty$, the noise is called unbounded [8]. According to Proposition 1 in [21], it is easy to know that (2) holds:

$d(S_v, S_v^*) \ge C_1 \, d_\Delta^k(S_v, S_v^*)$   (2)

Here $k = \frac{1+\lambda}{\lambda}$, $C_1 = 2C_0^{-1/\lambda}\lambda(\lambda+1)^{-1-1/\lambda}$, $d_\Delta(S_v, S_v^*) = Pr(S_v - S_v^*) + Pr(S_v^* - S_v)$ is also a pseudo-distance between the sets $S_v$ and $S_v^*$, and $d(S_v, S_v^*) \le d_\Delta(S_v, S_v^*) \le 1$.
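As a quick illustration of the constants in (2), $k$ and $C_1$ can be computed directly from $\lambda$ and $C_0$; the numeric values below are arbitrary illustrative choices, not values from the paper.

```python
# Numeric sketch of the Tsybakov-noise constants in (2):
# k = (1 + lam) / lam and C1 = 2 * C0**(-1/lam) * lam * (lam + 1)**(-1 - 1/lam).
# C0 and the lambda values below are arbitrary illustrative choices.

def tsybakov_constants(lam, C0):
    k = (1 + lam) / lam
    C1 = 2 * C0 ** (-1 / lam) * lam * (lam + 1) ** (-1 - 1 / lam)
    return k, C1

for lam in (0.5, 1.0, 4.0, 100.0):
    k, C1 = tsybakov_constants(lam, C0=1.0)
    print(f"lambda={lam:6.1f}  k={k:.3f}  C1={C1:.3f}")
# As lambda grows (noise closer to bounded), k approaches 1, so the lower
# bound C1 * d_Delta^k in (2) tightens toward a linear relation in d_Delta.
```
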
We will use the following lemma [1], which gives the standard sample complexity for non-realizable learning tasks.

Lemma 1 Suppose that $H$ is a set of functions from $X$ to $Y = \{0, 1\}$ with finite VC-dimension $V \ge 1$ and $D$ is the fixed but unknown distribution over $X \times Y$. For any $\epsilon, \delta > 0$, there is a positive constant $C$ such that if the size of the sample $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ from $D$ is $N(\epsilon, \delta) = \frac{C}{\epsilon^2}(V + \log(\frac{1}{\delta}))$, then with probability at least $1 - \delta$, for all $h \in H$ the following holds:

$\Big|\frac{1}{N} \sum_{i=1}^{N} I(h(x_i) \ne y_i) - E_{(x,y) \in D}\, I(h(x) \ne y)\Big| \le \epsilon$

4 $\alpha$-Expansion in the Non-realizable Case

Multi-view active learning, first described in [20], focuses on the contention points (i.e., unlabeled instances on which the two views predict different labels) and queries the labels of some of them. It is motivated by the observation that querying the labels of contention points may help at least one of the two views to learn the optimal classifier.
Let $S_1 \oplus S_2 = (S_1 - S_2) \cup (S_2 - S_1)$ denote the contention points between $S_1$ and $S_2$; then $Pr(S_1 \oplus S_2)$ denotes the probability mass on the contention points. "$\Delta$" and "$\oplus$" mean the same operation rule: in this paper, we use "$\Delta$" when referring to the excess error between $S_v$ and $S_v^*$, and use "$\oplus$" when referring to the difference between the two views $S_1$ and $S_2$.

Table 1: Multi-view active learning with the non-degradation condition

Input: Unlabeled data set $U = \{x^1, x^2, \cdots\}$ where each example $x^j$ is given as a pair $(x^j_1, x^j_2)$
Process:
  Query the labels of $m_0$ instances drawn randomly from $U$ to compose the labeled data set $L$
  iterate: $i = 0, 1, \cdots, s$
    Train the classifier $h^i_v$ ($v = 1, 2$) by minimizing the empirical risk with $L$ in each view:
      $h^i_v = \arg\min_{h \in H_v} \sum_{(x_1, x_2, y) \in L} I(h(x_v) \ne y)$;
    Apply $h^i_1$ and $h^i_2$ to the unlabeled data set $U$ and find out the contention point set $Q_i$;
    Query the labels of $m_{i+1}$ instances drawn randomly from $Q_i$, then add them into $L$ and delete them from $U$.
  end iterate
Output: $h^s_+$ and $h^s_-$

In order to study multi-view active learning, the properties of contention points should be considered. One basic property is that $Pr(S_1 \oplus S_2)$ should not be too small; otherwise the two views could be exactly the same and the two-view setting would degenerate into the single-view setting.

In multi-view learning, the two views represent the same learning task and generally are consistent with each other, i.e., for any instance $x = (x_1, x_2)$ the labels of $x$ in the two views are the same. Hence we first assume that $S_1^* = S_2^* = S^*$; the situation where $S_1^* \ne S_2^*$ is discussed further in Section 5.2.
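The procedure of Table 1 can be sketched in code as follows. This is a minimal sketch only: the empirical-risk minimizer `erm`, the label oracle `query_label`, and the budget schedule `m` are placeholder functions supplied by the caller, not fixed by the paper.

```python
# Minimal runnable sketch of the procedure in Table 1. Placeholders:
# query_label(x) -> 0/1 is the oracle; erm(L, v) returns a classifier for
# view v trained on labeled set L; m is the label-budget schedule m_0..m_{s+1}.

import random

def multiview_active_learn(U, query_label, erm, m, s, rng=random):
    U = list(U)
    L = []
    for x in rng.sample(U, min(m[0], len(U))):   # initial random queries
        L.append((x, query_label(x)))
        U.remove(x)
    h1 = h2 = None
    for i in range(s + 1):
        h1, h2 = erm(L, 0), erm(L, 1)            # one ERM classifier per view
        Q = [x for x in U if h1(x[0]) != h2(x[1])]   # contention point set Q_i
        for x in rng.sample(Q, min(m[i + 1], len(Q))):
            L.append((x, query_label(x)))        # query labels of contention points
            U.remove(x)
    # combined outputs h_+ and h_- as in (4)
    h_plus = lambda x: 1 if h1(x[0]) == h2(x[1]) == 1 else 0
    h_minus = lambda x: 0 if h1(x[0]) == h2(x[1]) == 0 else 1
    return h_plus, h_minus
```
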
The instances agreed upon by the two views can be denoted as $(S_1 \cap S_2) \cup (\overline{S_1} \cap \overline{S_2})$. However, some of these agreed instances may be predicted a different label by the optimal classifier $S^*$, i.e., the instances in $(S_1 \cap S_2 - S^*) \cup (\overline{S_1} \cap \overline{S_2} - \overline{S^*})$. Intuitively, if the contention points can convey some information about $(S_1 \cap S_2 - S^*) \cup (\overline{S_1} \cap \overline{S_2} - \overline{S^*})$, then querying the labels of contention points could help to improve $S_1$ and $S_2$. Based on this intuition, and on the requirement that $Pr(S_1 \oplus S_2)$ should not be too small, we give our definition of $\alpha$-expansion in the non-realizable case.

Definition 1 $D$ is $\alpha$-expanding if for some $\alpha > 0$ and any $S_1 \subseteq X_1$, $S_2 \subseteq X_2$, (3) holds:

$Pr(S_1 \oplus S_2) \ge \alpha\big(Pr(S_1 \cap S_2 - S^*) + Pr(\overline{S_1} \cap \overline{S_2} - \overline{S^*})\big)$   (3)

We say that $D$ is $\alpha$-expanding with respect to hypothesis class $H_1 \times H_2$ if the above holds for all $S_1 \in H_1 \cap X_1$, $S_2 \in H_2 \cap X_2$ (here we denote by $H_v \cap X_v$ the set $\{h \cap X_v : h \in H_v\}$ for $v = 1, 2$).

Balcan et al. [3] also gave a definition of expansion, $Pr(T_1 \oplus T_2) \ge \alpha \min[Pr(T_1 \cap T_2), Pr(\overline{T_1} \cap \overline{T_2})]$, for realizable learning tasks, under the assumptions that the learner in each view is never "confident but wrong" and that the learning algorithm is able to learn from positive data only. Here $T_v$ denotes the instances which are confidently classified as positive in each view. Generally, in realizable learning tasks, we aim at studying the asymptotic performance and assume that the performance of the initial classifier is better than random guessing, i.e., $Pr(T_v) > 1/2$. This ensures that $Pr(T_1 \cap T_2)$ is larger than $Pr(\overline{T_1} \cap \overline{T_2})$.
In addition, in [3] the instances which are agreed upon by the two views but are predicted a different label by the optimal classifier can be denoted as $\overline{T_1} \cap \overline{T_2}$. So, it can be found that Definition 1 and the definition of expansion in [3] are based on the same intuition: the amount of contention points is no less than a fraction of the amount of instances which are agreed upon by the two views but are predicted a different label by the optimal classifier.

5 Multi-view Active Learning with Non-degradation Condition

In this section, we first consider the multi-view active learning in Table 1 and analyze whether the multi-view setting can remarkably improve the sample complexity of active learning in the non-realizable case. In the multi-view setting, the classifiers are often combined to make predictions, and many strategies can be used to combine them. In this paper, we consider the following two combination schemes, $h_+$ and $h_-$, for binary classification:

$h^i_+(x) = \begin{cases} 1 & \text{if } h^i_1(x_1) = h^i_2(x_2) = 1 \\ 0 & \text{otherwise} \end{cases}$   $h^i_-(x) = \begin{cases} 0 & \text{if } h^i_1(x_1) = h^i_2(x_2) = 0 \\ 1 & \text{otherwise} \end{cases}$   (4)

5.1 The Situation Where $S_1^* = S_2^*$

With (4), the error rates of the combined classifiers $h^i_+$ and $h^i_-$ satisfy (5) and (6), respectively:

$R(h^i_+) - R(S^*) = R(S^i_1 \cap S^i_2) - R(S^*) \le d_\Delta(S^i_1 \cap S^i_2, S^*)$   (5)

$R(h^i_-) - R(S^*) = R(S^i_1 \cup S^i_2) - R(S^*) \le d_\Delta(S^i_1 \cup S^i_2, S^*)$   (6)

Here $S^i_v \subset X_v$ ($v = 1, 2$) corresponds to the classifier $h^i_v \in H_v$ in the $i$-th round. In each round of multi-view active learning, the labels of some contention points are queried to augment the training data set $L$, and the classifier in each view is then refined.
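As a sanity check of (5), both sides can be evaluated exactly on a tiny finite domain; the domain, probabilities, and sets below are an arbitrary toy example, not from the paper.

```python
# Toy check of (5): the excess error of the intersection classifier S1 ∩ S2
# is at most d_Delta(S1 ∩ S2, S*) = Pr(S∩ - S*) + Pr(S* - S∩).
# The domain, marginal p, posterior phi, and the sets are arbitrary choices.

domain = [0, 1, 2, 3]
p = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}        # uniform marginal over instances
phi = {0: 0.1, 1: 0.4, 2: 0.6, 3: 0.9}          # phi(x) = P(y = 1 | x)

S_star = {x for x in domain if phi[x] >= 0.5}   # optimal Bayes set {2, 3}
S1, S2 = {1, 2, 3}, {2}                          # two hypothetical view sets

def risk(S):
    # R(S) = sum_x p(x) * P(prediction != y | x)
    return sum(p[x] * ((1 - phi[x]) if x in S else phi[x]) for x in domain)

S_cap = S1 & S2
excess = risk(S_cap) - risk(S_star)
d_delta = sum(p[x] for x in S_cap - S_star) + sum(p[x] for x in S_star - S_cap)
print(excess, d_delta)
assert excess <= d_delta + 1e-12                 # inequality (5) holds
```

The same computation with `S1 | S2` in place of `S1 & S2` checks (6).
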
As discussed in [23], we also assume that the learner in Table 1 satisfies the non-degradation condition as the amount of labeled training examples increases, i.e., (7) holds, which implies that the excess error of $S^{i+1}_v$ is no larger than that of $S^i_v$ in the region of $S^i_1 \oplus S^i_2$:

$Pr(S^{i+1}_v \Delta S^* \mid S^i_1 \oplus S^i_2) \le Pr(S^i_v \Delta S^* \mid S^i_1 \oplus S^i_2)$   (7)

To illustrate the non-degradation condition, we give the following example. Suppose the data in $X_v$ ($v = 1, 2$) fall into $n$ different clusters, denoted by $\pi^v_1, \ldots, \pi^v_n$, and every cluster has the same probability mass for simplicity. The positive class is the union of some clusters, while the negative class is the union of the others. Each positive (negative) cluster $\pi^v_\xi$ in $X_v$ is associated with only 3 positive (negative) clusters $\pi^{3-v}_\varsigma$ ($\xi, \varsigma \in \{1, \ldots, n\}$) in $X_{3-v}$ (i.e., given an instance $x_v$ in $\pi^v_\xi$, $x_{3-v}$ will only be in one of these $\pi^{3-v}_\varsigma$). Suppose the learning algorithm predicts all instances in a cluster with the same label, i.e., the hypothesis class $H_v$ consists of the hypotheses which do not split any cluster. Thus, the cluster $\pi^v_\xi$ can be classified according to the posterior probability $P(y = 1 \mid \pi^v_\xi)$, and querying the labels of instances in cluster $\pi^v_\xi$ will not influence the estimation of the posterior probability for cluster $\pi^v_\varsigma$ ($\varsigma \ne \xi$). It is evident that the non-degradation condition holds in this task. Note that the non-degradation assumption may not always hold; we discuss this in Section 6.
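The mechanism behind this example fits in a few lines: a cluster-respecting learner estimates the posterior of each cluster separately, so queries in one cluster leave the estimates for all other clusters untouched. The cluster ids and labels below are arbitrary illustrative data.

```python
# Sketch of the cluster example: per-cluster posterior estimates are
# independent, so new labels from cluster 0 cannot degrade cluster 1's
# estimate -- the non-degradation condition (7) holds for such learners.
# The labeled pairs below are arbitrary illustrative data.

from collections import defaultdict

def posterior_estimates(labeled):
    """labeled: list of (cluster_id, y) pairs -> dict cluster_id -> P_hat(y=1)."""
    counts = defaultdict(lambda: [0, 0])   # cluster -> [num_positive, num_total]
    for cluster, y in labeled:
        counts[cluster][0] += y
        counts[cluster][1] += 1
    return {c: pos / tot for c, (pos, tot) in counts.items()}

L = [(0, 1), (0, 1), (1, 0)]
before = posterior_estimates(L)
L += [(0, 0), (0, 1)]                      # new queries, all from cluster 0
after = posterior_estimates(L)
assert after[1] == before[1]               # cluster 1's estimate is untouched
```
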
Now we give Theorem 1.

Theorem 1 For data distribution $D$ $\alpha$-expanding with respect to hypothesis class $H_1 \times H_2$ according to Definition 1, when the non-degradation condition holds, if $s = \lceil \frac{2\log\frac{1}{8\epsilon}}{\log\frac{1}{C_2}} \rceil$ and $m_i = \frac{256kC_1}{C_2}\big(V + \log(\frac{16(s+1)}{\delta})\big)$, the multi-view active learning in Table 1 will generate two classifiers $h^s_+$ and $h^s_-$, at least one of which has error rate no larger than $R(S^*) + \epsilon$ with probability at least $1 - \delta$. Here, $V = \max[VC(H_1), VC(H_2)]$, where $VC(H)$ denotes the VC-dimension of the hypothesis class $H$, $k = \frac{1+\lambda}{\lambda}$, $C_1 = 2C_0^{-1/\lambda}\lambda(\lambda+1)^{-1-1/\lambda}$, and $C_2 = \frac{5\alpha+8}{6\alpha+8}$.

Proof sketch. Let $Q_i = S^i_1 \oplus S^i_2$. First, with Lemma 1 and (2), we have $d_\Delta(S^{i+1}_v|_{Q_i}, S^*|_{Q_i}) \le \frac{1}{8}$. Let $T^{i+1}_v = S^{i+1}_v \cap Q_i$ and $\tau_{i+1} = Pr(T^{i+1}_1 \oplus T^{i+1}_2)$. Considering (7) and the decomposition into $Pr(S^i_1 \cap S^i_2 - S^*)$ and $Pr(\overline{S^i_1} \cap \overline{S^i_2} - \overline{S^*})$, we then calculate that

$d_\Delta(S^{i+1}_1 \cap S^{i+1}_2, S^*) \le Pr(S^i_1 \cap S^i_2 - S^*) + \frac{1}{8}Pr(S^i_1 \oplus S^i_2) - \tau_{i+1}Pr\big((S^{i+1}_1 \oplus S^{i+1}_2) \cap Q_i\big)$

$d_\Delta(S^{i+1}_1 \cup S^{i+1}_2, S^*) \le Pr(\overline{S^i_1} \cap \overline{S^i_2} - \overline{S^*}) + \frac{1}{8}Pr(S^i_1 \oplus S^i_2) + \tau_{i+1}Pr\big((S^{i+1}_1 \oplus S^{i+1}_2) \cap Q_i\big)$

As in each round some contention points are queried and added into the training set, the difference between the two views decreases, i.e., $Pr(S^{i+1}_1 \oplus S^{i+1}_2)$ is no larger than $Pr(S^i_1 \oplus S^i_2)$. Let $\gamma_i = \frac{Pr(S^{i+1}_1 \oplus S^{i+1}_2)}{Pr(S^i_1 \oplus S^i_2)}$. With Definition 1 and different combinations of $\tau_{i+1}$ and $\gamma_i$, we can have either $\frac{d_\Delta(S^{i+1}_1 \cap S^{i+1}_2,\, S^*)}{d_\Delta(S^i_1 \cap S^i_2,\, S^*)} \le \frac{5\alpha+8}{6\alpha+8}$ or $\frac{d_\Delta(S^{i+1}_1 \cup S^{i+1}_2,\, S^*)}{d_\Delta(S^i_1 \cup S^i_2,\, S^*)} \le \frac{5\alpha+8}{6\alpha+8}$. When $s = \lceil \frac{2\log\frac{1}{8\epsilon}}{\log\frac{1}{C_2}} \rceil$, where $C_2 = \frac{5\alpha+8}{6\alpha+8}$ is a constant less than 1, we have either $d_\Delta(S^s_1 \cap S^s_2, S^*) \le \epsilon$ or $d_\Delta(S^s_1 \cup S^s_2, S^*) \le \epsilon$. Thus, with (5) and (6), we have either $R(h^s_+) \le R(S^*) + \epsilon$ or $R(h^s_-) \le R(S^*) + \epsilon$. □

From Theorem 1 we know that we only need to request $\sum_{i=0}^{s} m_i = \widetilde{O}(\log\frac{1}{\epsilon})$ labels to learn $h^s_+$ and $h^s_-$, at least one of which has error rate no larger than $R(S^*) + \epsilon$ with probability at least $1 - \delta$. If we choose $h^s_+$ and it happens to satisfy $R(h^s_+) \le R(S^*) + \epsilon$, we get a classifier whose error rate is no larger than $R(S^*) + \epsilon$.
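To see the $\widetilde{O}(\log\frac{1}{\epsilon})$ label count concretely, the round count $s$ from Theorem 1 can be tabulated; the value of $\alpha$ below is an arbitrary illustration.

```python
# Tabulate s = ceil(2*log(1/(8*eps)) / log(1/C2)) from Theorem 1, with
# C2 = (5*alpha + 8) / (6*alpha + 8). alpha is an arbitrary illustrative value.

import math

def rounds(eps, alpha):
    C2 = (5 * alpha + 8) / (6 * alpha + 8)
    return math.ceil(2 * math.log(1 / (8 * eps)) / math.log(1 / C2))

for eps in (1e-2, 1e-4, 1e-8):
    print(f"eps={eps:g}  s={rounds(eps, alpha=1.0)}")
# s grows linearly in log(1/eps); since each m_i is constant in eps, the
# total label count sum_i m_i is O(log(1/eps)).
```
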
Fortunately, there are only two classifiers, and the probability of getting the right classifier is no less than $\frac{1}{2}$. To study how to choose between $h^s_+$ and $h^s_-$, we first give Definition 2.

Definition 2 The multi-view classifiers $S_1$ and $S_2$ satisfy $\beta$-condition if (8) holds for some $\beta > 0$:

$\left| \frac{Pr(\{x : x \in S_1 \oplus S_2 \wedge y(x) = 1\})}{Pr(S_1 \oplus S_2)} - \frac{Pr(\{x : x \in S_1 \oplus S_2 \wedge y(x) = 0\})}{Pr(S_1 \oplus S_2)} \right| \ge \beta$   (8)

(8) reflects the difference between the fraction of examples belonging to the positive class and the fraction belonging to the negative class in the contention region $S_1 \oplus S_2$. Based on Definition 2, we give Lemma 2, which provides information for deciding how to choose between $h_+$ and $h_-$. This helps to get Theorem 2.

Lemma 2 If the multi-view classifiers $S^s_1$ and $S^s_2$ satisfy the $\beta$-condition, with $\frac{2\log(\frac{4}{\delta})}{\beta^2}$ labels we can decide correctly whether $Pr(\{x : x \in S^s_1 \oplus S^s_2 \wedge y(x) = 1\})$ or $Pr(\{x : x \in S^s_1 \oplus S^s_2 \wedge y(x) = 0\})$ is smaller, with probability at least $1 - \delta$.

Theorem 2 For data distribution $D$ $\alpha$-expanding with respect to hypothesis class $H_1 \times H_2$ according to Definition 1, when the non-degradation condition holds, if the multi-view classifiers satisfy the $\beta$-condition, by requesting $\widetilde{O}(\log\frac{1}{\epsilon})$ labels the multi-view active learning in Table 1 will generate a classifier whose error rate is no larger than $R(S^*) + \epsilon$ with probability at least $1 - \delta$.

From Theorem 2 we know that we only need to request $\widetilde{O}(\log\frac{1}{\epsilon})$ labels to learn a classifier with error rate no larger than $R(S^*) + \epsilon$ with
probability at least $1 - \delta$. Thus, we achieve an exponential improvement in the sample complexity of active learning in the non-realizable case under the multi-view setting. Sometimes, the difference between the fraction of examples belonging to the positive class and the fraction belonging to the negative class in $S^s_1 \oplus S^s_2$ may be very small, i.e., (9) holds:

$\left| \frac{Pr(\{x : x \in S^s_1 \oplus S^s_2 \wedge y(x) = 1\})}{Pr(S^s_1 \oplus S^s_2)} - \frac{Pr(\{x : x \in S^s_1 \oplus S^s_2 \wedge y(x) = 0\})}{Pr(S^s_1 \oplus S^s_2)} \right| = O(\epsilon)$   (9)

If so, we need not estimate whether $R(h^s_+)$ or $R(h^s_-)$ is smaller, and Theorem 3 indicates that both $h^s_+$ and $h^s_-$ are good approximations of the optimal classifier.

Theorem 3 For data distribution $D$ $\alpha$-expanding with respect to hypothesis class $H_1 \times H_2$ according to Definition 1, when the non-degradation condition holds, if (9) is satisfied, by requesting $\widetilde{O}(\log\frac{1}{\epsilon})$ labels the multi-view active learning in Table 1 will generate two classifiers $h^s_+$ and $h^s_-$ which satisfy either (a) or (b) with probability at least $1 - \delta$: (a) $R(h^s_+) \le R(S^*) + \epsilon$ and $R(h^s_-) \le R(S^*) + O(\epsilon)$; (b) $R(h^s_+) \le R(S^*) + O(\epsilon)$ and $R(h^s_-) \le R(S^*) + \epsilon$.

The complete proof of Theorem 1, and the proofs of Lemma 2, Theorem 2 and Theorem 3, are given in the supplementary file.

5.2 The Situation Where $S_1^* \ne S_2^*$

Although the two views represent the same learning task and generally are consistent with each other, sometimes $S_1^*$ may not be equal to $S_2^*$. Therefore, the $\alpha$-expansion assumption in Definition 1 should be adjusted to the situation where $S_1^* \ne S_2^*$. To analyze this theoretically, we replace $S^*$ by $S_1^* \cap S_2^*$ in Definition 1 and get (10).
Similarly to Theorem 1, we get Theorem 4.\n\n1 may be not equal to S \u2217\n\n1 6= S \u2217\n\n1 \u2229 S \u2217\n\nP r(cid:0)S1 \u2295 S2(cid:1) \u2265 \u03b1(cid:16)P r(cid:0)S1 \u2229 S2 \u2212 S \u2217\n\n1 \u2229 S \u2217\n\n2(cid:1) + P r(cid:0)S1 \u2229 S2 \u2212 S \u2217\n\n1 \u2229 S \u2217\n\n2(cid:1)(cid:17)\n\n(10)\n\nTheorem 4 For data distribution D \u03b1-expanding with respect to hypothesis class H1 \u00d7 H2 accord-\ning to (10), when the non-degradation condition holds, if s = \u2308 2 log 1\nlog 1\nC2\nlog( 16(s+1)\nleast one of which is with error rate no larger than R(S \u2217\n(V , k, C1 and C2 are given in Theorem 1.)\n\n)(cid:1), the multi-view active learning in Table 1 will generate two classi\ufb01ers hs\n\n\u2212, at\n2 ) + \u01eb with probability at least 1 \u2212 \u03b4.\n\n\u2309 and mi = 256kC\nC2\n\n1 (cid:0)V +\n\n+ and hs\n\n1 \u2229 S \u2217\n\n8\u01eb\n\n\u03b4\n\n6\n\n\fInput: Unlabeled data set U = {x1, x2, \u00b7 \u00b7 \u00b7 , } where each example xj is given as a pair (xj\nProcess:\n\nTable 2: Multi-view active learning without the non-degradation condition\n1, xj\n2)\nQuery the labels of m0 instances drawn randomly from U to compose the labeled data set L;\nTrain the classi\ufb01er h0\n\nv (v = 1, 2) by minimizing the empirical risk with L in each view:\n\nh0\n\nv = arg minh\u2208HvP(x1,x2,y)\u2208L I(h(xv) 6= y);\n\niterate: i = 1, \u00b7 \u00b7 \u00b7 , s\n\n2\n\n1\n\nand hi\u22121\n\nto the unlabeled data set U and \ufb01nd out the contention point set Qi;\n\nApply hi\u22121\nQuery the labels of mi instances drawn randomly from Qi, then add them into L and delete them\nfrom U ;\nQuery the labels of (2i \u2212 1)mi instances drawn randomly from U \u2212 Qi, then add them into L and\ndelete them from U ;\nTrain the classi\ufb01er hi\n\nv by minimizing the empirical risk with L in each view:\n\nhi\n\nv = arg minh\u2208HvP(x1,x2,y)\u2208L I(h(xv) 6= y).\n\nend iterate\nOutput: hs\n\n+ and hs\n\n\u2212\n\nv is the optimal 
Bayes classi\ufb01er in the v-th view, obviously, R(S \u2217\nv ), (v = 1, 2). So, learning a classi\ufb01er with error rate no larger than R(S \u2217\n\nProof. Since S \u2217\nthan R(S \u2217\nharder than learning a classi\ufb01er with error rate no larger than R(S \u2217\na classi\ufb01er with error rate no larger than R(S \u2217\nR(Si\n1 \u2229 S \u2217\nno larger than R(S \u2217\nerror rate is less than R(S \u2217\nthe discussion of Section 5.1, with the proof of Theorem 1 we get Theorem 4 proved.\n\n2 ) is no less\n2 ) + \u01eb is not\nv ) + \u01eb. Now we aim at learning\n2 ) + \u01eb. Without loss of generality, we assume\n2 ), we get a classi\ufb01er with error rate\n2 ) + \u01eb. Thus, we can neglect the probability mass on the hypothesis whose\n2 in\n(cid:3)\n\n2 ) for i = 0, 1, . . . , s. If R(Si\n1 \u2229 S \u2217\n\n2 as the optimal. Replacing S \u2217 by S \u2217\n\n2 ) and regard S \u2217\n\n1 \u2229 S \u2217\n1 \u2229 S \u2217\n\nv) \u2264 R(S \u2217\n\nv) > R(S \u2217\n\n1 \u2229 S \u2217\n\n1 \u2229 S \u2217\n\n1 \u2229 S \u2217\n\n1 \u2229 S \u2217\n\n1 \u2229 S \u2217\n\nTheorem 4 shows that for the situation where S \u2217\ntwo classi\ufb01ers hs\nwith probability at least 1 \u2212 \u03b4. 
With Lemma 2, we get Theorem 5 from Theorem 4.

Theorem 5 For data distribution D α-expanding with respect to hypothesis class H_1 × H_2 according to (10), when the non-degradation condition holds, if the multi-view classifiers satisfy the β-condition, by requesting Õ(log(1/ε)) labels the multi-view active learning in Table 1 will generate a classifier whose error rate is no larger than R(S*_1 ∩ S*_2) + ε with probability at least 1 − δ.

Generally, R(S*_1 ∩ S*_2) is larger than R(S*_1) and R(S*_2). When S*_1 is not too much different from S*_2, i.e., Pr(S*_1 ⊕ S*_2) ≤ ε/2, we have Corollary 1, which indicates that the exponential improvement in the sample complexity of active learning with Tsybakov noise is still possible.

Corollary 1 For data distribution D α-expanding with respect to hypothesis class H_1 × H_2 according to (10), when the non-degradation condition holds, if the multi-view classifiers satisfy the β-condition and Pr(S*_1 ⊕ S*_2) ≤ ε/2, by requesting Õ(log(1/ε)) labels the multi-view active learning in Table 1 will generate a classifier with error rate no larger than R(S*_v) + ε (v = 1, 2) with probability at least 1 − δ.

The proofs of Theorem 5 and Corollary 1 are given in the supplementary file.

6 Multi-view Active Learning without Non-degradation Condition

Section 5 considers situations where the non-degradation condition holds; there are cases, however, in which the non-degradation condition (7) does not hold. In this section we focus on the multi-view active learning in Table 2 and give an analysis with the non-degradation condition waived. Firstly, we give Theorem 6 for the sample complexity of multi-view active learning in Table 2 when S*_1 = S*_2 = S*.

Theorem 6 For data distribution D α-expanding with respect to hypothesis class H_1 × H_2 according to Definition 1, if s = ⌈2 log(1/(8ε)) / log(1/C_2)⌉ and m_i = (256kC_1/C_2²)(V + log(16(s+1)/δ)), the multi-view active learning in Table 2 will generate two classifiers h^s_+ and h^s_−, at least one of which has error rate no larger than R(S*) + ε with probability at least 1 − δ. (V, k, C_1 and C_2 are given in Theorem 1.)

Proof sketch. In the (i+1)-th round, we randomly query (2^{i+1} − 1)m_i labels from Q_i and add them into L. So the number of training examples for S^{i+1}_v (v = 1, 2) is larger than the number of whole training examples for S^i_v. Thus we know that d(S^{i+1}_v |Q_i, S*|Q_i) ≤ d(S^i_v |Q_i, S*|Q_i) holds for any φ_v. Setting φ_v ∈ {0, 1}, the non-degradation condition (7) stands. Thus, with the proof of Theorem 1 we get Theorem 6 proved. □

Theorem 6 shows that we can request Σ_{i=0}^{s} 2^i m_i = Õ(1/ε) labels to learn two classifiers h^s_+ and h^s_−, at least one of which has error rate no larger than R(S*) + ε with probability at least 1 − δ. To guarantee the non-degradation condition (7), we only need to query (2^i − 1)m_i more labels in the i-th round.
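The loop in Table 2, together with the label accounting behind Theorem 6, can be sketched as follows. This is a minimal sketch under assumptions of ours, not the paper's implementation: one real-valued feature per view, threshold hypothesis classes, and a noiseless toy oracle; `erm_threshold`, `table2_active_learning`, and the data model are hypothetical, and both the choice of m_i and s from Theorem 6 and the final combination into h^s_+ and h^s_− are omitted.

```python
import random

# Illustrative sketch of the Table 2 procedure (toy model of ours): each
# view is one real feature and each hypothesis class H_v contains the
# threshold classifiers h_t(x) = 1[x >= t], trained by empirical risk
# minimization.

def erm_threshold(labeled, view):
    """Pick the threshold minimizing empirical risk in one view."""
    candidates = [x[view] for x, _ in labeled] + [float("-inf")]
    def risk(t):
        return sum(((x[view] >= t) != y) for x, y in labeled)
    return min(candidates, key=risk)

def table2_active_learning(pool, oracle, m, s):
    """Round i queries m labels from the contention set Q_i and
    (2**i - 1) * m labels from outside it, as in Table 2. Returns the two
    per-view thresholds and the total number of labels queried."""
    random.shuffle(pool)
    labeled = [(x, oracle(x)) for x in pool[:m]]  # initial m_0 = m queries
    pool = pool[m:]
    queried = m
    t1 = erm_threshold(labeled, 0)
    t2 = erm_threshold(labeled, 1)
    for i in range(1, s + 1):
        # contention points: instances on which the two views disagree
        q = [x for x in pool if (x[0] >= t1) != (x[1] >= t2)]
        rest = [x for x in pool if (x[0] >= t1) == (x[1] >= t2)]
        random.shuffle(q)
        random.shuffle(rest)
        picked = q[:m] + rest[:(2 ** i - 1) * m]
        queried += len(picked)
        labeled += [(x, oracle(x)) for x in picked]
        chosen = set(picked)
        pool = [x for x in pool if x not in chosen]
        t1 = erm_threshold(labeled, 0)
        t2 = erm_threshold(labeled, 1)
    return t1, t2, queried

random.seed(0)
oracle = lambda x: int(x[0] + x[1] >= 1.0)  # toy labeling rule
pool = [(random.random(), random.random()) for _ in range(3000)]
t1, t2, queried = table2_active_learning(pool, oracle, m=10, s=4)
# When every round fills its quota, the total label count is
# sum_{i=0}^{s} 2^i * m = (2^{s+1} - 1) * m, i.e. 310 here.
print(queried, sum(2 ** i * 10 for i in range(5)))
```

Note how the (2^i − 1)m_i extra queries outside Q_i make the per-round label cost 2^i m_i; it is this doubling that drives the total to Õ(1/ε) here, rather than the Õ(log(1/ε)) obtained when the non-degradation condition holds.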
With Lemma 2, we get Theorem 7.

Theorem 7 For data distribution D α-expanding with respect to hypothesis class H_1 × H_2 according to Definition 1, if the multi-view classifiers satisfy the β-condition, by requesting Õ(1/ε) labels the multi-view active learning in Table 2 will generate a classifier whose error rate is no larger than R(S*) + ε with probability at least 1 − δ.

Theorem 7 shows that, without the non-degradation condition, we need to request Õ(1/ε) labels to learn a classifier with error rate no larger than R(S*) + ε with probability at least 1 − δ. The order of 1/ε is independent of the parameter in Tsybakov noise. Similarly to Theorem 3, we get Theorem 8, which indicates that both h^s_+ and h^s_− are good approximations of the optimal classifier.

Theorem 8 For data distribution D α-expanding with respect to hypothesis class H_1 × H_2 according to Definition 1, if (9) holds, by requesting Õ(1/ε) labels the multi-view active learning in Table 2 will generate two classifiers h^s_+ and h^s_− which satisfy either (a) or (b) with probability at least 1 − δ: (a) R(h^s_+) ≤ R(S*) + ε and R(h^s_−) ≤ R(S*) + O(ε); (b) R(h^s_+) ≤ R(S*) + O(ε) and R(h^s_−) ≤ R(S*) + ε.

As for the situation where S*_1 ≠ S*_2, similarly to Theorem 5 and Corollary 1, we have Theorem 9 and Corollary 2.

Theorem 9 For data distribution D α-expanding with respect to hypothesis class H_1 × H_2 according to (10), if the multi-view classifiers satisfy the β-condition, by requesting Õ(1/ε) labels the multi-view active learning in Table 2 will generate a classifier whose error rate is no larger than R(S*_1 ∩ S*_2) + ε with probability at least 1 − δ.

Corollary 2 For data distribution D α-expanding with respect to hypothesis class H_1 × H_2 according to (10), if the multi-view classifiers satisfy the β-condition and Pr(S*_1 ⊕ S*_2) ≤ ε/2, by requesting Õ(1/ε) labels the multi-view active learning in Table 2 will generate a classifier with error rate no larger than R(S*_v) + ε (v = 1, 2) with probability at least 1 − δ.

The complete proof of Theorem 6, the proofs of Theorems 7 to 9 and Corollary 2 are given in the supplementary file.

7 Conclusion

We present the first study on active learning in the non-realizable case under the multi-view setting in this paper. We prove that the sample complexity of multi-view active learning with unbounded Tsybakov noise can be improved to Õ(log(1/ε)), contrasting to the single-view setting where only polynomial improvement is proved possible under the same noise condition. In the general multi-view setting, we prove that the sample complexity of active learning with unbounded Tsybakov noise is Õ(1/ε), where the order of 1/ε is independent of the parameter in Tsybakov noise, contrasting to previous polynomial bounds where the order of 1/ε is related to the parameter in Tsybakov noise.
Generally, the non-realizability of a learning task can be caused by many kinds of noise, e.g., misclassification noise and malicious noise. It would be interesting to extend our work to more general noise models.

Acknowledgments

This work was supported by the NSFC (60635030, 60721002), the 973 Program (2010CB327903) and JiangsuSF (BK2008018).

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, UK, 1999.
[2] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, pages 65–72, 2006.
[3] M.-F. Balcan, A. Blum, and K. Yang. Co-training and expansion: Towards bridging theory and practice. In NIPS 17, pages 89–96, 2005.
[4] M.-F. Balcan, A. Z. Broder, and T. Zhang. Margin based active learning. In COLT, pages 35–50, 2007.
[5] M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In COLT, pages 45–56, 2008.
[6] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100, 1998.
[7] R. M. Castro and R. D. Nowak. Upper and lower error bounds for active learning. In Allerton Conference, pages 225–234, 2006.
[8] R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.
[9] G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Linear classification and selective sampling under low noise conditions. In NIPS 21, pages 249–256, 2009.
[10] D. A. Cohn, L. E. Atlas, and R. E. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
[11] S. Dasgupta. Analysis of a greedy active learning strategy. In NIPS 17, pages 337–344, 2005.
[12] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS 18, pages 235–242, 2006.
[13] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS 20, pages 353–360, 2008.
[14] S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In COLT, pages 249–263, 2005.
[15] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
[16] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
[17] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages 353–360, 2007.
[18] S. Hanneke. Adaptive rates of convergence in active learning. In COLT, 2009.
[19] M. Kääriäinen. Active learning in the non-realizable case. In ALT, pages 63–77, 2006.
[20] I. Muslea, S. Minton, and C. A. Knoblock. Active + semi-supervised learning = robust multi-view learning. In ICML, pages 435–442, 2002.
[21] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
[22] L. Wang. Sufficient conditions for agnostic active learnable. In NIPS 22, pages 1999–2007, 2009.
[23] W. Wang and Z.-H. Zhou. On multi-view active learning and the combination with semi-supervised learning. In ICML, pages 1152–1159, 2008.