{"title": "A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 65, "page_last": 72, "abstract": "In this paper we present two transductive bounds on the risk of the majority vote estimated over partially labeled training sets. Our first bound is tight when the additional unlabeled training data are used in the cases where the voted classifier makes its errors on low margin observations and where the errors of the associated Gibbs classifier can accurately be estimated. In semi-supervised learning, considering the margin as an indicator of confidence constitutes the working hypothesis of algorithms which search the decision boundary on low density regions. In this case, we propose a second bound on the joint probability that the voted classifier makes an error over an example having its margin over a fixed threshold. As an application we are interested on self-learning algorithms which assign iteratively pseudo-labels to unlabeled training examples having margin above a threshold obtained from this bound. Empirical results on different datasets show the effectiveness of our approach compared to the same algorithm and the TSVM in which the threshold is fixed manually.", "full_text": "A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning\n\nMassih R. Amini\n\nLaboratoire d\u2019Informatique de Paris 6\n\nUniversit\u00e9 Pierre et Marie Curie, Paris, France\n\nmassih-reza.amini@lip6.fr\n\nFran\u00e7ois Laviolette\n\nUniversit\u00e9 Laval\n\nQu\u00e9bec (QC), Canada\n\nfrancois.laviolette@ift.ulaval.ca\n\nNicolas Usunier\n\nLaboratoire d\u2019Informatique de Paris 6\n\nUniversit\u00e9 Pierre et Marie Curie, Paris, France\n\nnicolas.usunier@lip6.fr\n\nAbstract\n\nWe propose two transductive bounds on the risk of majority votes that are estimated over partially labeled training sets. 
The first one involves the margin distribution of the classifier and a risk bound on its associated Gibbs classifier. The bound is tight when the Gibbs bound is tight and when the errors of the majority vote classifier are concentrated in a zone of low margin. In semi-supervised learning, considering the margin as an indicator of confidence constitutes the working hypothesis of algorithms that search for the decision boundary in low-density regions. Following this assumption, we propose to bound the error probability of the voted classifier on the examples whose margins are above a fixed threshold. As an application, we propose a self-learning algorithm which iteratively assigns pseudo-labels to the unlabeled training examples whose margin is above a threshold obtained from this bound. Empirical results on different datasets show the effectiveness of our approach compared to TSVM and to the same algorithm with a manually fixed threshold.\n\n1 Introduction\n\nEnsemble methods [5] return a weighted vote of baseline classifiers. It is well known that under the PAC-Bayes framework [9], one can obtain an estimation of the generalization error (also called risk) of such majority votes (referred to as Bayes classifiers). Unfortunately, those bounds are generally not tight, mainly because they are obtained indirectly via a bound on a randomized combination of the baseline classifiers (called the Gibbs classifier). Although the PAC-Bayes theorem gives tight risk bounds for Gibbs classifiers, the bounds for their associated Bayes classifiers come at the cost of a worse risk (trivially a factor of 2, or, under some margin assumption, a factor of 1+\u03b5). In practice the Bayes risk is often smaller than the Gibbs risk.\nIn this paper we present a transductive bound over the Bayes risk. 
This bound is also based on the risk of the associated Gibbs classifier, but it takes as additional information the exact knowledge of the margin distribution of unlabeled data. This bound is obtained by analytically solving a linear program. The intuitive idea here is that, given the risk of the Gibbs classifier and the margin distribution, the risk of the majority vote classifier is maximized when all its errors are located on low margin examples. We show that our bound is tight when the associated Gibbs risk can accurately be estimated and when the Bayes classifier makes most of its errors on low margin examples.\nThe proof of this transductive bound makes use of the (joint) probability over an unlabeled data set that the majority vote classifier makes an error and the margin is above a given threshold. This second result naturally leads us to consider the conditional probability that the majority vote classifier makes an error knowing that the margin is above a given threshold.\nThis conditional probability is related to the concept that the margin is an indicator of confidence, which is recurrent in semi-supervised self-learning algorithms [3,6,10,11,12]. These methods first train a classifier on the labeled training examples. The classifier outputs then serve to assign pseudo-class labels to unlabeled data having margin above a given threshold. The supervised method is retrained using the initial labeled set and its previous predictions on unlabeled data as additional labeled examples. Practical algorithms almost always fix the margin threshold manually.\nIn the second part of the paper, we propose to find this margin threshold by minimizing the bound on the conditional probability. 
Empirical results on different datasets show the effectiveness of our approach compared to TSVM [7] and to the same algorithm with a manually fixed threshold as in [11].\nIn the remainder of the paper, we present, in section 2, our transductive bounds and show their outcomes in terms of sufficient conditions under which unlabeled data may be of help in the learning process and a linear programming method to estimate these bounds. In section 4, we present experimental results obtained with a self-learning algorithm on different datasets in which we use the bound presented in section 2.2 for choosing the threshold which serves in the label assignment step of the algorithm. Finally, in section 5 we discuss the outcomes of this study and give some pointers to further research.\n\n2 Transductive Bounds on the Risk of the Voted Classifier\n\nWe are interested in the study of binary classification problems where the input space $\\mathcal{X}$ is a subset of $\\mathbb{R}^d$ and the output space is $\\mathcal{Y} = \\{-1, +1\\}$. We furthermore suppose that the training set is composed of a labeled set $Z_\\ell = ((x_i, y_i))_{i=1}^{l} \\in \\mathcal{Z}^l$ and an unlabeled set $X_U = (x'_i)_{i=l+1}^{l+u} \\in \\mathcal{X}^u$, where $\\mathcal{Z}$ represents the set $\\mathcal{X} \\times \\mathcal{Y}$. We suppose that each pair $(x, y) \\in Z_\\ell$ is drawn i.i.d. with respect to a fixed, but unknown, probability distribution $D$ over $\\mathcal{X} \\times \\mathcal{Y}$ and we denote the marginal distribution over $\\mathcal{X}$ by $D_{\\mathcal{X}}$.\nTo simplify the notation and the proofs, we restrict ourselves to the deterministic labeling case, that is, for each $x' \\in X_U$, there is exactly one possible label that we will denote by $y'$ (see footnote 1).\nIn this study, we consider learning algorithms that work in a fixed hypothesis space $\\mathcal{H}$ of binary classifiers (defined without reference to the training data). 
After observing the training set $S = Z_\\ell \\cup X_U$, the task of the learner is to choose a posterior distribution $Q$ over $\\mathcal{H}$ such that the $Q$-weighted majority vote classifier $B_Q$ (also called the Bayes classifier) will have the smallest possible risk on examples of $X_U$. Recall that the Bayes classifier is defined by\n\n$$B_Q(x) = \\mathrm{sgn}\\left[\\mathbb{E}_{h \\sim Q}\\, h(x)\\right] \\quad \\forall x \\in \\mathcal{X}, \\qquad (1)$$\n\nwhere $\\mathrm{sgn}(x) = +1$ if the real number $x > 0$ and $-1$ otherwise. We further denote by $G_Q$ the associated Gibbs classifier which, for classifying any example $x \\in \\mathcal{X}$, randomly chooses a classifier $h$ according to the distribution $Q$. We accordingly define the transductive risk of $G_Q$ over an unlabeled set by\n\n$$R_u(G_Q) \\stackrel{\\mathrm{def}}{=} \\frac{1}{u} \\sum_{x' \\in X_U} \\mathbb{E}_{h \\sim Q} [[h(x') \\neq y']], \\qquad (2)$$\n\nwhere $[[\\pi]] = 1$ if predicate $\\pi$ holds and 0 otherwise, and for every unlabeled example $x' \\in X_U$ we refer to $y'$ as its true unknown class label. In section 2.1 we show that if we consider the margin as an indicator of confidence and if we have at our disposal a tight upper bound $R_u^\\delta(G_Q)$ on the risk of $G_Q$ which holds with probability $1 - \\delta$ over the random choice of $Z_\\ell$ and $X_U$ (for example using Theorem 17 or 18 of Derbeko et al. [4]), then we are able to accurately bound the transductive risk of the Bayes classifier:\n\n$$R_u(B_Q) \\stackrel{\\mathrm{def}}{=} \\frac{1}{u} \\sum_{x' \\in X_U} [[B_Q(x') \\neq y']]. \\qquad (3)$$\n\nThis result follows from a bound on the joint Bayes risk:\n\n$$R_{u \\wedge \\theta}(B_Q) \\stackrel{\\mathrm{def}}{=} \\frac{1}{u} \\sum_{x' \\in X_U} [[B_Q(x') \\neq y' \\wedge m_Q(x') > \\theta]], \\qquad (4)$$\n\nwhere $m_Q(\\cdot) = |\\mathbb{E}_{h \\sim Q}\\, h(\\cdot)|$ denotes the unsigned margin function. 
One of the practical issues that arises from this result is the possibility of defining a threshold $\\theta$ for which the bound is optimal, and which we use in a self-learning algorithm by iteratively assigning pseudo-labels to unlabeled examples having margin above this threshold. We finally denote by $\\mathbb{E}_u z$ the expectation of a random variable $z$ with respect to the uniform distribution over $X_U$ and, for notational convenience, we equivalently define $P_u$ as the uniform probability distribution over $X_U$, i.e. for any subset $A$, $P_u(A) = \\frac{1}{u}\\,\\mathrm{card}(A)$.\n\nFootnote 1: The proofs can be extended to the more general noisy case, but one has to replace the summation $\\sum_{x' \\in X_U}$ by $\\sum_{(x', y') \\in X_U \\times \\{-1, +1\\}} P_{(x', y') \\sim D}(y' | x')$ in the definitions of equations (3) and (4).\n\n2.1 Main Result\n\nOur main result is the following theorem which provides two bounds on the transductive risks of the Bayes classifier (3) and the joint Bayes risk (4).\nTheorem 1 Suppose that $B_Q$ is as in (1). Then for all $Q$ and all $\\delta \\in (0, 1]$, with probability at least $1 - \\delta$:\n\n$$R_u(B_Q) \\le \\inf_{\\gamma \\in (0,1]} \\left\\{ P_u(m_Q(x') < \\gamma) + \\frac{1}{\\gamma} \\left\\lfloor K_u^\\delta(Q) - M_Q^{<}(\\gamma) \\right\\rfloor_+ \\right\\}, \\qquad (5)$$\n\nwhere $K_u^\\delta(Q) = R_u^\\delta(G_Q) + \\frac{1}{2}\\left(\\mathbb{E}_u m_Q(x') - 1\\right)$, $M_Q^{\\triangleleft}(t) = \\mathbb{E}_u\\, m_Q(x') [[m_Q(x') \\triangleleft t]]$ for $\\triangleleft$ being $<$ or $\\le$, and $\\lfloor \\cdot \\rfloor_+$ denotes the positive part (i.e. $\\lfloor x \\rfloor_+ = [[x > 0]]\\, x$).\nMore generally, with probability at least $1 - \\delta$, for all $Q$ and all $\\theta \\ge 0$:\n\n$$R_{u \\wedge \\theta}(B_Q) \\le \\inf_{\\gamma \\in (\\theta,1]} \\left\\{ P_u(\\theta < m_Q(x') < \\gamma) + \\frac{1}{\\gamma} \\left\\lfloor K_u^\\delta(Q) + M_Q^{\\le}(\\theta) - M_Q^{<}(\\gamma) \\right\\rfloor_+ \\right\\}. \\qquad (6)$$\n\nIn section 2.2 we will prove that the bound (5) simply follows from (6). In order to better understand the former bound on the risk of the Bayes classifier, denote by $F_u^\\delta(Q)$ the right hand side of equation (5):\n\n$$F_u^\\delta(Q) \\stackrel{\\mathrm{def}}{=} \\inf_{\\gamma \\in (0,1]} \\left\\{ P_u(m_Q(x') < \\gamma) + \\frac{1}{\\gamma} \\left\\lfloor K_u^\\delta(Q) - M_Q^{<}(\\gamma) \\right\\rfloor_+ \\right\\}$$\n\nand consider the following special case where the classifier makes most of its errors on unlabeled examples with low margin. Proposition 2, together with the explanations that follow, makes this idea clearer.\nProposition 2 Assume that $\\forall x \\in X_U, m_Q(x) > 0$ and that $\\exists C \\in (0, 1]$ such that $\\forall \\gamma > 0$:\n\n$$P_u(B_Q(x') \\neq y' \\wedge m_Q(x') = \\gamma) \\neq 0 \\Rightarrow P_u(B_Q(x') \\neq y' \\wedge m_Q(x') < \\gamma) \\ge C \\cdot P_u(m_Q(x') < \\gamma).$$\n\nThen, with probability at least $1 - \\delta$:\n\n$$F_u^\\delta(Q) - R_u(B_Q) \\le \\frac{1 - C}{C} R_u(B_Q) + \\frac{R_u^\\delta(G_Q) - R_u(G_Q)}{\\gamma^*}, \\qquad (7)$$\n\nwhere $\\gamma^* = \\sup\\{\\gamma \\mid P_u(B_Q(x') \\neq y' \\wedge m_Q(x') = \\gamma) \\neq 0\\}$.\nNow, suppose that the margin is an indicator of confidence. 
Then, a Bayes classifier that makes its errors mostly on low margin regions will admit a coefficient $C$ in inequality (7) close to 1, and the bound of (5) becomes tight (provided we have an accurate upper bound $R_u^\\delta(G_Q)$). In the next section we provide proofs of all the statements above and show in lemma 4 a simple way to compute the best margin threshold for which the general bound on the joint Bayes risk is the lowest.\n\n2.2 Proofs\n\nAll our proofs are based on the relationship between $R_u(G_Q)$ and $R_u(B_Q)$ and the following lemma:\n\nLemma 3 Let $(\\gamma_1, \\ldots, \\gamma_N)$ be the ordered sequence of the different strictly positive values of the margin on $X_U$, that is $\\{\\gamma_i, i = 1..N\\} = \\{m_Q(x') \\mid x' \\in X_U \\wedge m_Q(x') > 0\\}$ and $\\forall i \\in \\{1, \\ldots, N-1\\}, \\gamma_i < \\gamma_{i+1}$. Denote moreover $b_i = P_u(B_Q(x') \\neq y' \\wedge m_Q(x') = \\gamma_i)$ for $i \\in \\{1, \\ldots, N\\}$. Then,\n\n$$R_u(G_Q) = \\sum_{i=1}^{N} b_i \\gamma_i + \\frac{1}{2}\\left(1 - \\mathbb{E}_u m_Q(x')\\right), \\qquad (8)$$\n\n$$\\forall \\theta \\in [0, 1], \\quad R_{u \\wedge \\theta}(B_Q) = \\sum_{i=k+1}^{N} b_i \\quad \\text{with } k = \\max\\{i \\mid \\gamma_i \\le \\theta\\}. \\qquad (9)$$\n\nProof Equation (9) follows from the definition $R_{u \\wedge \\theta}(B_Q) = P_u(B_Q(x') \\neq y' \\wedge m_Q(x') > \\theta)$.\nEquation (8) is obtained from the definition of the margin $m_Q$, which can be written as\n\n$$\\forall x' \\in X_U, \\quad m_Q(x') = \\left| \\mathbb{E}_{h \\sim Q}[[h(x') = 1]] - \\mathbb{E}_{h \\sim Q}[[h(x') = -1]] \\right| = \\left| 1 - 2\\, \\mathbb{E}_{h \\sim Q}[[h(x') \\neq y']] \\right|.$$\n\nBy noticing that for all $x' \\in X_U$ the condition $\\mathbb{E}_{h \\sim Q}[[h(x') \\neq y']] > \\frac{1}{2}$ is equivalent to the statement $y'\\, \\mathbb{E}_{h \\sim Q} h(x') < 0$, or $B_Q(x') \\neq y'$, we can rewrite $m_Q$ without absolute values and hence get:\n\n$$\\mathbb{E}_{h \\sim Q}[[h(x') \\neq y']] = \\frac{1}{2}\\left(1 + m_Q(x')\\right)[[B_Q(x') \\neq y']] + \\frac{1}{2}\\left(1 - m_Q(x')\\right)[[B_Q(x') = y']]. \\qquad (10)$$\n\nFinally, equation (8) is obtained by taking the mean over $x' \\in X_U$ and by reorganizing the equation using the notations $b_i$ and $\\gamma_i$. Recall that the examples $x'$ for which $m_Q(x') = 0$ count for 0 in the sum that defines the Gibbs risk (see equation (2) and the definition of $m_Q$). \u25a1\n\nProof of Theorem 1 First, we notice that equation (5) follows from equation (6), from the fact that $M_Q^{\\le}(0) = 0$, and from the following inequality:\n\n$$R_u(B_Q) = R_{u \\wedge 0}(B_Q) + P_u(B_Q(x') \\neq y' \\wedge m_Q(x') = 0) \\le R_{u \\wedge 0}(B_Q) + P_u(m_Q(x') = 0).$$\n\nFor proving equation (6), we know from lemma 3 that for a fixed $\\theta \\in [0, 1]$ there exist $(b_1, \\ldots, b_N)$ such that $0 \\le b_i \\le P_u(m_Q(x') = \\gamma_i)$ and which satisfy equations (8) and (9).\nLet $k = \\max\\{i \\mid \\gamma_i \\le \\theta\\}$. Assuming now that we can obtain an upper bound $R_u^\\delta(G_Q)$ of $R_u(G_Q)$ which holds with probability $1 - \\delta$ over the random choices of $Z_\\ell$ and $X_U$, from the definition (4) of $R_{u \\wedge \\theta}(B_Q)$ we then have, with probability $1 - \\delta$,\n\n$$R_{u \\wedge \\theta}(B_Q) \\le \\max_{b_1, \\ldots, b_N} \\sum_{i=k+1}^{N} b_i \\quad \\text{u.c. } \\forall i,\\ 0 \\le b_i \\le P_u(m_Q(x') = \\gamma_i) \\text{ and } \\sum_{i=1}^{N} b_i \\gamma_i \\le K_u^\\delta(Q), \\qquad (11)$$\n\nwhere $K_u^\\delta(Q) = R_u^\\delta(G_Q) - \\frac{1}{2}\\left(1 - \\mathbb{E}_u m_Q(x')\\right)$. It turns out that the right hand side in equation (11) is the solution of a linear program that can be solved analytically and which is attained for:\n\n$$b_i = \\begin{cases} 0 & \\text{if } i \\le k, \\\\ \\min\\left( P_u(m_Q(x') = \\gamma_i),\\ \\left\\lfloor \\frac{K_u^\\delta(Q) - \\sum_{k < j < i} \\gamma_j P_u(m_Q(x') = \\gamma_j)}{\\gamma_i} \\right\\rfloor_+ \\right) & \\text{elsewhere.} \\end{cases} \\qquad (12)$$\n\nFor clarity, we defer the proof of equation (12) to lemma 4, and continue the proof of equation (6).\nUsing the notations defined in Theorem 1, we rewrite $\\sum_{k < j < i} \\gamma_j P_u(m_Q(x') = \\gamma_j)$ as $M_Q^{<}(\\gamma_i) - M_Q^{\\le}(\\theta)$. We further define $I = \\max\\{i \\mid K_u^\\delta(Q) + M_Q^{\\le}(\\theta) - M_Q^{<}(\\gamma_i) > 0\\}$, which implies $\\sum_{i=k+1}^{N} b_i = \\sum_{i=k+1}^{I} b_i$ from equations (11) and (12), with $b_I \\le \\frac{K_u^\\delta(Q) + M_Q^{\\le}(\\theta) - M_Q^{<}(\\gamma_I)}{\\gamma_I}$. From this inequality we hence obtain a bound on $R_{u \\wedge \\theta}(B_Q)$:\n\n$$R_{u \\wedge \\theta}(B_Q) \\le P_u(\\theta < m_Q(x') < \\gamma_I) + \\frac{K_u^\\delta(Q) + M_Q^{\\le}(\\theta) - M_Q^{<}(\\gamma_I)}{\\gamma_I}. \\qquad (13)$$\n\nThe proof of the second point in theorem 1 is just a rewriting of this result as, from the definition of $\\gamma_I$, for any $\\gamma > \\gamma_I$, the right-hand side of equation (6) is equal to $P_u(m_Q(x') < \\gamma)$, which is greater than the right-hand side of equation (13). Moreover, for $\\gamma < \\gamma_I$, we notice that $\\gamma \\mapsto P_u(m_Q(x') < \\gamma) + \\frac{K_u^\\delta(Q) + M_Q^{\\le}(\\theta) - M_Q^{<}(\\gamma)}{\\gamma}$ decreases. \u25a1\nLemma 4 (equation (12)) Let $g_i$, $i = 1 \\ldots N$, be such that $0 < g_i < g_{i+1}$, $p_i \\ge 0$, $i = 1 \\ldots N$, $B \\ge 0$ and $k \\in \\{1, \\ldots, N\\}$. 
Then, the optimal value of the linear program:\n\n$$\\max_{q_1, \\ldots, q_N} \\sum_{i=k+1}^{N} q_i \\quad \\text{u.c. } \\forall i,\\ 0 \\le q_i \\le p_i \\text{ and } \\sum_{i=1}^{N} q_i g_i \\le B \\qquad (14)$$\n\nis attained for $q^*$ defined by: $\\forall i \\le k: q^*_i = 0$ and $\\forall i > k,\\ q^*_i = \\min\\left( p_i, \\left\\lfloor \\frac{B - \\sum_{j < i} q^*_j g_j}{g_i} \\right\\rfloor_+ \\right)$.\nProof Define $O = \\{0\\}^k \\times \\prod_{i=k+1}^{N} [0, p_i]$. We will show that problem (14) has a unique optimal solution in $O$, and that this solution is $q^*$. In the rest of the proof, we denote $F(q) = \\sum_{i=k+1}^{N} q_i$.\nFirst, the problem is convex, feasible (take $\\forall i, q_i = 0$) and bounded. Therefore there is an optimal solution $q^{opt} \\in \\prod_{i=1}^{N} [0, p_i]$. Define $q^{opt,O}$ by $q^{opt,O}_i = q^{opt}_i$ if $i > k$ and $q^{opt,O}_i = 0$ otherwise. Then, $q^{opt,O} \\in O$, it is clearly feasible, and $F(q^{opt,O}) = F(q^{opt})$. Therefore, there is an optimal solution in $O$.\nNow, for $(q, q') \\in \\mathbb{R}^N \\times \\mathbb{R}^N$, define $I(q, q') = \\{i \\mid q_i > q'_i\\}$, and consider the lexicographic order $\\succeq$:\n\n$$\\forall (q, q') \\in \\mathbb{R}^N \\times \\mathbb{R}^N, \\quad q \\succeq q' \\Leftrightarrow I(q', q) = \\emptyset \\text{ or } \\left( I(q, q') \\neq \\emptyset \\text{ and } \\min I(q, q') < \\min I(q', q) \\right).$$\n\nThe crucial point is that $q^*$ is the greatest feasible solution in $O$ for $\\succeq$. Indeed, notice first that $q^*$ is necessarily feasible. To see this result, let $\\mathcal{M}$ be the set $\\{i > k : q^*_i < p_i\\}$; we then have two possibilities to consider. (a) $\\mathcal{M} = \\emptyset$. In this case $q^*$ is simply the maximal element for $\\succeq$ in $O$. (b) $\\mathcal{M} \\neq \\emptyset$. In this case, let $K = \\min\\{i > k \\mid q^*_i < p_i\\}$, so that $\\sum_{i=1}^{N} q^*_i g_i = B$. We claim that there are no feasible $q \\in \\mathbb{R}^N$ such that $q \\succ q^*$. By way of contradiction, suppose such a $q$ exists and let $M = \\min I(q, q^*)$. Then, if $M \\le k$, we have $q_M > 0$, and therefore $q$ is not in $O$; if $k < M < K$, we have $q_M > p_M$, which yields the same result; and finally, if $M \\ge K$, we have $\\sum_{i=1}^{N} q_i g_i > \\sum_{i=1}^{N} q^*_i g_i = B$, and $q$ is not feasible.\nWe now show that if $q \\in O$ is feasible and $q^* \\succ q$, then $q$ is not optimal (which is equivalent to showing that an optimal solution in $O$ must be the greatest feasible solution for $\\succeq$).\nLet $q \\in O$ be a feasible solution such that $q^* \\succ q$. Since $q^* \\succ q$, $I(q^*, q)$ is not empty. If $I(q, q^*) = \\emptyset$, we have $F(q^*) > F(q)$, and therefore $q$ is not optimal. We now treat the case where $I(q, q^*) \\neq \\emptyset$.\nLet $K = \\min I(q^*, q)$ and $M = \\min I(q, q^*)$. We have $q_M > 0$ by definition, and $K < M$ because $q^* \\succ q$ and $q \\in O$. Let $\\lambda = \\min\\left( q_M, \\frac{g_K}{g_M}(p_K - q_K) \\right)$ and define $q'$ by:\n\n$$q'_i = q_i \\text{ if } i \\notin \\{K, M\\}, \\qquad q'_K = q_K + \\frac{g_M}{g_K} \\lambda \\qquad \\text{and} \\qquad q'_M = q_M - \\lambda.$$\n\nWe can see that $q'$ is feasible by the definition of $\\lambda$: it satisfies the box constraints, and $\\sum_i q'_i g_i = \\sum_i q_i g_i + \\frac{g_M}{g_K} \\lambda g_K - \\lambda g_M = \\sum_i q_i g_i \\le B$. Moreover $F(q') = F(q) + \\lambda \\left( \\frac{g_M}{g_K} - 1 \\right) > F(q)$ since $g_K < g_M$ and $\\lambda > 0$. Thus, $q$ is not optimal.\nIn summary, we have shown that there is an optimal solution in $O$, and that a feasible solution in $O$ must be the greatest feasible solution for the lexicographic order in $O$ to be optimal, and this solution is $q^*$. \u25a1\nProof of Proposition 2 First let us claim that\n\n$$R_u(B_Q) \\ge P_u(B_Q(x') \\neq y' \\wedge m_Q(x') < \\gamma^*) + \\frac{1}{\\gamma^*} \\left\\lfloor K_u(Q) - M_Q^{<}(\\gamma^*) \\right\\rfloor_+, \\qquad (15)$$\n\nwhere $\\gamma^* = \\sup\\{\\gamma \\mid P_u(B_Q(x') \\neq y' \\wedge m_Q(x') = \\gamma) \\neq 0\\}$ and $K_u(Q) = R_u(G_Q) + \\frac{1}{2}\\left(\\mathbb{E}_u m_Q(x') - 1\\right)$.\nIndeed, assume for now that equation (15) is true. Then, by assumption we have:\n\n$$R_u(B_Q) \\ge C \\cdot P_u(m_Q(x') < \\gamma^*) + \\frac{1}{\\gamma^*} \\left\\lfloor K_u(Q) - M_Q^{<}(\\gamma^*) \\right\\rfloor_+. \\qquad (16)$$\n\nSince $F_u^\\delta(Q) \\le P_u(m_Q(x') < \\gamma^*) + \\frac{1}{\\gamma^*} \\left\\lfloor K_u^\\delta(Q) - M_Q^{<}(\\gamma^*) \\right\\rfloor_+$, with probability at least $1 - \\delta$ we obtain:\n\n$$F_u^\\delta(Q) - R_u(B_Q) \\le (1 - C)\\, P_u(m_Q(x') < \\gamma^*) + \\frac{R_u^\\delta(G_Q) - R_u(G_Q)}{\\gamma^*}. \\qquad (17)$$\n\nThis is due to the fact that $\\lfloor K_u^\\delta(Q) - M_Q^{<}(\\gamma^*) \\rfloor_+ - \\lfloor K_u(Q) - M_Q^{<}(\\gamma^*) \\rfloor_+ \\le R_u^\\delta(G_Q) - R_u(G_Q)$ when $R_u^\\delta(G_Q) \\ge R_u(G_Q)$. Taking once again equation (16), we have $P_u(m_Q(x') < \\gamma^*) \\le \\frac{1}{C} R_u(B_Q)$. Plugging this result back into equation (17) yields Proposition 2.\nNow, let us prove the claim (equation (15)). Since $\\forall x' \\in X_U, m_Q(x') > 0$, we have $R_u(B_Q) = R_{u \\wedge 0}(B_Q)$. Using the notations of lemma 3, denote by $K$ the index such that $\\gamma_K = \\gamma^*$. Then, it follows from equation (8) of lemma 3 that $R_u(G_Q) = \\sum_{i=1}^{K} b_i \\gamma_i + \\frac{1}{2}\\left(1 - \\mathbb{E}_u m_Q(x')\\right)$. Solving for $b_K$ in this equality yields $b_K = \\frac{K_u(Q) - \\sum_{i=1}^{K-1} b_i \\gamma_i}{\\gamma_K}$, and we therefore have $b_K \\ge \\frac{1}{\\gamma_K} \\lfloor K_u(Q) - M_Q^{<}(\\gamma^*) \\rfloor_+$ since $b_K \\ge 0$ and $\\forall i, b_i \\le P_u(m_Q(x') = \\gamma_i)$. Finally, from equation (9), we have $R_u(B_Q) = \\sum_{i=1}^{K} b_i$, which implies equation (15) by using the lower bound on $b_K$ and the fact that $\\sum_{i=1}^{K-1} b_i = P_u(B_Q(x') \\neq y' \\wedge m_Q(x') < \\gamma^*)$. \u25a1\n\nIn general, good PAC-Bayesian approximations of $R_u(G_Q)$ are difficult to carry out in supervised learning [4], mostly due to the huge number of instances needed to obtain accurate approximations of the distribution of the absolute values of the margin. In this section we have shown that the transductive setting allows for high precision on the bounds from the risk $R_u(G_Q)$ of the Gibbs classifier to the risk $R_u(B_Q)$ if we suppose that the Bayes classifier makes its errors mostly on low margin regions.\n\n3 Relationship with margin-based self-learning algorithms\n\nIn Proposition 2 we have considered the hypothesis that the margin is an indicator of confidence as one of the sufficient conditions which lead to a tight approximation of the risk of the Bayes classifier, $R_u(B_Q)$. This assumption constitutes the working hypothesis of margin-based self-learning algorithms in which a classifier is first built on the labeled training set. The output of the learner can then be used to assign pseudo-labels to unlabeled examples having a margin above a fixed threshold (denoted by the set $Z_{U\\backslash}$ in what follows) and the supervised method is repeatedly retrained upon the set of the initial labeled and unlabeled examples that have been classified in the previous steps. 
The idea behind this pseudo-labeling is that unlabeled examples having a margin above a threshold are less subject to error-prone labels, or equivalently, are those which have a small conditional Bayes error defined as:\n\n$$R_{u|\\theta}(B_Q) \\stackrel{\\mathrm{def}}{=} P_u(B_Q(x') \\neq y' \\mid m_Q(x') > \\theta) = \\frac{R_{u \\wedge \\theta}(B_Q)}{P_u(m_Q(x') > \\theta)}. \\qquad (18)$$\n\nIn this case the label assignment of unlabeled examples upon a margin criterion has the effect of pushing the decision boundary away from the unlabeled data. This strategy follows the cluster assumption [10] used in the design of some semi-supervised learning algorithms, where the decision boundary is supposed to pass through a region of low pattern density. Though margin-based self-learning algorithms are inductive in essence, their learning phase is closely related to transductive learning, which predicts the labels of a given unlabeled set. Indeed, in both cases the pseudo class-label assignment of unlabeled examples is interrelated to their margin.\nFor all these algorithms the choice of the threshold is a crucial point, as with a low threshold the risk of assigning false labels to examples is high, while a higher value of the threshold would not provide enough examples to enhance the current decision function. In order to examine the effect of fixing the threshold or computing it automatically, we considered the margin-based self-training algorithm proposed by T\u00fcr et al. [11, Figure 6] (referred to as SLA in the following), in which unlabeled examples having margin above a fixed threshold are iteratively added to the labeled set and are not considered in the next rounds for label distribution. In our approach, the best threshold minimizing the conditional Bayes error (18) from equation (6) of theorem 1 is computed at each round of the algorithm (line 3, figure 1 - SLA\u2217) while the threshold is kept fixed in [11, Figure 6] (line 3 is outside of the repeat loop). The bound $R_u^\\delta(G_Q)$ on the risk of the Gibbs classifier, which is involved in the computation of the threshold in equation (18), was fixed to its worst value 0.5.\n\nInput: Labeled and unlabeled training sets: $Z_\\ell$, $X_U$\nInitialize\n(1) Train a classifier $H$ on $Z_\\ell$\n(2) Set $Z_{U\\backslash} \\leftarrow \\emptyset$\nrepeat\n(3) Compute the margin threshold $\\theta^*$ minimizing (18) from (6)\n(4) $S \\leftarrow \\{(x', y') \\mid x' \\in X_U;\\ m_Q(x') \\ge \\theta^* \\wedge y' = \\mathrm{sgn}(H(x'))\\}$\n(5) $Z_{U\\backslash} \\leftarrow Z_{U\\backslash} \\cup S$, $X_U = X_U \\setminus S$\n(6) Learn a classifier $H$ by optimizing a global loss function on $Z_\\ell$ and $Z_{U\\backslash}$\nuntil $X_U$ is empty or there are no additions to $Z_{U\\backslash}$\nOutput: The final classifier $H$\n\nFigure 1: Self-learning algorithm (SLA\u2217)\n\n4 Experiments and Results\n\nIn our experiments, we employed a Boosting algorithm optimizing the following exponential loss (see footnote 2) as the baseline learner (line (6), figure 1):\n\n$$L_c(H, Z_\\ell, Z_{U\\backslash}) = \\frac{1}{l} \\sum_{x \\in Z_\\ell} e^{-y H(x)} + \\frac{1}{|Z_{U\\backslash}|} \\sum_{x' \\in Z_{U\\backslash}} e^{-y' H(x')}, \\qquad (19)$$\n\nFootnote 2: Bennett et al. [1] have shown that the minimization of (19) makes it possible to reach a local minimum of the margin loss function $L_M(H, Z_\\ell, Z_{U\\backslash}) = \\frac{1}{l} \\sum_{x \\in Z_\\ell} e^{-y H(x)} + \\frac{1}{|Z_{U\\backslash}|} \\sum_{x' \\in Z_{U\\backslash}} e^{-|H(x')|}$.\n\nwhere $H = \\sum_t \\alpha_t h_t$ is a linear weighted sum of decision stumps $h_t$ which are uniquely defined by an input feature $j_t \\in \\{1, \\ldots, d\\}$ and a threshold $\\lambda_t$ as:\n\n$$h_t(x) = 2\\,[[\\varphi_{j_t}(x) > \\lambda_t]] - 1,$$\n\nwith $\\varphi_j(x)$ the $j$th feature characteristic of $x$. 
Within this setting, the Gibbs classifier is defined as a random choice from the set of baseline classifiers $\\{h_t\\}_{t=1}^{T}$ according to $Q$ such that $\\forall t,\\ P_Q(h_t) = \\frac{|\\alpha_t|}{\\sum_{t'} |\\alpha_{t'}|}$. Accordingly the Bayes classifier is simply the weighted voting classifier $B_Q = \\mathrm{sign}(H)$. Although the self-learning model (SLA\u2217) is an inductive algorithm, we carried out experiments in a transductive setting in order to compare results with the transductive SVM of Joachims [7] and the self-learning algorithm (SLA) described in [11, Figure 6]. For the latter, after training a classifier $H$ on $Z_\\ell$ (figure 1, step 1) we fixed different margin thresholds considering the lowest and the highest output values of $H$ over the labeled training examples. We evaluated the performance of the algorithms on 4 collections from the benchmark data sets (see footnote 3) used in [3] as well as 2 data sets from the UCI repository [2]. In this case, we chose sets large enough for reasonable labeled/unlabeled partitioning, and that represent binary classification problems. Each experiment was repeated 20 times by partitioning, each time, the data set into two random labeled and unlabeled training sets.\n\nTable 1: Means and standard deviations of the classification error on unlabeled training data over the 20 trials for each data set. d denotes the dimension, l and u refer respectively to the number of labeled and unlabeled examples in each data set.\n\nDataset | d | l+u | l | SLA | SLA\u2217 | TSVM | l | SLA | SLA\u2217 | TSVM\nCOIL2 | 241 | 1500 | 10 | .302\u2193\u00b1.042 | .255\u00b1.019 | .286\u2193\u00b1.031 | 100 | .148\u2193\u00b1.015 | .134\u00b1.011 | .152\u2193\u00b1.016\nDIGIT | 241 | 1500 | 10 | .201\u2193\u00b1.038 | .149\u00b1.012 | .156\u00b1.014 | 100 | .091\u2193\u00b1.01 | .071\u00b1.005 | .087\u2193\u00b1.009\nG241c | 241 | 1500 | 10 | .314\u2193\u00b1.037 | .248\u00b1.018 | .252\u00b1.021 | 100 | .201\u2193\u00b1.017 | .191\u00b1.014 | .196\u00b1.022\nUSPS | 241 | 1500 | 10 | .342\u2193\u00b1.024 | .278\u2193\u00b1.022 | .261\u00b1.019 | 100 | .114\u2193\u00b1.012 | .112\u2193\u00b1.012 | .103\u00b1.011\nPIMA | 8 | 768 | 10 | .379\u2193\u00b1.026 | .305\u00b1.021 | .318\u2193\u00b1.018 | 50 | .284\u2193\u00b1.019 | .266\u00b1.018 | .276\u00b1.021\nWDBC | 30 | 569 | 10 | .168\u2193\u00b1.016 | .124\u00b1.011 | .141\u2193\u00b1.016 | 50 | .112\u2193\u00b1.011 | .079\u00b1.007 | .108\u2193\u00b1.01\n\nFor each data set, means and standard deviations of the classification error on unlabeled training data over the 20 trials are shown in Table 1 for 2 different splits of the labeled and unlabeled sets. The symbol \u2193 indicates that performance is significantly worse than the best result, according to a Wilcoxon rank sum test used at a p-value threshold of 0.01 [8]. In addition, we show in figure 2 the evolution, on the COIL2, DIGIT and USPS data sets, of the classification error and of both risks of the Gibbs classifier (on the labeled and unlabeled training sets) for different numbers of rounds in the SLA\u2217 algorithm. 
These figures are obtained from one of the 20 trials that we ran for these collections.\nThe most important conclusion from these empirical results is that for all data sets, the self-learning algorithm becomes competitive when the margin threshold is found automatically rather than fixed manually. The augmented self-learning algorithm achieves performance statistically better or equivalent to that of TSVM in most cases, while it outperforms the initial method over all runs.\n\nFigure 2: Classification error, train and test Gibbs errors with respect to the iterations of the SLA\u2217 algorithm for a fixed number of labeled training data l = 10.\n\nFootnote 3: http://www.kyb.tuebingen.mpg.de/ssl-book/benchmarks.html\n\nIn SLA\u2217 the automatic choice of the margin threshold has the effect of selecting, in the first rounds of the algorithm, many unlabeled examples whose class labels can be predicted with high confidence by the voted classifier. The exponential fall of the classification error in figure 2 can be explained by the addition of these highly informative pseudo-labeled examples at the first steps of the learning process (figure 1). After this fall, few examples are added because the learning algorithm does not increase the margin on unlabeled data. Hence, the number of additional pseudo-labeled examples decreases, resulting in a plateau in the classification error curves in figure 2. We further notice that the error of the Gibbs classifier on labeled data increases quickly to a stationary error point and that the error on the unlabeled examples does not vary in time.\n\n5 Conclusions\n\nThe contribution of this paper is twofold. 
First, we proposed a bound on the risk of the voted classifier\nusing the margin distribution of unlabeled examples and an estimation of the risk of the Gibbs classifier.\nWe have shown that our bound is a good approximation of the true risk when the errors of the associated\nGibbs classifier can be accurately estimated and the voted classifier makes most of its errors on low-margin\nexamples.\nThe proof of this bound relies on a second bound on the joint probability that the voted classifier\nmakes an error and that the margin is above a given threshold. This tool led to the conditional probability\nthat the voted classifier makes an error given that the margin is above a given threshold. We showed that\na margin threshold minimizing this conditional probability can be found by analytically\nsolving a linear program.\nThis resolution led to our second contribution, which is to find the margin threshold automatically in\na self-learning algorithm. Empirical results on a number of data sets have shown that the adaptive threshold\nenhances the performance of a self-learning algorithm.\n\nReferences\n\n[1] Bennett, K., Demiriz, A. & Maclin, R. (2002) Exploiting unlabeled data in ensemble methods. In Proc. ACM Int.\nConf. Knowledge Discovery and Data Mining, 289-296.\n[2] Blake, C., Keogh, E. & Merz, C.J. (1998) UCI repository of machine learning databases. University of California,\nIrvine. [on-line] http://www.ics.uci.edu/~mlearn/MLRepository.html\n[3] Chapelle, O., Sch\u00f6lkopf, B. & Zien, A. (2006) Semi-supervised learning. Cambridge, MA: MIT Press.\n[4] Derbeko, P., El-Yaniv, R. & Meir, R. (2004) Explicit learning curves for transduction and application to clustering\nand compression algorithms. Journal of Artificial Intelligence Research 22:117-142.\n[5] Dietterich, T.G. (2000) Ensemble Methods in Machine Learning. In First International Workshop on Multiple\nClassifier Systems, 1-15.\n[6] Grandvalet, Y. & Bengio, Y. (2005) Semi-supervised learning by entropy minimization. In Advances in Neural\nInformation Processing Systems 17, 529-536. Cambridge, MA: MIT Press.\n[7] Joachims, T. (1999) Transductive Inference for Text Classification using Support Vector Machines. In Proceedings\nof the 16th International Conference on Machine Learning, 200-209.\n[8] Lehmann, E.L. (1975) Nonparametric Statistical Methods Based on Ranks. McGraw-Hill, New York.\n[9] McAllester, D. (2003) Simplified PAC-Bayesian margin bounds. In Proc. of the 16th Annual Conference on\nLearning Theory, Lecture Notes in Artificial Intelligence, 203-215.\n[10] Seeger, M. (2002) Learning with labeled and unlabeled data. Technical report, Institute for Adaptive and Neural\nComputation, University of Edinburgh.\n[11] T\u00fcr, G., Hakkani-T\u00fcr, D.Z. & Schapire, R.E. (2005) Combining active and semi-supervised learning for spoken\nlanguage understanding. Journal of Speech Communication 45(2):171-186.\n[12] Vittaut, J.-N., Amini, M.-R. & Gallinari, P. (2002) Learning Classification with Both Labeled and Unlabeled Data.\nIn European Conference on Machine Learning, 468-476.\n", "award": [], "sourceid": 676, "authors": [{"given_name": "Massih R.", "family_name": "Amini", "institution": ""}, {"given_name": "Nicolas", "family_name": "Usunier", "institution": null}, {"given_name": "Fran\u00e7ois", "family_name": "Laviolette", "institution": null}]}