{"title": "The Pessimistic Limits and Possibilities of Margin-based Losses in Semi-supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1790, "page_last": 1799, "abstract": "Consider a classification problem where we have both labeled and unlabeled data available.  We show that for linear classifiers defined by convex margin-based surrogate losses that are decreasing,  it is impossible to construct \\emph{any} semi-supervised approach that is able to guarantee an improvement over the supervised classifier measured by this surrogate loss on the labeled and unlabeled data. For convex margin-based loss functions that also increase, we demonstrate safe improvements \\emph{are} possible.", "full_text": "The Pessimistic Limits and Possibilities of\n\nMargin-based Losses in Semi-supervised Learning\n\nRadboud University, The Netherlands\n\nDelft University of Technology, The Netherlands\n\nJesse H. Krijthe\n\njkrijthe@gmail.com\n\nMarco Loog\n\nUniversity of Copenhagen, Denmark\n\nm.loog@tudelft.nl\n\nAbstract\n\nConsider a classi\ufb01cation problem where we have both labeled and unlabeled data\navailable. We show that for linear classi\ufb01ers de\ufb01ned by convex margin-based sur-\nrogate losses that are decreasing, it is impossible to construct any semi-supervised\napproach that is able to guarantee an improvement over the supervised classi\ufb01er\nmeasured by this surrogate loss on the labeled and unlabeled data. 
For convex\nmargin-based loss functions that also increase, we demonstrate safe improvements\nare possible.\n\n1\n\nIntroduction\n\nSemi-supervised learning has been reported to deliver encouraging results in various settings, e.g.\nfor object detection in computer vision (Rasmus et al., 2015), protein function prediction from\nsequence data (Weston et al., 2005) or prediction of cancer recurrence (Shi & Zhang, 2011) in the\nbio-medical domain and part-of-speech tagging in natural language processing (Elworthy, 1994). In\nother settings, however, using unlabeled data has been shown to lead to a decrease in performance\nwhen compared to the supervised solution (Elworthy, 1994; Cozman & Cohen, 2006). For semi-\nsupervised classi\ufb01ers to be used safely in practice, we may at least want some guarantee that they do\nnot reduce performance compared to their supervised alternatives. Some have attempted to provide\nsuch guarantees either empirically by restrictions on the parameters to be estimated (Loog, 2010)\nor under particular assumptions on the data (Li & Zhou, 2015). In general, however, it is unclear\nfor what classi\ufb01ers one can construct \u2018safe\u2019 semi-supervised approaches that can be expected to not\ndecrease performance, or whether this is at all possible.\n\n1.1 Safety and Pessimism\n\nThis work explores whether and, if so, how we can guarantee unlabeled data to improve, or at least\nnot decrease the performance of a semi-supervised classi\ufb01er in comparison to a supervised classi\ufb01er.\n\u2018Pessimism\u2019 refers to the property that we want this guarantee to hold for every single instantiation of\na problem, even for the worst possible unknown labeling of the unlabeled data. The reason we choose\nsuch a strict criterion is that it is the only criterion that can guarantee (with probability one), that\nperformance degradation will not occur, for the particular dataset one is faced with. 
Therefore, a semi-\nsupervised approach can only be called truly safe if it guarantees non-degradation of performance in\nthis pessimistic sense. Note that the labelings that we will be considering are not as pessimistic as\nthey \ufb01rst appear: because performance is compared to the supervised classi\ufb01er, these labelings will\nbe optimistic with respect to the supervised classi\ufb01er, as will become apparent when we formally\nde\ufb01ne the criterion for safe semi-supervised learning in Equation (3).\nWe compare the performance of the supervised and semi-supervised classi\ufb01er measured on the labeled\nand unlabeled data. This is, strictly speaking, a transductive setting (Joachims, 1999), where one\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fmeasures performance on a speci\ufb01cally de\ufb01ned set of objects, and not a semi-supervised setting\nwhere one measures performance on unseen objects generated from the same distribution as the\ntraining data. There are two reasons this transductive setting is interesting in the context of safe\nsemi-supervised learning. 1. The performance criterion in this setting corresponds to the performance\ncriterion we would observe and optimize for if the labels for all objects would be available. 2. As the\nnumber of unlabeled objects grows, and they start to better represent the distribution of interest in the\ninductive/semi-supervised setting, the limits and possibilities that we derive continue to hold, while\nwe converge to the setting where the marginal distribution of the inputs is assumed to be known that\nis considered in other work on the possibilities of semi-supervised learning (Sokolovska et al., 2008;\nBen-David et al., 2008). 
All in all, while we consider a transductive setting, it should be clear that\nour analysis does provide valid insights into the safe semi-supervised setting as well.\n\n1.2 The Use of Surrogate Losses\n\nImportant in this work, is that we take the view that a semi-supervised version of, for instance,\nlogistic regression is a classi\ufb01er that still attempts to minimize logistic loss, but uses unlabeled data\nto improve its ability to do so. So it should be judged on how well it generalizes in terms of this\nintrinsic loss. If we were to compare performance in terms of some other loss, like the error rate, one\nruns the risk of attributing improvements to the use of unlabeled data that are, in fact, caused by other\nchanges to the classi\ufb01er. For instance, the semi-supervised classi\ufb01er might implicitly use some other\nsurrogate loss that turns out to be better aligned with the loss used for evaluation.\nTherefore, as our de\ufb01nition of performance, we consider the surrogate loss the classi\ufb01er optimizes\nand compare this loss for the supervised and the semi-supervised learner on the combined labeled\nand unlabeled data. The surrogate loss corresponds to the loss one would minimize if we did have\nlabels for the unlabeled objects. Considering the same criterion in the supervised and semi-supervised\ncase aligns the goal of constructing a semi-supervised classi\ufb01er with the one used when constructing\na supervised classi\ufb01er. It avoids con\ufb02ating improved performance based on a change in surrogate\nloss function with improvements gained by the availability of unlabeled data. 
For the same reason\nwe also keep the regularization parameter \ufb01xed in the objective functions of the supervised and\nsemi-supervised classi\ufb01ers.\n\n1.3 Outline\n\nThe main conclusion from our analysis (Theorems 1 and 2) is that for classi\ufb01ers de\ufb01ned by convex\nmargin-based surrogate losses that are decreasing, it is impossible to come up with any semi-\nsupervised approach that is able to guarantee safe improvement. We also consider the case of losses\nthat are not decreasing and in particular study the quadratic loss. We show under what conditions it is\npossible in this case to come up with a semi-supervised classi\ufb01er that provides safe improvements\nover the supervised classi\ufb01er.\nThe rest of this work is structured as follows. We start by introducing margin-based loss functions in\nthe empirical risk minimization framework and the extension to the semi-supervised setting. In this,\nwe only treat binary linear classi\ufb01ers. Though not a real restriction, it does simplify our exposition\nand allows us to focus on the core ideas. In Section 3, we formalize our strict notion of safe semi-\nsupervised learning. We \ufb01rst show that for the class of decreasing loss functions it is impossible to\nderive any semi-supervised learning strategy that is not worse than the supervised classi\ufb01er for all\npossible labelings of the unlabeled data. We then consider the case of soft assignment of unlabeled\nobjects to classes. Here, too, it is impossible to provide a strict improvement guarantee for this class\nof loss functions. We subsequently show for what losses it is possible to get strict improvements. In\nSection 5 we apply the theory to a few well-known loss functions. 
In Section 6 we discuss how these\nresults relate to other results on the (im)possibility of (safe) semi-supervised learning and what the\nimplications of these results are for other safe approaches.\n\n2 Preliminaries\n\nWe consider binary linear classifiers in the empirical risk minimization framework. Let $X$ be an\n$L \\times d$ design matrix of $L$ labeled objects, where each row $x^\\top$ is a $d$-dimensional vector of feature\nvalues corresponding to one labeled object. Let $y \\in \\{-1, +1\\}^L$ be the corresponding label vector.\nThe vector $w \\in \\mathbb{R}^d$ contains the weights defining a linear classifier through $\\mathrm{sign}(x^\\top w)$. We consider\nconvex margin-based surrogate loss functions, which are loss functions of the form $\\phi(y x^\\top w)$. Many\nwell-known classifiers can be described in this way, including support vector machines, least squares\nclassification, least squares support vector machines and logistic regression (Bartlett et al., 2006).\n\n2.1 Empirical Risk Minimization\n\nIn the empirical risk minimization framework a classifier is obtained by minimizing a chosen surrogate\nloss $\\phi$ over a set of training objects plus an optional regularization term $\\Omega$, which we take to be a\nconvex function of $w$:\n\n$$R_\\phi(w, X, y) = \\sum_{i=1}^{L} \\phi(y_i x_i^\\top w) + \\Omega(w) . \\qquad (1)$$\n\nBy minimizing this with respect to $w$ we get a supervised classifier:\n\n$$w_{\\mathrm{sup}} = \\arg\\min_{w} R_\\phi(w, X, y) .$$\n\nIn the semi-supervised setting, we have an additional design matrix corresponding to unlabeled objects\n$X_u$, sized $U \\times d$, with unknown labels $y_u \\in \\{-1, +1\\}^U$. We therefore consider the corresponding\nsemi-supervised risk function:\n\n$$R^{\\mathrm{semi}}_\\phi(w, X, y, X_u, q) = R_\\phi(w, X, y) + \\sum_{i=1}^{U} \\left[ q_i \\phi(x_i^\\top w) + (1 - q_i) \\phi(-x_i^\\top w) \\right] , \\qquad (2)$$\n\nwhere $q \\in [0, 1]^U$ are what we will refer to as responsibilities, indicating the unknown and possibly\n\u2018soft\u2019 membership of each object to a class. 
For instance, if the true labels were known these would\ncorrespond to \u2018hard\u2019 responsibilities $q_{\\mathrm{true}} \\in \\{0, 1\\}^U$ and the semi-supervised risk formulation\nbecomes equal to the supervised risk formulation in Equation (1), where the sum is now over the $L$\nlabeled objects and the $U$ objects for which we did not have a label.\n\n3 Limits of Safe Semi-supervision\n\nEven though we know the true labeling of the unlabeled objects in Equation (2) belongs to some\n$q \\in \\{0, 1\\}^U$, we do not know which one. Now, a semi-supervised procedure $w_{\\mathrm{semi}}$ is safe if it is\nguaranteed to attain a loss on the labeled and unlabeled objects equal to or lower than the supervised\nsolution for all possible labelings of the data, since this is guaranteed to include the true labeling of\nthe unlabeled objects. We first formalize this definition of safety, then consider the cases of hard and\nsoft labeling, and come to our negative results: for many loss functions safe semi-supervision is, in\nfact, not possible. Positive results follow in Section 4.\n\n3.1 Hard Labeling\n\nLet $D_\\phi$ denote the difference in terms of the chosen loss $\\phi$ on a set of objects between a new classifier\n$w$ and the supervised classifier $w_{\\mathrm{sup}}$ for some set of responsibilities for the unlabeled data:\n\n$$D_\\phi(w, w_{\\mathrm{sup}}, X, y, X_u, q) = R^{\\mathrm{semi}}_\\phi(w, X, y, X_u, q) - R^{\\mathrm{semi}}_\\phi(w_{\\mathrm{sup}}, X, y, X_u, q) .$$\n\nThe true unknown labels can, in principle, correspond to any $q \\in \\{0, 1\\}^U$. For a semi-supervised\nclassifier $w_{\\mathrm{semi}}$ to be safe we therefore need that:\n\n$$\\max_{q \\in \\{0,1\\}^U} D_\\phi(w_{\\mathrm{semi}}, w_{\\mathrm{sup}}, X, y, X_u, q) \\le 0 . \\qquad (3)$$\n\nIf the inequality is strict for at least one instantiation of $q$, the semi-supervised solution is different\nfrom the supervised solution and potentially better. Is it possible to construct some semi-supervised\nstrategy that has this guaranteed improvement over the supervised solution for margin-based surrogate\nlosses? 
The following theorem gives a condition under which this strict improvement is never possible.\nTheorem 1. Let $w_{\\mathrm{sup}}$ be a minimizer of $R_\\phi(w, X, y)$ and assume it is unique. If $\\phi$ is a decreasing\nmargin-based loss function, meaning $\\phi(a) \\ge \\phi(b)$ for $a \\le b$, then there is no safe semi-supervised\nprocedure which guarantees Equation (3) while having at least one $q^* \\in \\{0, 1\\}^U$ for\nwhich $D_\\phi(w_{\\mathrm{semi}}, w_{\\mathrm{sup}}, X, y, X_u, q^*) < 0$.\n\nProof. We are going to prove this by contradiction. Assume $D_\\phi(w_{\\mathrm{semi}}, w_{\\mathrm{sup}}, X, y, X_u, q^*) < 0$\nand define $M$ to be $R_\\phi(w_{\\mathrm{semi}}, X, y) - R_\\phi(w_{\\mathrm{sup}}, X, y)$. The latter is the difference in the supervised\nobjective function between the semi-supervised and supervised classifier. Based on our assumption\nwe can now write\n\n$$M + \\sum_{i=1}^{U} \\left[ q_i^* \\left( \\phi(x_i^\\top w_{\\mathrm{semi}}) - \\phi(x_i^\\top w_{\\mathrm{sup}}) \\right) + (1 - q_i^*) \\left( \\phi(-x_i^\\top w_{\\mathrm{semi}}) - \\phi(-x_i^\\top w_{\\mathrm{sup}}) \\right) \\right] < 0 . \\qquad (4)$$\n\nLet $A_i = \\phi(x_i^\\top w_{\\mathrm{semi}}) - \\phi(x_i^\\top w_{\\mathrm{sup}})$ and $B_i = \\phi(-x_i^\\top w_{\\mathrm{semi}}) - \\phi(-x_i^\\top w_{\\mathrm{sup}})$. Since $\\phi$ is decreasing,\neither $A_i \\ge 0$ and $B_i \\le 0$, or $A_i \\le 0$ and $B_i \\ge 0$. Set $q_i^{\\mathrm{new}} = 1$ in the former case and $q_i^{\\mathrm{new}} = 0$ in\nthe latter. Then, when using $q^{\\mathrm{new}}$ instead of $q^*$ in Equation (4), the sum will be non-negative. Also,\n$M > 0$, because $w_{\\mathrm{sup}}$ is the unique minimizer of $R_\\phi(w, X, y)$ and $w_{\\mathrm{semi}} \\neq w_{\\mathrm{sup}}$ (otherwise $D_\\phi$ would be zero\nfor every $q$). We therefore have that\n\n$$D_\\phi(w_{\\mathrm{semi}}, w_{\\mathrm{sup}}, X, y, X_u, q^{\\mathrm{new}}) > 0 ,$$\n\nwhich contradicts Equation (3).\n\nRemark 1. Alternatively, we can drop the requirement that $w_{\\mathrm{sup}}$ is the unique minimizer of\n$R_\\phi(w, X, y)$ by requiring the loss functions to be strictly decreasing.\n\n3.2 Beyond Hard Labelings\n\nIn Equation (3) we considered improvement over all hard labelings of the unlabeled data. 
Alternatively,\nwe could also consider improvements for the larger set of all soft assignments of objects to classes,\ndefining safety to mean\n\n$$\\max_{q \\in [0,1]^U} D_\\phi(w_{\\mathrm{semi}}, w_{\\mathrm{sup}}, X, y, X_u, q) \\le 0 . \\qquad (5)$$\n\nIf there is at least one $q \\in [0, 1]^U$ for which the inequality is strict, the semi-supervised solution is\npotentially better than the supervised solution. There are several reasons why this is an interesting\nrelaxation to consider. First of all, it requires the semi-supervised solution to guarantee improvements\nfor a larger class of responsibilities than just the hard labelings, meaning it becomes more difficult\nto construct a procedure with this property. If a procedure guarantees improvement in this sense, it\nimplies it also works for all possible hard labelings. Secondly, it corresponds to a scenario different\nfrom the hard labeling where there is uncertainty in the labels of the unlabeled objects. And lastly,\nthe convex constraint makes the problem more amenable to analysis.\nThe set of classifiers induced by all different responsibilities turns out to be a useful concept in the\nremainder of this paper.\nDefinition 1. The constraint set $\\mathcal{C}_\\phi$ is the set of all possible classifiers that can be obtained by\nminimizing the semi-supervised loss for any vector of responsibilities $q$ assigned to the unlabeled\ndata, i.e.,\n\n$$\\mathcal{C}_\\phi = \\left\\{ \\arg\\min_{w} R^{\\mathrm{semi}}_\\phi(w, X, y, X_u, q) \\;\\middle|\\; q \\in [0, 1]^U \\right\\} .$$\n\nThe following lemma provides an intermediary step towards our second negative result. It tells us\nthat no strict improvement is possible if the supervised solution is already part of the constraint set.\nLemma 1. If $R_\\phi(w, X, y)$ is strictly convex and $w_{\\mathrm{sup}} \\in \\mathcal{C}_\\phi$, then there is a soft assignment $q^*$ such\nthat for every choice of semi-supervised classifier $w_{\\mathrm{semi}} \\neq w_{\\mathrm{sup}}$, $D_\\phi(w_{\\mathrm{semi}}, w_{\\mathrm{sup}}, X, y, X_u, q^*) > 0$.\n\nProof. 
As $w_{\\mathrm{sup}} \\in \\mathcal{C}_\\phi$, there is a soft labeling $q^*$ such that $w_{\\mathrm{sup}}$ minimizes the semi-supervised\nrisk $R^{\\mathrm{semi}}_\\phi(w, X, y, X_u, q^*)$. This risk function is strictly convex because the supervised risk is\nstrictly convex and therefore $w_{\\mathrm{sup}}$ is its unique minimizer. This immediately implies that for every\n$w_{\\mathrm{semi}} \\neq w_{\\mathrm{sup}}$, we have that $R^{\\mathrm{semi}}_\\phi(w_{\\mathrm{semi}}, X, y, X_u, q^*) > R^{\\mathrm{semi}}_\\phi(w_{\\mathrm{sup}}, X, y, X_u, q^*)$.\n\nFor decreasing margin-based losses, we now show that we can always explicitly construct a $q^*$, such\nthat the corresponding semi-supervised solution equals the original supervised one. With this, a result\nsimilar to Theorem 1 for the soft-assignment guarantee directly follows, but first we formulate that\nexplicit construction of the necessary soft labeling.\nLemma 2. If $\\phi$ is a decreasing convex margin-based loss function where for each unlabeled object\n$x$, the derivatives $\\phi'(x^\\top w_{\\mathrm{sup}})$ and $\\phi'(-x^\\top w_{\\mathrm{sup}})$ exist, we can recover $w_{\\mathrm{sup}}$ by minimizing the\nsemi-supervised loss by assigning responsibilities $q \\in [0, 1]^U$ as\n\n$$q = \\frac{\\phi'(-x^\\top w_{\\mathrm{sup}})}{\\phi'(x^\\top w_{\\mathrm{sup}}) + \\phi'(-x^\\top w_{\\mathrm{sup}})} , \\qquad (6)$$\n\nif $\\phi'(x^\\top w_{\\mathrm{sup}}) + \\phi'(-x^\\top w_{\\mathrm{sup}}) \\neq 0$, and any $q \\in [0, 1]$ otherwise.\nProof. Consider the case where we have one unlabeled object $x$ with responsibility $q$. The semi-supervised\nobjective then becomes\n\n$$R^{\\mathrm{semi}}_\\phi(w) = R_\\phi(w, X, y) + q \\phi(x^\\top w) + (1 - q) \\phi(-x^\\top w) .$$\n\nSince $\\phi$ is convex, to guarantee that $w_{\\mathrm{sup}}$ is still a global minimizer of $R^{\\mathrm{semi}}_\\phi$, we need to find a\n$q \\in [0, 1]$ that causes the gradient of this objective, evaluated in $w_{\\mathrm{sup}}$, to remain equal to zero:\n\n$$\\nabla_w R^{\\mathrm{semi}}_\\phi(w_{\\mathrm{sup}}) = 0 + q \\phi'(x^\\top w_{\\mathrm{sup}}) x - (1 - q) \\phi'(-x^\\top w_{\\mathrm{sup}}) x = 0 , \\qquad (7)$$\n\nwhere $\\phi'$ denotes the derivative of $\\phi(a)$ with respect to $a$. As long as $\\phi'(x^\\top w_{\\mathrm{sup}}) + \\phi'(-x^\\top w_{\\mathrm{sup}}) \\neq 0$,\nwe can explicitly solve for $q$ to get\n\n$$q = \\frac{\\phi'(-x^\\top w_{\\mathrm{sup}})}{\\phi'(x^\\top w_{\\mathrm{sup}}) + \\phi'(-x^\\top w_{\\mathrm{sup}})} . \\qquad (8)$$\n\nIf $\\phi$ is a decreasing loss, then\n\n$$\\phi'(a) \\le 0$$\n\nand for each object $0 \\le q \\le 1$. 
If $\\phi'(x^\\top w_{\\mathrm{sup}}) + \\phi'(-x^\\top w_{\\mathrm{sup}}) = 0$, because $\\phi$ is decreasing, we\nknow both $\\phi'(x^\\top w_{\\mathrm{sup}}) = 0$ and $\\phi'(-x^\\top w_{\\mathrm{sup}}) = 0$ and so any $q$ satisfies Equation (7), including\nthose with $0 \\le q \\le 1$. Since $0 \\le q \\le 1$ for each object individually, we can do it for all objects to get a vector\nof responsibilities $q \\in [0, 1]^U$.\nNow that we have shown by a constructive argument that for decreasing margin-based losses it always\nholds that $w_{\\mathrm{sup}} \\in \\mathcal{C}_\\phi$, the following result is straightforward.\nTheorem 2. Let $\\phi$ be a decreasing convex margin-based loss function and $w_{\\mathrm{sup}}$ be the unique\nminimizer of a strictly convex $R_\\phi(w, X, y)$ and suppose for each unlabeled object $x$, the derivatives\n$\\phi'(x^\\top w_{\\mathrm{sup}})$ and $\\phi'(-x^\\top w_{\\mathrm{sup}})$ exist. There is no semi-supervised classifier $w_{\\mathrm{semi}}$ for which\nEquation (5) holds, while having at least one $q^*$ for which $D_\\phi(w_{\\mathrm{semi}}, w_{\\mathrm{sup}}, X, y, X_u, q^*) < 0$.\n\nProof. This follows directly from Lemma 1 and Lemma 2.\nRemark 2. The requirement to have a strictly convex supervised risk function can be relaxed. What\nwe basically need in the proof is that $w_{\\mathrm{sup}}$ is the unique optimizer for $R^{\\mathrm{semi}}_\\phi(w, X, y, X_u, q^*)$.\nNevertheless, unlike, for instance, a hinge loss that is not regularized by something like a 2-norm of\nthe weight vector, many interesting objective functions are strictly convex.\nThis result means that for decreasing loss functions it is impossible to construct a semi-supervised\nlearner that is different from the supervised learner and, in terms of its surrogate loss on the full\ntraining data, is never outperformed by the supervised solution. In other words, if the semi-supervised\nclassifier is taken to be different from the supervised classifier, there is always the risk that there is a\ntrue labeling of the unlabeled data for which the loss of the semi-supervised learner on the full data\nbecomes larger than the loss of the supervised one.\nIs it unexpected that semi-supervised learning cannot be done safely in this strict setting? 
To those for whom\nit is not, it may then come as a surprise that there are margin-based losses for which it is actually\npossible to construct safe semi-supervised learners.\n\n4 Possibilities for Safe SSL\n\nIf we look beyond the decreasing losses, and consider those that can increase as well, we may yet\nbe able to get a classifier that is guaranteed to be better than the supervised solution in terms of the\nsurrogate loss, even in the pessimistic regime. When can we expect safe semi-supervised learning to\nallow for improvements of its supervised counterpart? And if improvements are possible, how then\ndo we construct an actual classifier that does so in a safe way?\nTo construct a semi-supervised learner that at least is guaranteed to never be worse, we need to find\n$w_{\\mathrm{semi}}$, the $w$ that minimizes $D_\\phi(w, w_{\\mathrm{sup}}, X, y, X_u, q)$ for all possible $q$. This corresponds, more\nprecisely, to the following minimax problem:\n\n$$\\min_{w} \\max_{q \\in [0,1]^U} D_\\phi(w, w_{\\mathrm{sup}}, X, y, X_u, q) . \\qquad (9)$$\n\nThis is a formulation similar to the one used by Loog (2016), where instead of margin-based losses,\nthe loss functions are log-likelihoods of a generative model. It is clear that the value of Equation (9) can never be\nlarger than 0. This simply indicates that we can always find a semi-supervised learner that is at least\nas good as the supervised one, by simply sticking to the supervised solution. To show that we can do\nbetter than that, consider the following.\nIf $R^{\\mathrm{semi}}_\\phi$ is convex in $w$, then since the objective is linear in $q$ and $[0, 1]^U$ is a compact space we can\ninvoke (Sion, 1958, Corollary 3.3), which states that the value of the minimax problem is equal to the\nvalue of the maximin problem:\n\n$$\\max_{q \\in [0,1]^U} \\min_{w} D_\\phi(w, w_{\\mathrm{sup}}, X, y, X_u, q) . \\qquad (10)$$\n\nAssume the function $D_\\phi$ is strictly convex in $w$ for every fixed $q$. Now suppose $w_{\\mathrm{sup}}$ is not in $\\mathcal{C}_\\phi$. 
In\nthat case, the inner minimization in Equation (10) is always strictly smaller than 0 for each $q$ because\nof the strict convexity of the loss. This means that Equation (10) is strictly smaller than 0 and in turn\nthe same holds for Equation (9).\nSo, if $w_{\\mathrm{sup}} \\notin \\mathcal{C}_\\phi$, $w_{\\mathrm{semi}}$ will strictly improve upon $w_{\\mathrm{sup}}$.\n\n4.1 Some Sufficient Conditions\n\nAll that is required to show that the procedure just described leads to an improved classifier is\nthat $w_{\\mathrm{sup}} \\notin \\mathcal{C}_\\phi$. For an unlabeled data set consisting of a single sample $x$, this is readily\ndone by reconsidering the proof of Lemma 2 and the argument in the previous paragraph. In particular,\nrewriting Equation (7), we can conclude the following:\nLemma 3. Let $X_u = x^\\top$, let $\\phi$ be a margin-based loss function for which the derivatives $\\phi'(x^\\top w_{\\mathrm{sup}})$\nand $\\phi'(-x^\\top w_{\\mathrm{sup}})$ exist, and let $R^{\\mathrm{semi}}_\\phi$ be strictly convex. If there is no $q \\in [0, 1]$ such that\n\n$$\\left( \\phi'(x^\\top w_{\\mathrm{sup}}) + \\phi'(-x^\\top w_{\\mathrm{sup}}) \\right) x \\, q = \\phi'(-x^\\top w_{\\mathrm{sup}}) \\, x ,$$\n\nthen $w_{\\mathrm{sup}} \\notin \\mathcal{C}_\\phi$, so $w_{\\mathrm{semi}}$ has to be different from $w_{\\mathrm{sup}}$ and, therefore, the former has to improve\nover the latter.\n\nThe case in which $U > 1$ turns out to be hard to fully characterize. Again starting from Equation (7),\nwe can state that if there is no $q \\in [0, 1]^U$ such that\n\n$$\\sum_{i=1}^{U} \\left[ q_i \\phi'(x_i^\\top w_{\\mathrm{sup}}) x_i - (1 - q_i) \\phi'(-x_i^\\top w_{\\mathrm{sup}}) x_i \\right] = 0 ,$$\n\nthen the gradient evaluated in the supervised solution of the objective function over all training data\nis not zero and so the semi-supervised solution is different, therefore improving over the supervised\nsolution. But this result is hardly insightful. For one, it is unclear whether this happens at all when $U > 1$.\nWe do, however, have a sufficient condition that leads the semi-supervised learner to improve over\nthe supervised counterpart. For this, we consider convex, margin-based losses $\\phi$ that are decreasing\nto the left of 1 and start to increase to the right of 1, as is the case, for instance, for the quadratic or\nabsolute loss. 
So these losses increasingly penalize overestimation of the label value of every object.\n\nTable 1: Margin-based loss functions and their corresponding responsibilities\n\nName | $\\phi(y x^\\top w_{\\mathrm{sup}})$ | $q(x^\\top w_{\\mathrm{sup}})$ | Range\nLogistic | $\\sqrt{2} \\log(1 + \\exp(-y x^\\top w_{\\mathrm{sup}}))$ | $(1 + \\exp(-x^\\top w_{\\mathrm{sup}}))^{-1}$ | $(0, 1)$\nHinge | $\\max(1 - y x^\\top w_{\\mathrm{sup}}, 0)$ | $\\frac{1}{2}$ if $-1 < x^\\top w_{\\mathrm{sup}} < 1$; $1$ if $x^\\top w_{\\mathrm{sup}} > 1$; $0$ if $x^\\top w_{\\mathrm{sup}} < -1$ | $\\{0, \\frac{1}{2}, 1\\}$\nExponential | $\\exp(-y x^\\top w_{\\mathrm{sup}})$ | $\\frac{\\exp(x^\\top w_{\\mathrm{sup}})}{\\exp(-x^\\top w_{\\mathrm{sup}}) + \\exp(x^\\top w_{\\mathrm{sup}})}$ | $(0, 1)$\nQuadratic | $(1 - y x^\\top w_{\\mathrm{sup}})^2$ | $\\frac{1}{2} (x^\\top w_{\\mathrm{sup}} + 1)$ | $(-\\infty, \\infty)$\nAbsolute | $|1 - y x^\\top w_{\\mathrm{sup}}|$ | $\\frac{1}{2}$ if $-1 < x^\\top w_{\\mathrm{sup}} < 1$; no solution otherwise | $\\{\\frac{1}{2}\\}$\n\nTheorem 3. Let\n\n$$\\phi'(a) \\begin{cases} \\le 0, & \\text{if } a \\le 1 \\\\ > 0, & \\text{if } a > 1, \\end{cases}$$\n\nand $R^{\\mathrm{semi}}_\\phi$ be strictly convex. If, for all $x \\in X_u$, $|x^\\top w_{\\mathrm{sup}}|$ is larger than 1, then $w_{\\mathrm{semi}} \\neq w_{\\mathrm{sup}}$. That\nis, we get an improved semi-supervised estimator if all points in $X_u$ are outside of the margin.\nThe restriction that all points should be outside of the margin is, of course, rather strong. But,\nas indicated, the requirement is only sufficient and certainly not necessary. The proof, as well as\nan alternative condition for improvement for the quadratic loss, are provided in the supplementary\nmaterial.\n\n5 Examples\n\nTable 1 shows the implied responsibilities $q(x^\\top w_{\\mathrm{sup}})$ for loss functions corresponding to a number\nof well-known classifiers. The table contains both examples of decreasing losses and losses that also\nstrictly increase. For the first group, the responsibilities always lie in $[0, 1]$,\nmeaning the (partial) labels of the unlabeled data can always be set in such a way that the supervised\nsolution is obtained from the semi-supervised objective function. This in turn implies that no safe\nsemi-supervised method exists for these losses. This shows, for instance, that it is not possible to\nconstruct a safe semi-supervised version of the support vector machine or of logistic regression. 
In\nthe second case (for quadratic and absolute losses) it is not always possible to set the responsibilities\nin such a way as to recover the supervised solution and a safe semi-supervised classifier is sometimes\npossible.\nA more thorough description of these examples, as well as a more precise characterization of when\nto expect improvements in case of the quadratic loss, is provided in the supplementary material.\n\n6 Discussion\n\nAs Seeger (2001) and others have argued, for diagnostic methods, where p(y|x) gets modeled directly\nand not through modeling the joint distribution p(y, x), semi-supervised learning without additional\nassumptions should be impossible because the parameters of p(y|x) and p(x) are a priori independent.\nConsidering why these methods do not allow for safe semi-supervised versions offers a different\nunderstanding of why this claim may or may not be true. While our results applied to logistic\nregression corroborate their claim, the quadratic loss provides a counterexample. This shows that\nfor losses that strictly increase over some interval, even safe improvements can be possible in the\ndiagnostic setting. One important strength of our analysis is that we also consider the minimization\nof loss functions that may not induce a correct probability. It is whether the loss is decreasing,\nrather than whether it corresponds to a probabilistic model, that determines whether safe semi-supervised\nlearning is possible. Moreover, some of the losses for which safe semi-supervised learning is possible\nare successfully applied in supervised learning in practice and it is therefore interesting that safe\nsemi-supervised versions exist.\n\nOur results also might seem to contradict the result by Sokolovska et al. 
(2008) (and, by extension\nKawakita & Takeuchi (2014)) that, when the supervised model is misspeci\ufb01ed, a particular semi-\nsupervised adaptation of logistic regression has an asymptotic variance that is at least as small as\nsupervised logistic regression. In this work, however, we cover the pessimistic setting where a\nsemi-supervised learner needs to outperform the supervised learner for all possible labelings in a\n\ufb01nite sample setting. This is a much stricter requirement than the asymptotic result in (Sokolovska\net al., 2008).\nThe (negative) result presented here is in line with the conclusions of Ben-David et al. (2008), who\nshow that the worst-case sample complexity of a supervised learner is at most a constant factor higher\nthan that of any semi-supervised approach for a classi\ufb01er over the real line, and they conjecture\nthis result holds in general. Darnst\u00e4dt et al. (2013) prove that a slightly altered and more precise\nformulation of this conjecture holds when hypothesis classes have \ufb01nite VC-dimension, while they\nshow that it does not hold for more complex hypothesis classes. Whereas these works consider\ngeneralization bounds on the error rate in the PAC learning framework, in our work, we considered\na more conservative or pessimistic setting of safe semi-supervised learning, while considering\nperformance on a \ufb01nite sample in terms of the surrogate loss. This leads to an alternative explanation\nwhy these (strict) improvements are not possible for some losses, similar to the claim in Ben-David\net al. (2008).\nIt also leads, however, to the contrasting conclusion that for some losses, these\nimprovements are possible (even when the VC dimension is \ufb01nite), which contradicts the claim of\nBen-David et al. 
(2008) that improvements are not possible unless strong assumptions about the\ndistribution of the labels are made.\nThe improvement guarantee, in terms of classification accuracy, of the safe semi-supervised SVM\nby Li & Zhou (2015) depends on the assumption that the true labeling of the objects is given by\none of the low-density separators that their algorithm finds. In our analysis we avoid making such\nassumptions. The consequence of this is that all possible labelings have to be considered, not just\nthose corresponding to a low-density separator. If their low-density assumption holds, their method\nprovides one way of making use of this information to guarantee safe improvements. As we have\ndemonstrated, however, in a worst-case sense no such guarantees can be given, at least in terms of the\nsemi-supervised objective considered in our work. Without making these untestable assumptions, our\nresults show a safe semi-supervised support vector machine is impossible.\nFor loss functions that are strictly increasing over some interval, safe improvement is possible. One\ncould ascribe this fact to a peculiar property of these losses: they give increasingly higher loss even\nif the sign of the decision function is correct. The improvements in terms of the loss that we get\nmay therefore not be useful for classification, since they may be in a part of the loss function where\nthe surrogate loss already forms a bad approximation to the {0, 1}-loss. In the supervised case,\nhowever, surrogate losses like the quadratic loss generally give decent performance in terms of the\nerror rate, e.g. competitive with SVMs (Rifkin et al., 2003). 
It is therefore not surprising either that\nits pessimistic semi-supervised counterpart has also shown increased performance (Krijthe & Loog,\n2017a,b).\n\n7 Conclusion\n\nWe have shown that for the class of convex margin-based losses, whether or not they are decreasing\nplays a key role in whether they admit safe semi-supervised procedures. In particular, we have\nshown that, without making additional assumptions, it is impossible to construct safe semi-supervised\nprocedures for decreasing losses. We did so by deriving which partial assignment of the unlabeled objects leads to\nthe recovery of the supervised classifier from a semi-supervised objective. This subsequently implied\nthat if we choose any semi-supervised procedure that deviates from the supervised solution, there\nis some labeling of the unlabeled objects (which could be the true labeling) for which it decreases\nperformance. While this means that for many supervised procedures it is impossible to construct a\nsafe semi-supervised learner in this strict sense, some losses do admit such solutions. A less strict\nguarantee might admit performance improvement by aiming for semi-supervised solutions that, in\nexpectation rather than on any particular dataset, outperform their supervised counterparts.\nThe stark reality is that if one sticks to strictly safe semi-supervised learning, besides opportunities\nfor some surrogate losses, there are clear limits to the development of such procedures.\n\nAcknowledgements\n\nWe thank Alexander Mey for his constructive feedback on an earlier version of this manuscript. This\nwork was funded by Project P23 of the Dutch COMMIT research programme.\n\nReferences\n\nBartlett, Peter L, Jordan, Michael I., and McAuliffe, Jon D. Convexity, Classification, and Risk\nBounds. Journal of the American Statistical Association, 101(473):138\u2013156, 2006.\n\nBen-David, Shai, Lu, Tyler, and P\u00e1l, David. Does Unlabeled Data Provably Help? 
Worst-case\nAnalysis of the Sample Complexity of Semi-Supervised Learning. In Proceedings of the 21st\nAnnual Conference on Learning Theory, pp. 33\u201344, 2008.\n\nCozman, F and Cohen, Ira. Risks of Semi-Supervised Learning. In Chapelle, Olivier, Sch\u00f6lkopf,\nBernhard, and Zien, A (eds.), Semi-Supervised Learning, chapter 4, pp. 56\u201372. MIT Press, 2006.\n\nDarnst\u00e4dt, Malte, Simon, HU, and Sz\u00f6r\u00e9nyi, Bal\u00e1zs. Unlabeled Data Does Provably Help. In 30th\nInternational Symposium on Theoretical Aspects of Computer Science, pp. 185\u2013196, 2013.\ndoi: 10.4230/LIPIcs.STACS.2013.185.\n\nElworthy, David. Does Baum-Welch re-estimation help taggers? In Proceedings of the 4th Conference\non Applied Natural Language Processing, pp. 53\u201358, 1994.\n\nFung, Glenn and Mangasarian, Olvi L. Proximal Support Vector Machine Classifiers. In KDD \u201901\nProceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and\ndata mining, pp. 77\u201386, 2001. doi: 10.1145/502512.502527.\n\nHastie, Trevor, Tibshirani, Robert, and Buja, Andreas. Flexible Discriminant Analysis by Optimal\nScoring. Journal of the American Statistical Association, 89(428):1255\u20131270, 1994.\nISSN 0162-1459. doi: 10.1080/01621459.1994.10476866.\n\nJoachims, Thorsten. Transductive inference for text classification using support vector machines. In\nProceedings of the 16th International Conference on Machine Learning, pp. 200\u2013209. Morgan\nKaufmann Publishers, 1999.\n\nKawakita, Masanori and Takeuchi, Jun\u2019ichi. Safe semi-supervised learning based on weighted\nlikelihood. Neural Networks, 53:146\u2013164, may 2014. doi: 10.1016/j.neunet.2014.01.016.\n\nKrijthe, Jesse Hendrik and Loog, Marco. Robust Semi-supervised Least Squares Classification by\nImplicit Constraints. Pattern Recognition, 63:115\u2013126, 2017a. ISSN 00313203.\ndoi: 10.1016/j.patcog.2016.09.009.\n\nKrijthe, Jesse Hendrik and Loog, Marco. 
Projected Estimators for Robust Semi-supervised Classification.\nMachine Learning, 106(7):993\u20131008, 2017b. doi: 10.1007/s10994-017-5626-8.\n\nLi, Yu-Feng and Zhou, Zhi-Hua. Towards Making Unlabeled Data Never Hurt. IEEE Transactions\non Pattern Analysis and Machine Intelligence, 37(1):175\u2013188, jan 2015.\ndoi: 10.1109/TPAMI.2014.2299812.\n\nLoog, Marco. Constrained Parameter Estimation for Semi-Supervised Learning: The Case of\nthe Nearest Mean Classifier. In Machine Learning and Knowledge Discovery in Databases\n(Lecture Notes in Computer Science Volume 6322), pp. 291\u2013304. Springer, 2010.\ndoi: 10.1007/978-3-642-15883-4_19.\n\nLoog, Marco. Contrastive Pessimistic Likelihood Estimation for Semi-Supervised Classification.\nIEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):462\u2013475, 2016.\n\nPoggio, Tomaso and Smale, Steve. The Mathematics of Learning: Dealing with Data. Notices of the\nAMS, 50(5):537\u2013544, 2003.\n\nRasmus, Antti, Valpola, Harri, Honkala, Mikko, Berglund, Mathias, and Raiko, Tapani.\nSemi-supervised learning with Ladder Networks. In Advances in Neural Information Processing Systems,\npp. 3546\u20133554, 2015.\n\nRasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning.\nMIT Press, apr 2005. ISBN 9780262182539.\n\nRifkin, Ryan, Yeo, Gene, and Poggio, Tomaso. Regularized least-squares classification. In Suykens,\nJohan A. K., Horvath, Gabor, Basu, Sankar, Micchelli, Charles, and Vandewalle, Joos (eds.), Nato\nScience Series Sub Series III Computer and Systems Sciences 190, pp. 131\u2013154. IOS Press, 2003.\n\nSeeger, Matthias. Learning with labeled and unlabeled data. Technical report, 2001.\n\nShi, Mingguang and Zhang, Bing. Semi-supervised learning improves gene expression-based\nprediction of cancer recurrence. Bioinformatics, 27(21):3017\u20133023, 2011.\n\nSion, Maurice. On general minimax theorems. Pacific J. 
Math, 8(1):171\u2013176, 1958.\n\nSokolovska, Nataliya, Capp\u00e9, Olivier, and Yvon, Francois. The asymptotics of semi-supervised\nlearning in discriminative probabilistic models. In Cohen, William W., McCallum, Andrew, and\nRoweis, Sam T. (eds.), Proceedings of the 25th International Conference on Machine Learning, pp.\n984\u2013991, Helsinki, Finland, 2008. ACM Press.\n\nSuykens, Johan A. K. and Vandewalle, J. Least Squares Support Vector Machine Classifiers. Neural\nProcessing Letters, 9:293\u2013300, 1999.\n\nWeston, Jason, Leslie, Christina, Ie, Eugene, Zhou, Dengyong, Elisseeff, Andre, and Noble,\nWilliam Stafford. Semi-supervised protein classification using cluster kernels. Bioinformatics,\n21(15):3241\u20137, aug 2005. ISSN 1367-4803. doi: 10.1093/bioinformatics/bti497.\n", "award": [], "sourceid": 902, "authors": [{"given_name": "Jesse", "family_name": "Krijthe", "institution": "Radboud University Nijmegen"}, {"given_name": "Marco", "family_name": "Loog", "institution": "Delft University of Technology"}]}