{"title": "Theoretical Comparisons of Positive-Unlabeled Learning against Positive-Negative Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1199, "page_last": 1207, "abstract": "In PU learning, a binary classifier is trained from positive (P) and unlabeled (U) data without negative (N) data. Although N data is missing, it sometimes outperforms PN learning (i.e., ordinary supervised learning). Hitherto, neither theoretical nor experimental analysis has been given to explain this phenomenon. In this paper, we theoretically compare PU (and NU) learning against PN learning based on the upper bounds on estimation errors. We find simple conditions when PU and NU learning are likely to outperform PN learning, and we prove that, in terms of the upper bounds, either PU or NU learning (depending on the class-prior probability and the sizes of P and N data) given infinite U data will improve on PN learning. Our theoretical findings well agree with the experimental results on artificial and benchmark data even when the experimental setup does not match the theoretical assumptions exactly.", "full_text": "Theoretical Comparisons of Positive-Unlabeled\nLearning against Positive-Negative Learning\n\nGang Niu1 Marthinus C. du Plessis1 Tomoya Sakai1 Yao Ma3 Masashi Sugiyama2,1\n\n{ gang@ms., christo@ms., sakai@ms., yao@ms., sugi@ }k.u-tokyo.ac.jp\n\n1The University of Tokyo, Japan\n\n2RIKEN, Japan 3Boston University, USA\n\nAbstract\n\nIn PU learning, a binary classi\ufb01er is trained from positive (P) and unlabeled (U) data\nwithout negative (N) data. Although N data is missing, it sometimes outperforms\nPN learning (i.e., ordinary supervised learning). Hitherto, neither theoretical nor\nexperimental analysis has been given to explain this phenomenon. In this paper,\nwe theoretically compare PU (and NU) learning against PN learning based on the\nupper bounds on estimation errors. 
We find simple conditions when PU and NU learning are likely to outperform PN learning, and we prove that, in terms of the upper bounds, either PU or NU learning (depending on the class-prior probability and the sizes of P and N data) given infinite U data will improve on PN learning. Our theoretical findings well agree with the experimental results on artificial and benchmark data even when the experimental setup does not match the theoretical assumptions exactly.

1 Introduction

Positive-unlabeled (PU) learning, where a binary classifier is trained from P and U data, has drawn considerable attention recently [1, 2, 3, 4, 5, 6, 7, 8]. It is appealing not only to academia but also to industry, since, for example, the click-through data automatically collected in search engines are highly PU due to position biases [9, 10, 11]. Although PU learning uses no negative (N) data, it is sometimes even better than PN learning (i.e., ordinary supervised learning, perhaps with class-prior change [12]) in practice. Nevertheless, there is neither theoretical nor experimental analysis of this phenomenon, and it is still an open problem when PU learning is likely to outperform PN learning. We clarify this question in this paper.

Problem settings  For PU learning, there are two problem settings, based on one sample (OS) and two samples (TS) of data respectively. More specifically, let X ∈ R^d and Y ∈ {±1} (d ∈ N) be the input and output random variables, equipped with an underlying joint density p(x, y). In OS [3], a set of U data is sampled from the marginal density p(x). Then, if a data point x is P, this P label is observed with probability c, and x remains U with probability 1 − c; if x is N, this N label is never observed, and x remains U with probability 1. In TS [4], a set of P data is drawn from the positive marginal density p(x | Y = +1) and a set of U data is drawn from p(x).
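The two sampling schemes can be simulated in a few lines. The sketch below is illustrative only; the prior π and the label-observation probability c are hypothetical values, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
pi, c = 0.4, 0.5        # class prior and label-observation probability (hypothetical)
n = 100_000             # size of the sample drawn from p(x) in OS

# One-sample (OS) scheme: draw from p(x, y); a P point keeps its label with
# probability c, and every other point stays unlabeled.
y = rng.random(n) < pi                 # latent labels: P with probability pi
labeled = y & (rng.random(n) < c)      # observed P labels
n_plus, n_u = int(labeled.sum()), int((~labeled).sum())
print(n_plus / (n_plus + n_u))         # concentrates around c * pi = 0.2

# Two-sample (TS) scheme: n_plus and n_u are instead fixed in advance, and the
# P and U samples are drawn independently from p(x | Y = +1) and p(x).
```

This reproduces the relation n+/(n+ + nu) ≈ cπ that distinguishes OS from TS.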
Denote by n+ and nu the sizes of P and U data. As two random variables, they are fully independent in TS, whereas they satisfy n+/(n+ + nu) ≈ cπ in OS, where π = p(Y = +1) is the class-prior probability. Therefore, TS is slightly more general than OS, and we will focus on TS problem settings.

Similarly, consider TS problem settings of PN and NU learning, where a set of N data (of size n−) is sampled from p(x | Y = −1) independently of the P/U data. For PN learning, if we enforce that n+/(n+ + n−) ≈ π when sampling the data, it is ordinary supervised learning; otherwise, it is supervised learning with class-prior change, a.k.a. prior probability shift [12].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In [7], a cost-sensitive formulation for PU learning was proposed, and its risk estimator was proven unbiased if the surrogate loss is non-convex and satisfies a symmetric condition. Therefore, we can naturally compare empirical risk minimizers in PU and NU learning against that in PN learning.

Contributions  We establish risk bounds of three risk minimizers in PN, PU and NU learning for comparisons, in the flavor of statistical learning theory [13, 14]. For each minimizer, we first derive a uniform deviation bound from the risk estimator to the risk using Rademacher complexities (see, e.g., [15, 16, 17, 18]); second, we obtain an estimation error bound; third, if the surrogate loss is classification-calibrated [19], an excess risk bound follows as an immediate corollary. In [7], there was a generalization error bound similar to our uniform deviation bound for PU learning. However, it is based on a tricky decomposition of the risk, where the surrogate losses for risk minimization and risk analysis are different and labels of U data are needed for risk evaluation, so that no further bound is implied.
On the other hand, ours utilizes the same surrogate loss for risk minimization and analysis and requires no label of U data for risk evaluation, so that an estimation error bound is possible.

Our main results can be summarized as follows. Denote by ĝpn, ĝpu and ĝnu the risk minimizers in PN, PU and NU learning. Under a mild assumption on the function class and data distributions,

• Finite-sample case: The estimation error bound of ĝpu is tighter than that of ĝpn whenever π/√n+ + 1/√nu < (1 − π)/√n−, and the bound of ĝnu is tighter than that of ĝpn whenever (1 − π)/√n− + 1/√nu < π/√n+.

• Asymptotic case: Either the limit of the bounds of ĝpu or that of ĝnu (depending on π, n+ and n−) will improve on that of ĝpn, if n+, n− → ∞ in the same order and nu → ∞ faster in order than n+ and n−.

Notice that both results rely only on the constant π and the variables n+, n− and nu; they are simple and independent of the specific forms of the function class and/or the data distributions. The asymptotic case follows from the finite-sample case, which is based on theoretical comparisons of the aforementioned upper bounds on the estimation errors of ĝpn, ĝpu and ĝnu. To the best of our knowledge, this is the first work that compares PU learning against PN learning.

Throughout the paper, we assume that the class-prior probability π is known. In practice, it can be effectively estimated from P, N and U data [20, 21, 22] or only P and U data [23, 24].

Organization  The rest of this paper is organized as follows. Unbiased estimators are reviewed in Section 2. Then in Section 3 we present our theoretical comparisons based on risk bounds.
Finally, experiments are discussed in Section 4.

2 Unbiased estimators to the risk

For convenience, denote by p+(x) = p(x | Y = +1) and p−(x) = p(x | Y = −1) the partial marginal densities. Recall that instead of data sampled from p(x, y), we consider three sets of data X+, X− and Xu, which are drawn from the three marginal densities p+(x), p−(x) and p(x) independently.

Let g : R^d → R be a real-valued decision function for binary classification and ℓ : R × {±1} → R be a Lipschitz-continuous loss function. Denote by

R+(g) = E+[ℓ(g(X), +1)],  R−(g) = E−[ℓ(g(X), −1)]

the partial risks, where E±[·] = E_{X∼p±}[·]. Then the risk of g w.r.t. ℓ under p(x, y) is given by

R(g) = E_{(X,Y)}[ℓ(g(X), Y)] = πR+(g) + (1 − π)R−(g).   (1)

In PN learning, by approximating R(g) based on Eq. (1), we can get the empirical risk estimator

R̂pn(g) = (π/n+) Σ_{xi∈X+} ℓ(g(xi), +1) + ((1 − π)/n−) Σ_{xj∈X−} ℓ(g(xj), −1).

For any fixed g, R̂pn(g) is an unbiased and consistent estimator of R(g), and its convergence rate is of order Op(1/√n+ + 1/√n−) according to the central limit theorem [25], where Op denotes the order in probability.

In PU learning, X− is not available and then R−(g) cannot be directly estimated. However, [7] has shown that we can estimate R(g) without any bias if ℓ satisfies the following symmetric condition:

ℓ(t, +1) + ℓ(t, −1) = 1.   (2)

Specifically, let Ru,−(g) = E_X[ℓ(g(X), −1)] = πE+[ℓ(g(X), −1)] + (1 − π)R−(g) be the risk obtained when U data are regarded as N data. Given Eq. (2), we have E+[ℓ(g(X), −1)] = 1 − R+(g), and hence

R(g) = 2πR+(g) + Ru,−(g) − π.   (3)

By approximating R(g) based on (3) using X+ and Xu, we can obtain

R̂pu(g) = −π + (2π/n+) Σ_{xi∈X+} ℓ(g(xi), +1) + (1/nu) Σ_{xj∈Xu} ℓ(g(xj), −1).

Although R̂pu(g) regards Xu as N data and, when minimized, aims at separating X+ from Xu, it is an unbiased and consistent estimator of R(g) with convergence rate Op(1/√n+ + 1/√nu) [25].

Similarly, in NU learning R+(g) cannot be directly estimated. Let Ru,+(g) = E_X[ℓ(g(X), +1)] = πR+(g) + (1 − π)E−[ℓ(g(X), +1)]. Given Eq. (2), E−[ℓ(g(X), +1)] = 1 − R−(g), and

R(g) = Ru,+(g) + 2(1 − π)R−(g) − (1 − π).   (4)

By approximating R(g) based on (4) using Xu and X−, we can obtain

R̂nu(g) = −(1 − π) + (1/nu) Σ_{xi∈Xu} ℓ(g(xi), +1) + (2(1 − π)/n−) Σ_{xj∈X−} ℓ(g(xj), −1).

On the loss function  In order to train g by minimizing these estimators, it remains to specify the loss ℓ. The zero-one loss ℓ01(t, y) = (1 − sign(ty))/2 satisfies (2) but is non-smooth. [7] proposed to use a scaled ramp loss as the surrogate loss for ℓ01 in PU learning:

ℓsr(t, y) = max{0, min{1, (1 − ty)/2}},

instead of the popular hinge loss, which does not satisfy (2). Let I(g) = E_{(X,Y)}[ℓ01(g(X), Y)] be the risk of g w.r.t. ℓ01 under p(x, y). Then, ℓsr is neither an upper bound of ℓ01, so that I(g) ≤ R(g) is not guaranteed, nor a convex loss, so that it is more difficult to know whether ℓsr is classification-calibrated or not [19].¹ If it is, we are able to control the excess risk w.r.t. ℓ01 by that w.r.t. ℓ.
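As a quick numerical sanity check of these identities, the three estimators can be sketched as follows. This is a 1-D toy simulation: the Gaussian densities, the prior, and the fixed decision function g are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def ell_sr(t, y):
    # scaled ramp loss; it satisfies the symmetry ell(t, +1) + ell(t, -1) = 1
    return np.clip((1 - y * t) / 2, 0.0, 1.0)

t = np.linspace(-3, 3, 101)
assert np.allclose(ell_sr(t, +1) + ell_sr(t, -1), 1.0)   # condition (2)

pi, n = 0.3, 200_000   # class prior and sample sizes, chosen arbitrarily
Xp = rng.normal(+1.0, 1.0, n)                            # sample from p+(x)
Xm = rng.normal(-1.0, 1.0, n)                            # sample from p-(x)
Xu = np.where(rng.random(n) < pi,
              rng.normal(+1.0, 1.0, n), rng.normal(-1.0, 1.0, n))  # from p(x)

g = lambda x: x   # some fixed decision function

R_pn = pi * ell_sr(g(Xp), +1).mean() + (1 - pi) * ell_sr(g(Xm), -1).mean()
R_pu = -pi + 2 * pi * ell_sr(g(Xp), +1).mean() + ell_sr(g(Xu), -1).mean()
R_nu = (-(1 - pi) + ell_sr(g(Xu), +1).mean()
        + 2 * (1 - pi) * ell_sr(g(Xm), -1).mean())
# all three are unbiased estimators of the same R(g), so they nearly coincide
```

For a fixed g, the three empirical values agree up to sampling noise, even though R̂pu never touches the N sample and R̂nu never touches the P sample.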
Here we prove the classification calibration of ℓsr; consequently, it is a safe surrogate loss for ℓ01.

Theorem 1. The scaled ramp loss ℓsr is classification-calibrated (see Appendix A for the proof).

3 Theoretical comparisons based on risk bounds

When learning is involved, suppose we are given a function class G, and let g* = arg min_{g∈G} R(g) be the optimal decision function in G, and let ĝpn = arg min_{g∈G} R̂pn(g), ĝpu = arg min_{g∈G} R̂pu(g), and ĝnu = arg min_{g∈G} R̂nu(g) be arbitrary global minimizers of the three risk estimators. Furthermore, let R* = inf_g R(g) and I* = inf_g I(g) denote the Bayes risks w.r.t. ℓ and ℓ01, where the infimum of g is over all measurable functions.

In this section, we derive and compare risk bounds of the three risk minimizers ĝpn, ĝpu and ĝnu under the following mild assumption on G, p(x), p+(x) and p−(x): there is a constant CG > 0 such that

Rn,q(G) ≤ CG/√n   (5)

for any marginal density q(x) ∈ {p(x), p+(x), p−(x)}, where

Rn,q(G) = E_{X∼q^n} E_σ [ sup_{g∈G} (1/n) Σ_{xi∈X} σi g(xi) ]

is the Rademacher complexity of G for the sampling of size n from q(x) (that is, X = {x1, ..., xn} and σ = {σ1, ..., σn}, with each xi drawn from q(x) and each σi a Rademacher variable) [18]. A special case is covered, namely, sets of hyperplanes with bounded normals and feature maps:

G = {g(x) = ⟨w, φ(x)⟩_H | ‖w‖_H ≤ Cw, ‖φ(x)‖_H ≤ Cφ},   (6)

where H is a Hilbert space with an inner product ⟨·,·⟩_H, w ∈ H is a normal vector, φ : R^d → H is a feature map, and Cw > 0 and Cφ > 0 are constants [26].

¹A loss function ℓ is classification-calibrated if and only if there is a convex, invertible and nondecreasing transformation ψℓ with ψℓ(0) = 0, such that ψℓ(I(g) − inf_g I(g)) ≤ R(g) − inf_g R(g) [19].

3.1 Risk bounds

Let Lℓ be the Lipschitz constant of ℓ in its first parameter. To begin with, we establish the learning guarantee of ĝpu (the proof can be found in Appendix A).

Theorem 2. Assume (2). For any δ > 0, with probability at least 1 − δ,²

R(ĝpu) − R(g*) ≤ 8πLℓ Rn+,p+(G) + 4Lℓ Rnu,p(G) + 2π√(2 ln(4/δ)/n+) + √(2 ln(4/δ)/nu),   (7)

where Rn+,p+(G) and Rnu,p(G) are the Rademacher complexities of G for the sampling of size n+ from p+(x) and the sampling of size nu from p(x). Moreover, if ℓ is a classification-calibrated loss, there exists a nondecreasing ϕ with ϕ(0) = 0, such that with probability at least 1 − δ,

I(ĝpu) − I* ≤ ϕ( R(g*) − R* + 8πLℓ Rn+,p+(G) + 4Lℓ Rnu,p(G) + 2π√(2 ln(4/δ)/n+) + √(2 ln(4/δ)/nu) ).   (8)

In Theorem 2, R(ĝpu) and I(ĝpu) are w.r.t. p(x, y), though ĝpu is trained from two samples following p+(x) and p(x).
We can see that (7) is an upper bound of the estimation error of ĝpu w.r.t. ℓ, whose right-hand side (RHS) is small if G is small; (8) is an upper bound of the excess risk of ĝpu w.r.t. ℓ01, whose RHS also involves the approximation error of G (i.e., R(g*) − R*), which is small if G is large. When G is fixed and satisfies (5), we have Rn+,p+(G) = O(1/√n+) and Rnu,p(G) = O(1/√nu), and then

R(ĝpu) − R(g*) → 0,  I(ĝpu) − I* → ϕ(R(g*) − R*)

in Op(1/√n+ + 1/√nu). On the other hand, when the size of G grows with n+ and nu properly, those complexities of G vanish slower in order than O(1/√n+) and O(1/√nu), but we may have

R(ĝpu) − R(g*) → 0,  I(ĝpu) − I* → 0,

which means ĝpu approaches the Bayes classifier if ℓ is a classification-calibrated loss, in an order slower than Op(1/√n+ + 1/√nu) due to the growth of G.

Similarly, we can derive the learning guarantees of ĝpn and ĝnu for comparisons. We will just focus on estimation error bounds, because excess risk bounds are their immediate corollaries.

Theorem 3. Assume (2). For any δ > 0, with probability at least 1 − δ,

R(ĝpn) − R(g*) ≤ 4πLℓ Rn+,p+(G) + 4(1 − π)Lℓ Rn−,p−(G) + π√(2 ln(4/δ)/n+) + (1 − π)√(2 ln(4/δ)/n−),   (9)

where Rn−,p−(G) is the Rademacher complexity of G for the sampling of size n− from p−(x).

Theorem 4. Assume (2). For any δ > 0, with probability at least 1 − δ,

R(ĝnu) − R(g*) ≤ 4Lℓ Rnu,p(G) + 8(1 − π)Lℓ Rn−,p−(G) + √(2 ln(4/δ)/nu) + 2(1 − π)√(2 ln(4/δ)/n−).   (10)

In order to compare the bounds, we simplify (9), (7) and (10) using Eq. (5). To this end, we define f(δ) = 4LℓCG + √(2 ln(4/δ)). For the special case of G defined in (6), define f(δ) accordingly as f(δ) = 4LℓCwCφ + √(2 ln(4/δ)).

Corollary 5. The estimation error bounds below hold separately, each with probability at least 1 − δ:

R(ĝpn) − R(g*) ≤ f(δ) · {π/√n+ + (1 − π)/√n−},   (11)
R(ĝpu) − R(g*) ≤ f(δ) · {2π/√n+ + 1/√nu},   (12)
R(ĝnu) − R(g*) ≤ f(δ) · {1/√nu + 2(1 − π)/√n−}.   (13)

3.2 Finite-sample comparisons

Note that the three risk minimizers ĝpn, ĝpu and ĝnu work in similar problem settings, and their bounds in Corollary 5 are proven using exactly the same proof technique. The differences in the bounds therefore reflect the intrinsic differences between the risk minimizers. Let us compare those bounds. Define

αpu,pn = (π/√n+ + 1/√nu) / ((1 − π)/√n−),   (14)
αnu,pn = ((1 − π)/√n− + 1/√nu) / (π/√n+).   (15)

Eqs. (14) and (15) constitute our first main result.

²Here, the probability is over repeated sampling of data for training ĝpu, while in Lemma 8, it will be over the sampling for evaluating R̂pu(g).

Table 1: Properties of αpu,pn and αnu,pn.

        |     no specification      |   sizes are proportional  |     ρpn = π/(1 − π)
        | mono. inc. | mono. dec.   | mono. inc. | mono. dec.   | mono. inc. | minimum
αpu,pn  | π, n−      | n+, nu       | π, ρpu     | ρpn          | ρpu        | 2√(ρpu + √ρpu)
αnu,pn  | n+         | π, n−, nu    | ρpn, ρnu   | π            | ρnu        | 2√(ρnu + √ρnu)

Theorem 6 (Finite-sample comparisons). Assume (5) is satisfied. Then the estimation error bound of ĝpu in (12) is tighter than that of ĝpn in (11) if and only if αpu,pn < 1; also, the estimation error bound of ĝnu in (13) is tighter than that of ĝpn if and only if αnu,pn < 1.

Proof. Fix π, n+, n− and nu, and denote by Vpn, Vpu and Vnu the values of the RHSs of (11), (12) and (13). In fact, the definitions of αpu,pn and αnu,pn in (14) and (15) came from

αpu,pn = (Vpu − πf(δ)/√n+) / (Vpn − πf(δ)/√n+),   αnu,pn = (Vnu − (1 − π)f(δ)/√n−) / (Vpn − (1 − π)f(δ)/√n−).

As a consequence, compared with Vpn, Vpu is smaller and (12) is tighter if and only if αpu,pn < 1, and Vnu is smaller and (13) is tighter if and only if αnu,pn < 1.

We analyze some properties of αpu,pn before going to our second main result. The most important property is that it relies on π, n+, n− and nu only; it is independent of G, p(x, y), p(x), p+(x) and p−(x) as long as (5) is satisfied. Next, αpu,pn is obviously a monotonic function of π, n+, n− and nu. Furthermore, it is unbounded no matter whether π is fixed or not. Properties of αnu,pn are similar, as summarized in Table 1.

Implications of the monotonicity of αpu,pn are given as follows. Intuitively, when other factors are fixed, larger nu or n− improves ĝpu or ĝpn respectively. However, it is more subtle why αpu,pn is monotonically decreasing with n+ and increasing with π.
The weight of the empirical average over X+ is 2π in R̂pu(g) but π in R̂pn(g), since in R̂pu(g) the P data also join the estimation of (1 − π)R−(g). This makes X+ more important for R̂pu(g), and thus larger n+ improves ĝpu more than ĝpn. Moreover, (1 − π)R−(g) is directly estimated in R̂pn(g), and the concentration Op((1 − π)/√n−) is better if π is larger, whereas it is indirectly estimated through Ru,−(g) − π(1 − R+(g)) in R̂pu(g), and the concentration Op(π/√n+ + 1/√nu) is worse if π is larger. As a result, when the sample sizes are fixed, ĝpu is more (or less) favorable as π decreases (or increases).

A natural question is what the monotonicity of αpu,pn would be if we enforce n+, n− and nu to be proportional. To answer this question, we assume n+/n− = ρpn, n+/nu = ρpu and n−/nu = ρnu, where ρpn, ρpu and ρnu are certain constants; then (14) and (15) can be rewritten as

αpu,pn = (π + √ρpu) / ((1 − π)√ρpn),   αnu,pn = (1 − π + √ρnu) / (π/√ρpn).

As shown in Table 1, αpu,pn is now increasing with ρpu and decreasing with ρpn. This is because, for instance, when ρpn is fixed and ρpu increases, nu is meant to decrease relatively to n+ and n−.

Finally, the properties will dramatically change if we enforce ρpn = π/(1 − π), which approximately holds in ordinary supervised learning. Under this constraint, we have

αpu,pn = (π + √ρpu) / √(π(1 − π)) ≥ 2√(ρpu + √ρpu),

where the equality is achieved at π̄ = √ρpu/(2√ρpu + 1). Here, αpu,pn decreases with π if π < π̄ and increases with π if π > π̄, though it is not convex in π. Only if nu is sufficiently larger than n+ (e.g., ρpu < 0.04) could αpu,pn < 1 be possible and ĝpu have a tighter estimation error bound.

3.3 Asymptotic comparisons

In practice, we may find that ĝpu is worse than ĝpn and αpu,pn > 1 given X+, X− and Xu. This is probably the consequence especially when nu is not sufficiently larger than n+ and n−. Should we then try to collect much more U data, or just give up PU learning? Moreover, if we are able to have as many U data as possible, is there any solution that would be provably better than PN learning?

We answer these questions by asymptotic comparisons. Notice that each pair (n+, nu) yields a value of the RHS of (12), each (n+, n−) yields a value of the RHS of (11), and consequently each triple (n+, n−, nu) determines a value of αpu,pn. Define the limits of αpu,pn and αnu,pn as

α*pu,pn = lim_{n+,n−,nu→∞} αpu,pn,   α*nu,pn = lim_{n+,n−,nu→∞} αnu,pn.

Recall that n+, n− and nu are independent, and we need two conditions for the existence of α*pu,pn and α*nu,pn: n+ → ∞ and n− → ∞ in the same order, and nu → ∞ faster in order than them. This is a bit stricter than what is necessary, but is consistent with a practical assumption: P and N data are roughly equally expensive, whereas U data are much cheaper than P and N data.
Intuitively, since αpu,pn and αnu,pn measure the relative qualities of the estimation error bounds of ĝpu and ĝnu against that of ĝpn, α*pu,pn and α*nu,pn measure the relative qualities of the limits of those bounds accordingly.

In order to illustrate properties of α*pu,pn and α*nu,pn, assume only nu approaches infinity while n+ and n− stay finite, so that α*pu,pn = π√n−/((1 − π)√n+) and α*nu,pn = (1 − π)√n+/(π√n−). Thus, α*pu,pn · α*nu,pn = 1, which implies α*pu,pn < 1 or α*nu,pn < 1 unless n+/n− = π²/(1 − π)². In principle, this exception should be exceptionally rare, since n+/n− is a rational number whereas π²/(1 − π)² is an arbitrary real number. This argument constitutes our second main result.

Theorem 7 (Asymptotic comparisons). Assume (5) and one set of the conditions below are satisfied:

(a) n+ < ∞, n− < ∞ and nu → ∞. In this case, let α* = π√n−/((1 − π)√n+);

(b) 0 < lim_{n+,n−→∞} n+/n− < ∞ and lim_{n+,n−,nu→∞} (n+ + n−)/nu = 0. In this case, let α* = π/((1 − π)√ρ*pn), where ρ*pn = lim_{n+,n−→∞} n+/n−.

Then, either the limit of the estimation error bounds of ĝpu will improve on that of ĝpn (i.e., α*pu,pn < 1) if α* < 1, or the limit of the bounds of ĝnu will improve on that of ĝpn (i.e., α*nu,pn < 1) if α* > 1. The only exception is n+/n− = π²/(1 − π)² in (a) or ρ*pn = π²/(1 − π)² in (b).

Proof. Note that α* = α*pu,pn in both cases. The proof of case (a) has been given as an illustration of the properties of α*pu,pn and α*nu,pn. The proof of case (b) is analogous.

As a result, when we find that ĝpu is worse than ĝpn and αpu,pn > 1, we should look at α* defined in Theorem 7. If α* < 1, ĝpu is promising and we should collect more U data; if α* > 1 otherwise, we should give up ĝpu, but then ĝnu is promising and we should collect more U data as well. In addition, the gap between α* and one indicates how many U data would be sufficient: if the gap is significant, slightly more U data may be enough; if the gap is slight, significantly more U data may be necessary. In practice, however, U data are cheaper but not free, and we cannot have as many U data as possible. Therefore, ĝpn is still of practical importance given limited budgets.

3.4 Remarks

Theorem 2 relies on a fundamental lemma of the uniform deviation from the risk estimator R̂pu(g) to the risk R(g):

Lemma 8. For any δ > 0, with probability at least 1 − δ,

sup_{g∈G} |R̂pu(g) − R(g)| ≤ 4πLℓ Rn+,p+(G) + 2Lℓ Rnu,p(G) + 2π√(ln(4/δ)/(2n+)) + √(ln(4/δ)/(2nu)).

In Lemma 8, R(g) is w.r.t. p(x, y), though R̂pu(g) is w.r.t. p+(x) and p(x). The Rademacher complexities are also w.r.t. p+(x) and p(x), and they can be bounded easily for G defined in Eq. (6).

Theorems 6 and 7 rely on (5). Thanks to it, we can simplify Theorems 2, 3 and 4.
In fact, (5) holds for not only the special case of G defined in (6), but also the vast majority of discriminative models in machine learning that are nonlinear in parameters, such as decision trees (cf. Theorem 17 in [16]) and feedforward neural networks (cf. Theorem 18 in [16]).

Theorem 2 in [7] is a similar bound of the same order as our Lemma 8. That theorem is based on a tricky decomposition of the risk

E_{(X,Y)}[ℓ(g(X), Y)] = πE+[ℓ̃(g(X), +1)] + E_{(X,Y)}[ℓ̃(g(X), Y)],

where the surrogate loss ℓ̃(t, y) = (2/(y + 3))ℓ(t, y) is not the ℓ used for risk minimization, and labels of Xu are needed for risk evaluation, so that no further bound is implied. Lemma 8 uses the same ℓ as risk minimization and requires no label of Xu for evaluating R̂pu(g), so that it can serve as the stepping stone to our estimation error bound in Theorem 2.

Figure 1: Theoretical and experimental results based on artificial data. Panels: (a) Theo. (nu var.), (b) Expe. (nu var.), (c) Theo. (π var.), (d) Expe. (π var.).

4 Experiments

In this section, we experimentally validate our theoretical findings.

Artificial data  Here, X+, X− and Xu are in R² and drawn from three marginal densities

p+(x) = N(+1_2/√2, I_2),  p−(x) = N(−1_2/√2, I_2),  p(x) = πp+(x) + (1 − π)p−(x),

where N(μ, Σ) is the normal distribution with mean μ and covariance Σ, and 1_2 and I_2 are the all-one vector and the identity matrix of size 2. The test set contains one million data points drawn from p(x, y). The model g(x) = ⟨w, x⟩ + b, with w ∈ R² and b ∈ R, and the scaled ramp loss ℓsr are employed.
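To make this setup concrete, here is a minimal sketch of the artificial-data PU experiment. It is not the authors' implementation: the grid search below is a crude hypothetical stand-in for the non-convex solver of [7], and regularization is omitted:

```python
import numpy as np

rng = np.random.default_rng(2)
d, pi, n_p, n_u = 2, 0.5, 45, 100
mu = np.ones(d) / np.sqrt(2)        # class means are +-1_2/sqrt(2), as above

Xp = rng.normal(size=(n_p, d)) + mu                  # P sample from p+(x)
zu = rng.random(n_u) < pi                            # latent labels of the U sample
Xu = rng.normal(size=(n_u, d)) + np.where(zu[:, None], mu, -mu)

def ell_sr(t, y):
    # scaled ramp loss
    return np.clip((1 - y * t) / 2, 0.0, 1.0)

def R_pu(w, b):
    # empirical PU risk based on Eq. (3); it never touches any N data
    return (-pi + 2 * pi * ell_sr(Xp @ w + b, +1).mean()
            + ell_sr(Xu @ w + b, -1).mean())

# crude grid search over linear models g(x) = <w, x> + b
best = min(((w1, w2, b)
            for w1 in np.linspace(-2, 2, 21)
            for w2 in np.linspace(-2, 2, 21)
            for b in np.linspace(-1, 1, 11)),
           key=lambda p: R_pu(np.array(p[:2]), p[2]))

# evaluate on a large test sample from p(x, y)
zt = rng.random(100_000) < pi
Xt = rng.normal(size=(100_000, d)) + np.where(zt[:, None], mu, -mu)
err = np.mean(((Xt @ np.array(best[:2]) + best[2]) > 0) != zt)
```

Even with this crude optimizer, minimizing R̂pu over P and U data alone yields a misclassification rate in the vicinity of the Bayes error of this Gaussian pair.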
In addition, an ℓ2-regularization is added with the regularization parameter fixed to 10^-3, and there is no hard constraint on ‖w‖2 or ‖x‖2 as in Eq. (6). The solver for minimizing the three regularized risk estimators comes from [7] (refer also to [27, 28] for the optimization technique).

The results are reported in Figure 1. In (a)(b), n+ = 45, n− = 5, π = 0.5, and nu varies from 5 to 200; in (c)(d), n+ = 45, n− = 5, nu = 100, and π varies from 0.05 to 0.95. Specifically, (a) shows αpu,pn and αnu,pn as functions of nu, and (c) shows them as functions of π. For the experimental results, ĝpn, ĝpu and ĝnu were trained based on 100 random samplings for every nu in (b) and π in (d), and means with standard errors of the misclassification rates are shown, as ℓsr is classification-calibrated. Note that the empirical misclassification rates are essentially the risks w.r.t. ℓ01, as there were one million test data; the fluctuations are attributed to the non-convex nature of ℓsr. Also, the curve of ĝpn is not a flat line in (b), since its training data at every nu were exactly the same as the training data of ĝpu and ĝnu for fair experimental comparisons.

In Figure 1, the theoretical and experimental results are highly consistent.
The red and blue curves intersect at nearly the same positions in (a)(b) and in (c)(d), even though the risk minimizers in the experiments were locally optimal and regularized, making our estimation error bounds inexact.

Benchmark data  Table 2 summarizes the specification of the benchmarks, which were downloaded from several sources, including the IDA benchmark repository [29], the UCI machine learning repository, the semi-supervised learning book [30], and the European ESPRIT 5516 project.³ In Table 2, three rows describe the number of features, the number of data, and the ratio of P data according to the true class labels. Given a random sampling of X+, X− and Xu, the test set has all the remaining data if they are fewer than 10^4, or else is drawn uniformly from the remaining data with size 10^4.

For benchmark data, the linear model used for the artificial data is not enough, and its kernel version is employed. Consider training ĝpu for example. Given a random sampling, g(x) = ⟨w, φ(x)⟩ + b is used, where w ∈ R^{n+ + nu}, b ∈ R, and φ : R^d → R^{n+ + nu} is the empirical kernel map [26] based on X+ and Xu for the Gaussian kernel.
The kernel width and the regularization parameter are selected by five-fold cross-validation for each risk minimizer and each random sampling.

3See http://www.raetschlab.org/Members/raetsch/benchmark/ for IDA, http://archive.ics.uci.edu/ml/ for UCI, http://olivier.chapelle.cc/ssl-book/ for the SSL book, and https://www.elen.ucl.ac.be/neural-nets/Research/Projects/ELENA/ for the ELENA project.

Table 2: Specification of benchmark datasets.

            banana  phoneme  magic  image  german  twonorm  waveform  spambase  coil2
dim              2        5     10     18      20       20        21        57    241
size          5300     5404  19020   2086    1000     7400      5000      4597   1500
P ratio       .448     .293   .648   .570    .300     .500      .329      .394   .500

[Figure 1: plots omitted; the panels show αpu,pn and αnu,pn and the misclassification rates (%) of ĝpu, ĝnu and ĝpn as functions of nu and π.]

Figure 2: Experimental results based on benchmark data by varying nu. Panels: (a) Theo., (b) banana, (c) phoneme, (d) magic, (e) image, (f) german, (g) twonorm, (h) waveform, (i) spambase, (j) coil2.

Figure 3: Experimental results based on benchmark data by varying π. Panels: (a) Theo., (b) banana, (c) phoneme, (d) magic, (e) image, (f) german, (g) twonorm, (h) waveform, (i) spambase, (j) coil2.

The results by varying nu and π are reported in Figures 2 and 3 respectively. Similarly to Figure 1, in Figure 2, n+ = 25, n− = 5, π = 0.5, and nu varies from 10 to 300, while in Figure 3, n+ = 25, n− = 5, nu = 200, and π varies from 0.05 to 0.95. Figures 2(a) and 3(a) depict αpu,pn and αnu,pn as functions of nu and π, and all the remaining subfigures depict means with standard errors of the misclassification rates based on 100 random samplings for every nu and π.
The theoretical and experimental results based on benchmarks are still highly consistent.
However, unlike in Figure 1(b), in Figure 2 only the errors of ĝpu decrease with nu, and the errors of ĝnu just fluctuate randomly. This may be because benchmark data are more difficult than artificial data, and hence n− = 5 is not sufficiently informative for ĝnu even when nu = 300. On the other hand, we can see that Figures 3(a) and 1(c) look alike, and so do all the remaining subfigures in Figure 3 and Figure 1(d). Nevertheless, the three intersections in Figure 3(a) are closer together than those in Figure 1(c), as nu = 200 in Figure 3(a) and nu = 100 in Figure 1(c); they would become a single intersection if nu = ∞. In the experimental results, the three curves in Figure 3 are likewise closer than those in Figure 1(d) when π ≥ 0.6, which demonstrates the validity of our theoretical findings.

5 Conclusions

In this paper, we studied a fundamental problem in PU learning, namely, when PU learning is likely to outperform PN learning. Estimation error bounds of the risk minimizers were established for PN, PU and NU learning. We found that, under the very mild assumption (5), the PU (or NU) bound is tighter than the PN bound if αpu,pn in (14) (or αnu,pn in (15)) is smaller than one (cf. Theorem 6), and that either the limit of αpu,pn or that of αnu,pn will be smaller than one if the size of U data increases faster in order than the sizes of P and N data (cf. Theorem 7). We validated our theoretical findings experimentally using one artificial data set and nine benchmark data sets.

Acknowledgments

GN was supported by the JST CREST program and Microsoft Research Asia. MCdP, YM, and MS were supported by the JST CREST program.
TS was supported by JSPS KAKENHI 15J09111.

References

[1] F. Denis. PAC learning from positive statistical queries. In ALT, 1998.
[2] F. Letouzey, F. Denis, and R. Gilleron. Learning from positive and unlabeled examples. In ALT, 2000.
[3] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In KDD, 2008.
[4] G. Ward, T. Hastie, S. Barry, J. Elith, and J. Leathwick. Presence-only data and the EM algorithm. Biometrics, 65(2):554–563, 2009.
[5] C. Scott and G. Blanchard. Novelty detection: Unlabeled data definitely help. In AISTATS, 2009.
[6] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11:2973–3009, 2010.
[7] M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In NIPS, 2014.
[8] M. C. du Plessis, G. Niu, and M. Sugiyama.
Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
[9] G. Dupret and B. Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR, 2008.
[10] N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. In WSDM, 2008.
[11] O. Chapelle and Y. Zhang. A dynamic Bayesian network click model for web search ranking. In WWW, 2009.
[12] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. MIT Press, 2009.
[13] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[14] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures on Machine Learning, pages 169–207. Springer, 2004.
[15] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
[16] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[17] R. Meir and T. Zhang. Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4:839–860, 2003.
[18] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
[19] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[20] M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41, 2002.
[21] M. C. du Plessis and M. Sugiyama.
Semi-supervised learning of class balance under class-prior change by distribution matching. In ICML, 2012.
[22] A. Iyer, S. Nath, and S. Sarawagi. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In ICML, 2014.
[23] M. C. du Plessis, G. Niu, and M. Sugiyama. Class-prior estimation for learning from positive and unlabeled data. In ACML, 2015.
[24] H. G. Ramaswamy, C. Scott, and A. Tewari. Mixture proportion estimation via kernel embedding of distributions. In ICML, 2016.
[25] K.-L. Chung. A Course in Probability Theory. Academic Press, 1968.
[26] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2001.
[27] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In ICML, 2006.
[28] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). In NIPS, 2001.
[29] G. Rätsch, T. Onoda, and K. R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
[30] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.
[31] C. McDiarmid. On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989.
[32] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.