{"title": "Learning from Label Proportions with Generative Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7169, "page_last": 7179, "abstract": "In this paper, we leverage generative adversarial networks (GANs) to derive an effective algorithm LLP-GAN for learning from label proportions (LLP), where only the bag-level proportional information in labels is available. Endowed with end-to-end structure, LLP-GAN performs approximation in the light of an adversarial learning mechanism, without imposing restricted assumptions on distribution. Accordingly, we can directly induce the final instance-level classifier upon the discriminator. Under mild assumptions, we give the explicit generative representation and prove the global optimality for LLP-GAN. Additionally, compared with existing methods, our work empowers LLP solver with capable scalability inheriting from deep models. Several experiments on benchmark datasets demonstrate vivid advantages of the proposed approach.", "full_text": "Learning from Label Proportions with\n\nGenerative Adversarial Networks\n\nSamsung Research China - Beijing\n\nUniversity of International Business and Economics\n\nBo Wang\u2217\n\nBeijing 100029, China\nwangbo@uibe.edu.cn\n\nJiabin Liu\u2217\n\nBeijing 100028, China\n\nliujiabin008@126.com\n\nZhiquan Qi\u2020\n\nYingjie Tian\n\nYong Shi\n\nUniversity of Chinese Academy of Sciences\n\nBeijing 100190, China\n\nqizhiquan@foxmail.com, {tyj,yshi}@ucas.ac.cn\n\nAbstract\n\nIn this paper, we leverage generative adversarial networks (GANs) to derive an\neffective algorithm LLP-GAN for learning from label proportions (LLP), where\nonly the bag-level proportional information in labels is available. 
Endowed with an end-to-end structure, LLP-GAN performs approximation in the light of an adversarial learning mechanism, without imposing restrictive assumptions on the data distribution. Accordingly, we can directly induce the final instance-level classifier from the discriminator. Under mild assumptions, we give the explicit generative representation and prove the global optimality of LLP-GAN. Additionally, compared with existing methods, our work endows the LLP solver with the scalability inherited from deep models. Several experiments on benchmark datasets demonstrate the clear advantages of the proposed approach.

1 Introduction

Deep learning benefits from an end-to-end design philosophy, which emphasizes minimal a priori representational and computational assumptions and largely avoids explicit structure and "hand-engineering" [4]. Doubtless, most of its achievements rest on access to an abundance of fully supervised data [14, 13, 26]. One reason is that a large amount of complete data can alleviate the over-fitting problem without reducing the complexity of the hypothesis set (e.g., a sophisticated network design with vast parameters).

Unfortunately, fully labeled data is not always at hand. Firstly, it is infeasible or labor-intensive to obtain abundant, accurately labeled data [31]. Secondly, labels are not accessible under certain circumstances, such as privacy constraints [22]. Therefore, the community has begun to pay attention to weakly supervised learning (a.k.a. weakly labeled learning, WeLL), which concerns any corrosion in the supervised information. For example, semi-supervised learning (SSL) is a WeLL problem obtained by concealing most of the labels in the training stage.
Another widespread WeLL problem is multi-instance learning (MIL) [19], which is also representative of learning with bags.

Furthermore, generative adversarial networks (GANs) [11], originally proposed to synthesize high-fidelity data by estimating the underlying data distribution, can potentially be applied to WeLL. For instance, GANs can deal with K-class SSL [28] by treating the generated data as class K+1 and exploiting feature matching (FM) as the generator objective. Unlike maximizing the log-likelihood of a variational lower bound on unlabeled data [16, 17], GANs seek the equilibrium between two networks (the discriminator and the generator) by alternately updating them in an adversarial game, and directly yield the final classifier from the discriminator.

*Joint first authors with equal contribution
†Corresponding author

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: An illustration of multi-class learning from label proportions. In detail, the data belongs to three categories and is partitioned into four non-overlapping groups. In each group, the sizes of the green, blue, and orange rectangles respectively denote the available label proportions of the different categories. We only know the sample feature information and the class proportions in every group.

In this paper, we push the envelope further by focusing on applying GANs to another WeLL problem: learning from label proportions (LLP) (see [23, 27, 33] for real-life applications). We illustrate the multi-class LLP problem in Figure 1. By referring to a group as a bag, LLP also fits the learning-with-bags setting, which was primarily established in MIL [9]. In LLP, we strive for an instance-level multi-class classifier merely with multi-bag proportional information and instance features (inputs).
On the right, instances from different categories are classified by a well-trained multi-class classifier.

The main challenge of LLP is to shrink the uncertainty in label inference based on the bag-level proportional information. Before deep learning made its appearance, several shallow methods were proposed, such as probability estimation methods (e.g., MeanMap [23] and Laplacian MeanMap [21]) and SVM-based methods (e.g., InvCal [27] and alter-∝SVM [34, 22]). However, the statistical approaches are severely constrained by strict assumptions on the data distribution and prior knowledge, while the SVM-based methods suffer from an NP-hard combinatorial optimization problem and thus lack scalability.

The motivation of our work mainly lies in the following three aspects. Firstly, as introduced above, GANs are an elegant recipe for solving WeLL problems, especially SSL [28]. From this viewpoint, our approach is in line with the idea of applying GANs to incomplete-label scenarios. Secondly, the success of generative models for WeLL stems from explicit or implicit representation learning, which has long been an essential method for unsupervised learning [5, 24], e.g., VAE [16]. In our approach, the convolution layers in the discriminator can act as a feature extractor for downstream tasks, which has proved efficient [24]. Hence, our work can be regarded as solving LLP via representation learning with GANs. In this scheme, the generated fake samples encourage the discriminator not only to detect the difference between real and fake instances, but also to distinguish the true K classes for real samples (through a (K+1)-class classifier). Thirdly, most LLP methods assume that the bags are i.i.d. [23, 34], which cannot sufficiently explore the underlying distribution of the data and may be contradicted in certain applications.
Instead, the generator in LLP-GAN is designed to learn the data distribution through the adversarial scheme, without this assumption.

The remainder of this paper is organized as follows:

• In Section 2, we give preliminaries regarding the LLP problem and propose a simple improvement, based on entropy regularization, for the existing deep LLP solver.
• In Section 3, we describe our adversarial learning framework for LLP, especially the lower bound of the discriminator objective. In particular, we reveal the relationship between the prior class proportions and the posterior class likelihoods. More importantly, we offer a decomposition of the class likelihood with respect to the prior class proportions, which verifies the existence of the final classifier.
• In Section 4, we empirically show that our method can achieve SOTA performance on large-scale LLP problems with low computational complexity.

2 Preliminaries

This section offers the necessary preliminaries for our approach, including the formal problem setting and related work with simple extensions.

2.1 The Multi-class LLP

Before further discussion, we formally describe multi-class LLP. For simplicity, we assume that all the bags are disjoint.
Then, the training data is $D = B_1 \cup B_2 \cup \cdots \cup B_n$, where $B_i = \{x_i^1, x_i^2, \cdots, x_i^{N_i}\}$, $i = 1, 2, \cdots, n$, are the bags, $B_i \cap B_j = \emptyset$, $\forall i \neq j$, and the total number of bags is $n$. Assuming we have $K$ classes, for $B_i$, let $p_i$ be a $K$-element vector whose $k$th element $p_i^k$ is the proportion of instances belonging to class $k$, with the constraint $\sum_{k=1}^{K} p_i^k = 1$, i.e.,

$$p_i^k := \frac{|\{j \in [1:N_i] \mid x_i^j \in B_i,\ y_i^{j*} = k\}|}{|B_i|}. \qquad (1)$$

Here, $[1:N_i] = \{1, 2, \cdots, N_i\}$ and $y_i^{j*}$ is the inaccessible ground-truth instance-level label of $x_i^j$. In this way, we can denote the available training data as $\mathcal{L} = \{(B_i, p_i)\}_{i=1}^{n}$. The goal of LLP is to learn an instance-level classifier based on $\mathcal{L}$.

2.2 Deep LLP Approach

In terms of deep learning, DLLP first leveraged DNNs to solve the multi-class LLP problem [1]. Using a DNN's probabilistic classification outputs, it is straightforward to adapt the cross-entropy loss into a bag-level version by averaging the probability outputs in every bag as the proportion estimate. To this end, inspired by [31], DLLP reshapes the standard cross-entropy loss by substituting label proportions for instance-level labels, in order to meet the requirement of proportion consistency.

In detail, suppose that $\tilde{p}_i^j = p_\theta(y|x_i^j)$ is the vector-valued DNN output for $x_i^j$, where $\theta$ is the network parameter. Let $\oplus$ be the element-wise summation operator. Then, the bag-level label proportion estimate in the $i$th bag is obtained by aggregating the element-wise posterior probabilities:

$$\bar{p}_i = \frac{1}{N_i} \bigoplus_{j=1}^{N_i} \tilde{p}_i^j = \frac{1}{N_i} \bigoplus_{j=1}^{N_i} p_\theta(y|x_i^j). \qquad (2)$$

Unlike the discriminant approaches, in order to smooth the max function [6], $\tilde{p}_i^j$ is given in a vector-valued softmax manner to produce the probability distribution for classification.
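To make Eqs. (1) and (2) concrete, here is a minimal NumPy sketch (an illustration, not the authors' released code) that computes a bag's ground-truth proportions and the bag-level estimate obtained by averaging per-instance softmax outputs:

```python
import numpy as np

def bag_proportions(labels, num_classes):
    """Ground-truth proportions p_i of Eq. (1): fraction of each class in a bag."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / len(labels)

def estimated_proportions(instance_probs):
    """Bag-level estimate of Eq. (2): average the per-instance softmax
    outputs p_theta(y | x_i^j) over the bag."""
    return np.mean(instance_probs, axis=0)

# Toy bag with 4 instances and K = 3 classes.
labels = np.array([0, 0, 1, 2])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.2, 0.7]])
p_true = bag_proportions(labels, 3)    # [0.5, 0.25, 0.25]
p_hat = estimated_proportions(probs)   # [0.4, 0.35, 0.25]
```

The DLLP objective then penalizes the cross entropy between these two vectors.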
Taking $\log$ as the element-wise logarithmic operator, the objective of DLLP can be intuitively formulated using the cross-entropy loss $L_{prop} = -\sum_{i=1}^{n} p_i^\top \log(\bar{p}_i)$. It penalizes the difference between the prior and posterior probabilities at the bag level, and commonly appears in GAN-based SSL [29].

2.3 Entropy Regularization for DLLP

Following the entropy regularization strategy [12], we can introduce an extra loss $E_{in}$ with a trade-off hyperparameter $\lambda$ to constrain the instance-level output distribution to low entropy:

$$L = L_{prop} + \lambda E_{in} = -\sum_{i=1}^{n} p_i^\top \log(\bar{p}_i) - \lambda \sum_{i=1}^{n} \sum_{j=1}^{N_i} (\tilde{p}_i^j)^\top \log(\tilde{p}_i^j). \qquad (3)$$

This straightforward extension of DLLP is similar to a KL divergence, taking care of the bag-level and instance-level consistencies simultaneously. It takes advantage of the DNN's output distribution to cater to the label-proportion requirement, while minimizing the output entropy as a regularization term to guarantee high true-fake belief. This is believed to be linked to an inherent maximum a posteriori (MAP) estimation [6] with a certain prior distribution on the network parameters. However, we do not examine the performance of this extension or include it as a baseline, because the experimental results empirically suggest that the original DLLP already converges to a solution with fairly low instance-level entropy, which makes the proposed regularization term redundant. We offer the results of this empirical study in the Supplementary Material.

3 Adversarial Learning for LLP

In this section, we focus on LLP based on adversarial learning and propose LLP-GAN, which devotes GANs to the LLP problem.

We illustrate the LLP-GAN framework in Figure 2.
Firstly, the generator is employed to generate images from input noise, which are labeled as fake, and the discriminator yields class confidence maps for each class (including the fake one) by taking both fake and real data as its inputs. This yields the adversarial loss. Secondly, we incorporate the proportions by adding the cross-entropy loss.

Figure 2: An illustration of our LLP-GAN framework.

3.1 The Objective Function of Discriminator

In LLP-GAN, our discriminator must not only identify whether a sample comes from the real data, but also elaborately distinguish each real input's label assignment as a $K$-class classifier. We incorporate the unsupervised adversarial learning into the $L_{unsup}$ term.

Next, the main issue becomes how to exploit the proportional information to guide this unsupervised learning correctly. To this end, we replace the supervised information in semi-supervised GANs with label proportions, resulting in $L_{sup}$, the same as $L_{prop}$ in (3).

Definition 1. Suppose that $P$ is a partition dividing the data space into $n$ disjoint sections. Let $p_d^i(x)$, $i = 1, 2, \cdots, n$, be the marginal distributions with respect to the elements of $P$, respectively. Accordingly, the $n$ bags in the LLP training data spring from sampling upon $p_d^i(x)$, $i = 1, 2, \cdots, n$. In the meantime, let $p(x, y)$ be the unknown holistic joint distribution.

We normalize the first $K$ classes of $P_D(\cdot|x)$ into the instance-level posterior probability $\tilde{p}_D(\cdot|x)$ and compute $\bar{p}$ based on (2).
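The normalization just described (dropping the fake class $K+1$ and rescaling the remaining mass, with a uniform fallback when $P_D(K+1|x) = 1$) can be sketched as follows; this is an illustrative NumPy helper, not the released implementation:

```python
import numpy as np

def normalize_posterior(probs_k_plus_1):
    """Turn (K+1)-class discriminator outputs P_D(.|x) into the
    instance-level posterior p~_D(.|x) over the first K classes.
    Rows where all mass sits on the fake class fall back to uniform."""
    p_real = probs_k_plus_1[:, :-1]           # drop the fake class K+1
    mass = p_real.sum(axis=1, keepdims=True)  # = 1 - P_D(K+1|x)
    K = p_real.shape[1]
    uniform = np.full_like(p_real, 1.0 / K)
    return np.where(mass > 0, p_real / np.maximum(mass, 1e-12), uniform)

def bag_estimate(probs_k_plus_1):
    """Bag-level proportion estimate: average p~_D over the bag, as in Eq. (2)."""
    return normalize_posterior(probs_k_plus_1).mean(axis=0)
```

Averaging the normalized rows over a bag gives the $\bar{p}$ used in the cross-entropy term of the discriminator objective.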
Then, the ideal optimization problem for the discriminator of LLP-GAN is:

$$\max_D V(G, D) = L_{unsup} + L_{sup} = L_{real} + L_{fake} - \lambda\, CEL(p, \bar{p}) = \sum_{i=1}^{n} \mathbb{E}_{x \sim p_d^i}\big[\log P_D(y \leq K|x)\big] + \mathbb{E}_{x \sim p_g}\big[\log P_D(K+1|x)\big] + \lambda \sum_{i=1}^{n} p_i^\top \log(\bar{p}_i). \qquad (4)$$

Here, $p_g(x)$ is the distribution of the synthesized data.

Remark 1. When $P_D(K+1|x) \neq 1$, the normalized instance-level posterior probability $\tilde{p}_D(\cdot|x)$ is:

$$\tilde{p}_D(k|x) = \frac{P_D(k|x)}{1 - P_D(K+1|x)}, \quad k = 1, 2, \cdots, K. \qquad (5)$$

If $P_D(K+1|x) = 1$, let $\tilde{p}_D(k|x) = \frac{1}{K}$, $k = 1, 2, \cdots, K$. Note that the weight $\lambda$ in (4) is added to balance the supervised and unsupervised terms, which is a slight revision of SSL with GANs [28, 8]. Intuitively, we reckon that the proportional information is too weak to fulfill the supervised learning pursuit. Hence, a relatively large weight should be preferable in the experiments. However, a large $\lambda$ may result in unstable GAN training. For simplicity, we fix $\lambda = 1$ in the following theoretical analysis of the discriminator.

Aside from identifying the first two terms in (4) with those in semi-supervised GANs, the cross-entropy term harnesses the label-proportion consistency. In order to justify the non-triviality of this loss, we first look at its lower bound. More importantly, it is easier to perform the gradient method on the lower bound, because it swaps the order of the $\log$ and the summation operation. For brevity, the analysis will be done in a non-parametric setting, i.e., we assume that both $D$ and $G$ have infinite capacity.

Remark 2 (The Lower Bound Approximation).
Let $p_i(k)$ be the class-$k$ proportion in the $i$th bag. According to the idea of sampling methods and Jensen's inequality, we have:

$$-CEL(p, \bar{p}) = \sum_{i=1}^{n} \sum_{k=1}^{K} p_i(k) \log\Big[\frac{1}{N_i}\sum_{j=1}^{N_i} \tilde{p}_D(k|x_i^j)\Big] \simeq \sum_{i=1}^{n} \sum_{k=1}^{K} p_i(k) \log\Big[\int p_d^i(x)\, \tilde{p}_D(k|x)\, dx\Big] \geq \sum_{i=1}^{n} \sum_{k=1}^{K} p_i(k)\, \mathbb{E}_{x \sim p_d^i}\big[\log \tilde{p}_D(k|x)\big]. \qquad (6)$$

The expectation in the last term can be approximated by sampling. Similar to the EM mechanism [20] for mixture models, by approximating $-CEL(p, \bar{p})$ with its lower bound, we can perform gradient ascent independently on every sample. Hence, SGD can be applied.

As shown in (6), in order to facilitate the gradient computation, we substitute the cross entropy in (4) with its lower bound and denote this approximate objective function for the discriminator by $\widetilde{V}(G, D)$.

3.2 The Optimal Discriminator and LLP Classifier

Now, we give the optimal discriminator and the final classifier for LLP based on the analysis of $\widetilde{V}(G, D)$. Firstly, we have the following result about the lower bound in (6).

Lemma 1. The maximization of the lower bound in (6) induces an optimal discriminator $D^*$ with a posterior distribution $\tilde{p}_{D^*}(y|x)$ that is consistent with the prior distribution $p_i(y)$ in each bag.

Proof.
Taking the aggregation with respect to one bag, for example the $i$th bag, we have:

$$\mathbb{E}_{x \sim p_d^i}[\log p(x)] = \mathbb{E}_{x \sim p_d^i}\Big[\int p_i(y) \log \Big(\frac{p(x, y)}{\tilde{p}_D(y|x)} \cdot \frac{\tilde{p}_D(y|x)}{p(y|x)}\Big)\, dy\Big] = \sum_{k=1}^{K} p_i(k)\, \mathbb{E}_{x \sim p_d^i}\big[\log \tilde{p}_D(k|x)\big] + \mathbb{E}_{x \sim p_d^i}\Big[\int p_i(y) \log \frac{p_i(y)}{\tilde{p}_D(y|x)}\, dy\Big] + \mathbb{E}_{x \sim p_d^i}\Big[\sum_{k=1}^{K} p_i(k) \log \frac{p(x|k)}{p(k|x)}\Big] \geq \sum_{k=1}^{K} p_i(k)\, \mathbb{E}_{x \sim p_d^i}\big[\log \tilde{p}_D(k|x)\big] + \mathbb{E}_{x \sim p_d^i}\Big[\sum_{k=1}^{K} p_i(k) \log \frac{p(x|k)}{p(k|x)}\Big]. \qquad (7)$$

Here, because we only consider $x \sim p_d^i$, $p(x, y) = p_i(y)\, p(x|y)$ holds. Note that the left-hand side and the last term in (7) are free of the discriminator, and the aggregation can be performed independently within every bag due to the disjointness assumption on the bags. Then, maximizing the lower bound in (6) is equivalent to minimizing the expectation of the KL-divergence between $p_i(y)$ and $\tilde{p}_D(y|x)$. Because of the infinite-capacity assumption on the discriminator and the non-negativity of the KL-divergence, we have:

$$D^* = \arg\min_D \mathbb{E}_{x \sim p_d^i}\big[KL(p_i(y)\,\|\,\tilde{p}_D(y|x))\big] \;\Longleftrightarrow\; \tilde{p}_{D^*}(y|x) \overset{a.e.}{=} p_i(y), \quad x \sim p_d^i(x). \qquad (8)$$

That concludes the proof.

Lemma 1 tells us that if there is only one bag, then the final classifier satisfies $\tilde{p}_{D^*}(y|x) \overset{a.e.}{=} p(y)$. However, there are normally multiple bags in an LLP problem, so the final classifier will be a trade-off among all the prior proportions $p_i(y)$, $i = 1, 2, \cdots, n$. Next, we show how adversarial learning on the discriminator helps to determine the formulation of this trade-off as a weighted aggregation.

Theorem 1. For fixed $G$, the optimal discriminator $D^*$ for $\widetilde{V}(G, D)$ satisfies:

$$P_{D^*}(y = k|x) = \frac{\sum_{i=1}^{n} p_i(k)\, p_d^i(x)}{\sum_{i=1}^{n} p_d^i(x) + p_g(x)}, \quad k = 1, 2, \cdots, K. \qquad (9)$$

Proof.
According to (4) and (6), and given any generator $G$, we have:

$$\widetilde{V}(G, D) = \sum_{i=1}^{n} \mathbb{E}_{x \sim p_d^i}\Big[\log\big(1 - P_D(K+1|x)\big) + \sum_{k=1}^{K} p_i(k) \log \frac{P_D(k|x)}{1 - P_D(K+1|x)}\Big] + \mathbb{E}_{x \sim p_g}\big[\log P_D(K+1|x)\big] = \int \Big\{\sum_{i=1}^{n} p_d^i(x)\Big[\log\Big(\sum_{k=1}^{K} P_D(k|x)\Big) + \sum_{k=1}^{K} p_i(k) \log \tilde{p}_D(k|x)\Big] + p_g(x) \log\Big(1 - \sum_{k=1}^{K} P_D(k|x)\Big)\Big\}\, dx. \qquad (10)$$

By taking the derivative of the integrand, we find the solution in $[0, 1]$ for the maximization as (9).

Remark 3 (Beyond the Discontinuity of $p_g$). According to [2], the problematic scenario is that the generator is a mapping from a low-dimensional space to a high-dimensional one. This can render the density $p_g(x)$ infeasible. However, based on the definition of $\tilde{p}_D(y|x)$ in (5), we have:

$$\tilde{p}_{D^*}(y|x) = \frac{\sum_{i=1}^{n} p_i(y)\, p_d^i(x)}{\sum_{i=1}^{n} p_d^i(x)} = \sum_{i=1}^{n} w_i(x)\, p_i(y). \qquad (11)$$

Hence, our final classifier does not depend on $p_g(x)$. Furthermore, (11) explicitly expresses the normalized weights of the aggregation, with $w_i(x) = p_d^i(x) / \sum_{i=1}^{n} p_d^i(x)$.

Remark 4 (Relationship to One-sided Label Smoothing).
Notice that the optimal discriminator $D^*$ is also related to the one-sided label smoothing mentioned in [28], which is inspired by [30] and shown to reduce the vulnerability of neural networks to adversarial examples [32]. In particular, in our model we only smooth the labels of real data (multi-class) in the discriminator, by setting the targets to the prior proportions $p_i(y)$ of the corresponding bags.

3.3 The Objective Function of Generator

Normally, for the generator, we should solve the following optimization problem with respect to $p_g$:

$$\min_G \widetilde{V}(G, D^*) = \min_G \mathbb{E}_{x \sim p_g}\big[\log P_{D^*}(K+1|x)\big]. \qquad (12)$$

Denoting $C(G) = \max_D \widetilde{V}(G, D) = \widetilde{V}(G, D^*)$, because $\widetilde{V}(G, D)$ is convex in $p_g$ and the supremum of a set of convex functions is still convex, we have the following sufficient and necessary condition for global optimality.

Theorem 2. The global minimum of $C(G)$ is achieved if and only if $p_g = \frac{1}{n}\sum_{i=1}^{n} p_d^i$.

Proof. Denote $p_d = \sum_{i=1}^{n} p_d^i$. Hence, according to Theorem 1, we can reformulate $C(G)$ as:

$$C(G) = \sum_{i=1}^{n} \mathbb{E}_{x \sim p_d^i}\Big[\log \frac{p_d(x)}{p_d(x) + p_g(x)}\Big] + \mathbb{E}_{x \sim p_g}\Big[\log \frac{p_g(x)}{p_d(x) + p_g(x)}\Big] + \sum_{i=1}^{n} \sum_{k=1}^{K} p_i(k)\, \mathbb{E}_{x \sim p_d^i}\big[\log \tilde{p}_{D^*}(k|x)\big] = 2 \cdot JSD(p_d \| p_g) - 2\log(2) - \sum_{i=1}^{n} \mathbb{E}_{x \sim p_d^i}\big[CE(p_i(y), \tilde{p}_{D^*}(y|x))\big], \qquad (13)$$

where $JSD(\cdot\|\cdot)$ and $CE(\cdot,\cdot)$ are the Jensen-Shannon divergence and the cross entropy between two distributions, respectively. However, note that $p_d$ is a summation of $n$ independent distributions, so $\frac{1}{n} p_d$ is a well-defined probability density.
Then, we have:

$$C(G^*) = \min_G C(G) = n\log(n) - (n+1)\log(n+1) - \sum_{i=1}^{n} \mathbb{E}_{x \sim p_d^i}\big[CE(p_i(y), \tilde{p}_{D^*}(y|x))\big] \;\Longleftrightarrow\; p_{g^*} \overset{a.e.}{=} \frac{1}{n}\, p_d. \qquad (14)$$

That concludes the proof.

Remark 5. When there is only one bag, the first two terms in (14) degenerate to $n\log(n) - (n+1)\log(n+1) = -2\log 2$, which adheres to the result for the original GANs. On the other hand, the third term manifests the uncertainty in the instance labels, which is concealed in the form of proportions.

Remark 6. According to the analysis above, ideally we can obtain the Nash equilibrium between the discriminator and the generator, i.e., the solution pair $(G^*, D^*)$ satisfies:

$$\widetilde{V}(G^*, D^*) \geq \widetilde{V}(G^*, D),\ \forall D; \qquad \widetilde{V}(G^*, D^*) \leq \widetilde{V}(G, D^*),\ \forall G. \qquad (15)$$

However, as shown in [8], a well-trained generator leads to inefficiency of the supervised information. In other words, the discriminator would possess the same generalization ability as one trained merely on $L_{prop}$. Hence, we apply feature matching (FM) to the generator and obtain an alternative objective by matching the expected values of the features (statistics) on an intermediate layer of the discriminator [28]: $L(G) = \|\mathbb{E}_{x \sim \frac{1}{n} p_d} f(x) - \mathbb{E}_{x \sim p_g} f(x)\|_2^2$.
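As a concrete illustration, the FM objective above is just the squared L2 distance between the mean features of real and generated samples; a minimal NumPy sketch, where $f$ is assumed to have already been applied to the inputs to produce the feature arrays:

```python
import numpy as np

def feature_matching_loss(real_features, fake_features):
    """Feature matching: squared L2 distance between the mean
    intermediate-layer features of real and generated samples."""
    diff = real_features.mean(axis=0) - fake_features.mean(axis=0)
    return np.sum(diff ** 2)
```

In a training loop, `real_features` and `fake_features` would be the discriminator's intermediate-layer activations on a real minibatch and a generated minibatch, respectively.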
In fact, FM is similar to the perceptual loss for style transfer in concurrent work [15], and the goal of this improvement is to impede a "perfect" generator, which would result in unstable training and a discriminator with low generalization.

3.4 LLP-GAN Algorithm

So far, we have clarified the objective functions of both the discriminator and the generator in LLP-GAN. After the training stage, the discriminator serves as the final classifier.

The strict proof of algorithm convergence is similar to that in [11]. Because $\max_D \widetilde{V}(G, D)$ is convex in $G$, and the subdifferential of $\max_D \widetilde{V}(G, D)$ contains that of $\widetilde{V}(G, D^*)$ at every step, line-search (stochastic) gradient descent converges [7].

We present the LLP-GAN algorithm, which coincides with the algorithm of the original GAN [11].

Algorithm 1: LLP-GAN Training Algorithm
Input: the training set $\mathcal{L} = \{(B_i, p_i)\}_{i=1}^{n}$; $L$: number of total iterations; $\lambda$: weight parameter.
Output: the parameters of the final discriminator $D$.
Set $m$ to the total number of training data points.
for $i = 1:L$ do
    Draw $m$ samples $\{z^{(1)}, z^{(2)}, \cdots, z^{(m)}\}$ from a simple-to-sample noise prior $p(z)$ (e.g., $N(0, I)$).
    Compute $\{G(z^{(1)}), G(z^{(2)}), \cdots, G(z^{(m)})\}$ as samples from $p_g(x)$.
    Fix the generator $G$ and perform gradient ascent on the parameters of $D$ in $\widetilde{V}(G, D)$ for one step.
    Fix the discriminator $D$ and perform gradient descent on the parameters of $G$ in $L(G)$ for one step.
end
Return the parameters of the discriminator $D$ from the last step.

4 Experiments

Four benchmark datasets, MNIST, SVHN, CIFAR-10, and CIFAR-100, are investigated in our experiments.¹
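Algorithm 1 above can be sketched as the following minimal training loop. The helpers `d_step` and `g_step` are hypothetical stand-ins for one gradient-ascent step on $\widetilde{V}(G, D)$ with $G$ fixed and one gradient-descent step on the FM loss $L(G)$ with $D$ fixed; this is a structural sketch, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_llp_gan(bags, bag_props, num_iters, d_step, g_step, noise_dim=8):
    """Sketch of the Algorithm 1 loop (assumed structure).

    bags:      list of (N_i, features) arrays, one per bag B_i
    bag_props: list of K-element proportion vectors p_i
    d_step:    callback for one discriminator update, generator fixed
    g_step:    callback for one generator update, discriminator fixed
    """
    m = sum(len(b) for b in bags)  # total number of training data points
    for _ in range(num_iters):
        # Draw m samples from the simple-to-sample noise prior N(0, I).
        z = rng.standard_normal((m, noise_dim))
        d_step(bags, bag_props, z)  # ascent on V~(G, D)
        g_step(bags, z)             # descent on L(G)
```

In practice both callbacks would be minibatch SGD updates on the network parameters; here they are left abstract.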
In addition to the test error comparison, three issues are discussed: the generated samples, the performance under different selections of the hyperparameter $\lambda$, and the algorithm's scalability.

4.1 Experimental Setup

To keep the same settings as previous work, the bag size is fixed at 16, 32, 64, or 128, and we divide the training data into bags accordingly. The MNIST data can be found with the code in the Supplementary Material. We conceal the accessible instance-level labels by replacing them with bag-level label proportions. Note that we still need the instance-level labels of the test data to assess the effectiveness of the obtained classifier.

4.2 Results on CIFAR-10

Firstly, we run both DLLP and LLP-GAN on CIFAR-10, a computer-vision dataset used for object recognition with 60,000 color images belonging to 10 categories. In the experimental setting, the training data is equally divided into five minibatches, with 10,000 images in each one, and the test data has exactly 1,000 images in every category.

4.2.1 Convergence Analysis

We report the convergence curves of the test error (y-axis) with respect to the epoch (x-axis) under different bag sizes in Figure 3. As shown, our results are highly superior to DLLP in most of the epochs, with significant convergence in test error. In contrast, DLLP fails to converge under relatively large bag sizes (i.e., 64 and 128). Our method also achieves better accuracy.

4.2.2 Generated Samples

The original GAN suffers from inefficient training of the generator [2]. It has been suggested that the discriminator and generator cannot simultaneously perform well [8]. In LLP-GAN, although it is the discriminator that we are interested in, we still expect a competent generator in order to construct an efficient adversarial learning paradigm. As a result, we look at the generated samples of original GANs with
As a result, we look at the generated samples of original GANs with\n\n1Code is available at https://github.com/liujiabin008/LLP-GAN.\n\n7\n\n\f(a) Bag size: 16\n\n(b) Bag size: 32\n\n(c) Bag size: 64\n\n(d) Bag size: 128\n\nFigure 3: The convergence curves on CIFAR-10 w/ different bag sizes.\n\n(a) GANs with FM\n\n(b) Ours after 50 epochs\n\n(c) Ours after 60 epochs\n\n(d) Ours after 70 epochs\n\nFigure 4: Generated samples on CIFAR-10.\n\nFM in Figure 4(a) and our method in Figure 4(b), 4(c) and 4(d). It demonstrates that our approach\ncan stably learn a comparable generator to produce similar samples to that of GANs.\n\n4.3 The Results of Error Rate\n\nSecondly, DLLP and LLP-GAN are carried out on four bench-\nmark datasets with different bag sizes in Table 1. We also give\nthe fully supervised learning results as the baselines. In detail,\nbaseline for MNIST and CIFAR-10 is offered by [25]. We de-\nscribe its architecture in the Supplementary Material. Network\nin [18] is used as the baseline for SVHN and CIFAR-100.\nIn terms of test error, our method reaches a relatively better\nresult, except for the simplest task MNIST, where both algo-\nrithms can attain satisfying results. However, DLLP becomes\nunacceptable when the bag size increases, while our method\ncan properly tackle relatively large bag size. 
Besides, for each dataset, LLP becomes extremely difficult as the bag size soars, which is consistent with our intuition. Again, the architectures of our networks are given in the Supplementary Material.

Figure 5: The average error rates w/ different bag sizes.

Table 1: Test error rates (%) on benchmark datasets w/ different bag sizes.

| Dataset   | Algorithm | 16           | 32           | 64           | 128          | Baseline CNNs |
|-----------|-----------|--------------|--------------|--------------|--------------|---------------|
| MNIST     | DLLP      | 1.23 (0.100) | 1.33 (0.094) | 1.57 (0.088) | 3.55 (0.27)  | 0.36          |
|           | LLP-GAN   | 1.10 (0.026) | 1.23 (0.088) | 1.40 (0.089) | 3.49 (0.27)  |               |
| SVHN      | DLLP      | 4.45 (0.069) | 5.29 (0.54)  | 5.80 (0.91)  | 39.73 (1.60) | 2.35          |
|           | LLP-GAN   | 4.03 (0.021) | 4.83 (0.51)  | 5.42 (0.59)  | 11.17 (1.12) |               |
| CIFAR-10  | DLLP      | 19.70 (0.77) | 34.39 (0.82) | 68.32 (1.34) | 82.89 (2.66) | 9.27          |
|           | LLP-GAN   | 13.68 (0.35) | 16.23 (0.43) | 21.03 (1.82) | 27.39 (4.31) |               |
| CIFAR-100 | DLLP      | 53.24 (0.77) | 98.38 (0.11) | 98.65 (0.09) | 98.98 (0.08) | 35.68         |
|           | LLP-GAN   | 50.95 (0.67) | 56.44 (0.78) | 64.37 (1.52) | 85.01 (1.81) |               |

Because InvCal and alter-∝SVM are originally designed for binary problems, we randomly select two classes and conduct only binary classification on all datasets. The detailed results are provided in the Supplementary Material. The average error rates with different bag sizes are displayed in Figure 5. From the results, we can confidently tell the advantage of our algorithm in performance, especially when the bag size is relatively large. Indeed, at this moment, we cannot clarify to what extent this advantage is attributable to the deep learning model.
However, despite both using deep learning models, our method consistently performs better than DLLP.

4.4 Hyperparameter Analysis and Complexity with Sample Size

Thirdly, we illustrate the convergence curves for MNIST, SVHN, and CIFAR-10 under different $\lambda$s in Figures 6(a), 6(b), and 6(c). For a simpler task (MNIST), the performance is not sensitive to $\lambda$; for a harder task (CIFAR-10), the performance becomes sensitive to $\lambda$. On the other hand, smaller $\lambda$ exhibits more fluctuation, which is more severe on the simpler tasks (MNIST and SVHN). Besides, Figure 6(b) indicates that the convergence speed may be sensitive to the choice of $\lambda$. In most cases, $\lambda \geq 1$ is a good choice, leading to comparable performance within limited training time.

In addition, fixing the bag size, we plot the relative training time (training time per bag) against the relative sample size in Figure 6(d), taking the logarithm of the sample size (x-axis). It demonstrates that the relative training time is asymptotically linear in the logarithm of the sample size $m$. Denoting the total training time by $t$, we then have $t \approx O(m \ln m) < O(m^2)$. Here, we assume that the sample size and the number of bags are of the same magnitude, due to the relatively small bag sizes involved in our study.

Figure 6: Analysis on hyperparameter and complexity. (a) $\lambda$ on MNIST; (b) $\lambda$ on SVHN; (c) $\lambda$ on CIFAR-10; (d) Training time w/ different sample sizes.

4.5 Discussion on Experimental Results

Two issues should be clarified for the experiments. Firstly, as shown in Figure 3, the results oscillate as the bag size soars. This phenomenon indicates a common drawback of deep models: for more complex objective surfaces (more possible label candidates), convergence normally gets dramatically worse, due to more chances of reaching local minima or saddle points of the objective.
Secondly, because our results are based on the original datasets without data augmentation, the reported DLLP performance is worse than that in the concurrent work [10].

5 Conclusion

This paper proposed LLP-GAN, a new algorithm for the LLP problem by virtue of adversarial learning based on GANs. Consequently, our method is superior to existing methods in the following three aspects. Firstly, it demonstrates nice theoretical properties that are innately in accordance with GANs. Secondly, LLP-GAN can produce a probabilistic classifier, which benefits from the generative model and naturally meets the proportion consistency. Thirdly, by incorporating CNNs, our algorithm is suitable for large-scale problems, especially image datasets. Additionally, the experiments on four benchmark datasets have verified all these advantages of our approach.

Nevertheless, the limitations of our method can be summarized in four aspects. Firstly, learning complexity in the sense of PAC has not been considered in this study; that is, we cannot evaluate the performance under limited data. Secondly, there is no guarantee of the algorithm's robustness to data perturbations, notably when the proportions are imprecisely provided. Thirdly, variant GAN models (such as WGAN [3]) are not fully considered, and their performance is still unknown. In addition, in many real-world applications, the bags are built based on certain features, such as education levels and job titles, rather than being randomly established. Hence, a practical issue will be to ensure good performance under these non-random bag assignments.
Overcoming these drawbacks will shed light on promising improvements of our current work.

Acknowledgements

This work is supported by grants from: National Natural Science Foundation of China (No. 61702099, 71731009, 61472390, 71932008, 91546201, and 71331005), Science and Technology Service Network Program of Chinese Academy of Sciences (STS Program, No. KFJ-STS-ZDTP-060), and the Fundamental Research Funds for the Central Universities in UIBE (No. CXTD10-05). Bo Wang would like to acknowledge that this research was conducted during his visit at Texas A&M University and to thank Dr. Xia Hu for his hosting and insightful discussions.

References

[1] Ehsan M. Ardehaly and Aron Culotta. Co-training for demographic classification using deep learning from label proportions. In International Conference on Data Mining Workshops, pages 1017–1024. IEEE, 2017.

[2] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2016.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[4] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

[5] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[6] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, January 2006.

[7] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[8] Zihang Dai, Zhilin Yang, Fan Yang, William W. Cohen, and Ruslan R. Salakhutdinov. Good semi-supervised learning that requires a bad GAN.
In Advances in Neural Information Processing Systems, pages 6510–6520, 2017.

[9] Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.

[10] Gabriel Dulac-Arnold, Neil Zeghidour, Marco Cuturi, Lucas Beyer, and Jean-Philippe Vert. Deep multi-class learning from label proportions. arXiv preprint arXiv:1905.12909, 2019.

[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[12] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pages 529–536, 2005.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pages 770–778, 2016.

[14] Geoffrey Hinton, Li Deng, Dong Yu, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[15] Justin Johnson, Alexandre Alahi, and Fei-Fei Li. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.

[16] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[17] Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[18] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[19] Oded Maron and Tomás Lozano-Pérez.
A framework for multiple-instance learning. In Advances in Neural Information Processing Systems, pages 570–576, 1998.

[20] Todd K. Moon. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.

[21] Giorgio Patrini, Richard Nock, Paul Rivera, and Tiberio Caetano. (Almost) no label no cry. In Advances in Neural Information Processing Systems, pages 190–198, 2014.

[22] Zhiquan Qi, Bo Wang, Fan Meng, et al. Learning with label proportions via NPSVM. IEEE Transactions on Cybernetics, 47(10):3293–3305, 2017.

[23] Novi Quadrianto, Alex J. Smola, Tiberio S. Caetano, et al. Estimating labels from label proportions. Journal of Machine Learning Research, 10(Oct):2349–2374, 2009.

[24] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[25] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.

[26] Joseph Redmon, Santosh Divvala, Ross Girshick, et al. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition, pages 779–788, 2016.

[27] Stefan Rueping. SVM classifier estimation from group probabilities. In International Conference on Machine Learning, pages 911–918, 2010.

[28] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[29] Jost T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.

[30] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, et al.
Rethinking the inception architecture for computer vision. In Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[31] Zilei Wang and Jiashi Feng. Multi-class learning from class proportions. Neurocomputing, 119(16):273–280, 2013.

[32] David Warde-Farley and Ian Goodfellow. Adversarial perturbations of deep neural networks. In Perturbations, Optimization, and Statistics, page 311. MIT Press, 2016.

[33] Felix X. Yu, Liangliang Cao, Michele Merler, et al. Modeling attributes from category-attribute proportions. In International Conference on Multimedia, pages 977–980. ACM, 2014.

[34] Felix X. Yu, Dong Liu, Sanjiv Kumar, et al. ∝-SVM for learning with label proportions. In International Conference on Machine Learning, pages 504–512, 2013.