{"title": "Good Semi-supervised Learning That Requires a Bad GAN", "book": "Advances in Neural Information Processing Systems", "page_first": 6510, "page_last": 6520, "abstract": "Semi-supervised learning methods based on generative adversarial networks (GANs) obtained strong empirical results, but it is not clear 1) how the discriminator benefits from joint training with a generator, and 2) why good semi-supervised classification performance and a good generator cannot be obtained at the same time. Theoretically we show that given the discriminator objective, good semi-supervised learning indeed requires a bad generator, and propose the definition of a preferred generator. Empirically, we derive a novel formulation based on our analysis that substantially improves over feature matching GANs, obtaining state-of-the-art results on multiple benchmark datasets.", "full_text": "Good Semi-supervised Learning\n\nThat Requires a Bad GAN\n\nZihang Dai\u2217, Zhilin Yang\u2217, Fan Yang, William W. Cohen, Ruslan Salakhutdinov\n\nSchool of Computer Science\nCarnegie Melon University\n\ndzihang,zhiliny,fanyang1,wcohen,rsalakhu@cs.cmu.edu\n\nAbstract\n\nSemi-supervised learning methods based on generative adversarial networks\n(GANs) obtained strong empirical results, but it is not clear 1) how the discrimina-\ntor bene\ufb01ts from joint training with a generator, and 2) why good semi-supervised\nclassi\ufb01cation performance and a good generator cannot be obtained at the same\ntime. Theoretically we show that given the discriminator objective, good semi-\nsupervised learning indeed requires a bad generator, and propose the de\ufb01nition\nof a preferred generator. 
Empirically, we derive a novel formulation based on\nour analysis that substantially improves over feature matching GANs, obtaining\nstate-of-the-art results on multiple benchmark datasets2.\n\n1\n\nIntroduction\n\nDeep neural networks are usually trained on a large amount of labeled data, and it has been a challenge\nto apply deep models to datasets with limited labels. Semi-supervised learning (SSL) aims to leverage\nthe large amount of unlabeled data to boost the model performance, particularly focusing on the\nsetting where the amount of available labeled data is limited. Traditional graph-based methods [2, 26]\nwere extended to deep neural networks [22, 23, 8], which involves applying convolutional neural\nnetworks [10] and feature learning techniques to graphs so that the underlying manifold structure\ncan be exploited. [15] employs a Ladder network to minimize the layerwise reconstruction loss\nin addition to the standard classi\ufb01cation loss. Variational auto-encoders have also been used for\nsemi-supervised learning [7, 12] by maximizing the variational lower bound of the unlabeled data\nlog-likelihood.\nRecently, generative adversarial networks (GANs) [6] were demonstrated to be able to generate\nvisually realistic images. GANs set up an adversarial game between a discriminator and a generator.\nThe goal of the discriminator is to tell whether a sample is drawn from true data or generated by the\ngenerator, while the generator is optimized to generate samples that are not distinguishable by the\ndiscriminator. Feature matching (FM) GANs [16] apply GANs to semi-supervised learning on K-\nclass classi\ufb01cation. The objective of the generator is to match the \ufb01rst-order feature statistics between\nthe generator distribution and the true distribution. 
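The feature-matching generator objective just described (matching first-order feature statistics between generated and unlabeled data) can be sketched as follows. This is a minimal numpy illustration, not the released implementation; `f_real` and `f_fake` stand for the discriminator's intermediate features f(x) on real and generated mini-batches.

```python
import numpy as np

def feature_matching_loss(f_real, f_fake):
    """Squared L2 distance between mean features of the real (unlabeled)
    and generated mini-batches: ||E[f(x_real)] - E[f(x_fake)]||^2."""
    mu_real = f_real.mean(axis=0)
    mu_fake = f_fake.mean(axis=0)
    diff = mu_real - mu_fake
    return float(diff @ diff)

# Toy usage with placeholder features: identical batches give zero loss.
rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 4))
loss_same = feature_matching_loss(batch, batch)
loss_shifted = feature_matching_loss(batch, batch + 1.0)
```

Note that only the batch means enter the loss, which is why (as discussed later in Section 4) the generator can satisfy this objective while collapsing to a few modes.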
Instead of binary classi\ufb01cation, the discriminator\nemploys a (K + 1)-class objective, where true samples are classi\ufb01ed into the \ufb01rst K classes and\ngenerated samples are classi\ufb01ed into the (K + 1)-th class. This (K + 1)-class discriminator objective\nleads to strong empirical results, and was later widely used to evaluate the effectiveness of generative\nmodels [5, 21].\nThough empirically feature matching improves semi-supervised classi\ufb01cation performance, the\nfollowing questions still remain open. First, it is not clear why the formulation of the discriminator\n\n\u2217Equal contribution. Ordering determined by dice rolling.\n2Code is available at https://github.com/kimiyoung/ssl_bad_gan.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fcan improve the performance when combined with a generator. Second, it seems that good semi-\nsupervised learning and a good generator cannot be obtained at the same time. For example, [16]\nobserved that mini-batch discrimination generates better images than feature matching, but feature\nmatching obtains a much better semi-supervised learning performance. The same phenomenon was\nalso observed in [21], where the model generated better images but failed to improve the performance\non semi-supervised learning.\nIn this work, we take a step towards addressing these questions. First, we show that given the\ncurrent (K + 1)-class discriminator formulation of GAN-based SSL, good semi-supervised learning\nrequires a \u201cbad\u201d generator. Here by bad we mean the generator distribution should not match the\ntrue data distribution. Then, we give the de\ufb01nition of a preferred generator, which is to generate\ncomplement samples in the feature space. 
Theoretically, under mild assumptions, we show that a\nproperly optimized discriminator obtains correct decision boundaries in high-density areas in the\nfeature space if the generator is a complement generator.\nBased on our theoretical insights, we analyze why feature matching works on 2-dimensional toy\ndatasets. It turns out that our practical observations align well with our theory. However, we also \ufb01nd\nthat the feature matching objective has several drawbacks. Therefore, we develop a novel formulation\nof the discriminator and generator objectives to address these drawbacks. In our approach, the\ngenerator minimizes the KL divergence between the generator distribution and a target distribution\nthat assigns high densities for data points with low densities in the true distribution, which corresponds\nto the idea of a complement generator. Furthermore, to enforce our assumptions in the theoretical\nanalysis, we add the conditional entropy term to the discriminator objective.\nEmpirically, our approach substantially improves over vanilla feature matching GANs, and obtains\nnew state-of-the-art results on MNIST, SVHN, and CIFAR-10 when all methods are compared under\nthe same discriminator architecture. Our results on MNIST and SVHN also represent state-of-the-art\namongst all single-model results.\n\n2 Related Work\n\nBesides the adversarial feature matching approach [16], several previous works have incorporated the\nidea of adversarial training in semi-supervised learning. Notably, [19] proposes categorical generative\nadversarial networks (CatGAN), which substitutes the binary discriminator in standard GAN with a\nmulti-class classi\ufb01er, and trains both the generator and the discriminator using information theoretical\ncriteria on unlabeled data. 
From the perspective of regularization, [14, 13] propose virtual adversarial training (VAT), which effectively smooths the output distribution of the classifier by seeking virtually adversarial samples. It is worth noting that VAT bears a similar merit to our approach, which is to learn from auxiliary non-realistic samples rather than realistic data samples. Despite the similarity, the principles of VAT and our approach are orthogonal: VAT aims to enforce a smooth function, while we aim to leverage a generator to better detect the low-density boundaries. Different from the aforementioned approaches, [24] proposes to train conditional generators with adversarial training to obtain complete sample pairs, which can be directly used as additional training cases. Recently, Triple GAN [11] also employs the idea of a conditional generator, but uses an adversarial cost to match the two model-defined factorizations of the joint distribution with the one defined by paired data.\nApart from adversarial training, there have been other efforts in semi-supervised learning using deep generative models recently. As an early work, [7] adapts the original Variational Auto-Encoder (VAE) to a semi-supervised learning setting by treating the classification label as an additional latent variable in the directed generative model. [12] adds auxiliary variables to the deep VAE structure to make the variational distribution more expressive. With the boosted model expressiveness, auxiliary deep generative models (ADGM) improve the semi-supervised learning performance upon the semi-supervised VAE. Different from the explicit usage of deep generative models, the Ladder networks [15] take advantage of the local (layerwise) denoising auto-encoding criterion, and create a more informative unsupervised signal through lateral connections.\n\n3 Theoretical Analysis\n\nGiven a labeled set L = {(x, y)}, let {1, 2, ..., K} be the label space for classification. 
Let D and G denote the discriminator and generator, and P_D and p_G denote the corresponding distributions.\n\nConsider the discriminator objective function of GAN-based semi-supervised learning [16]:\n\nmax_D  E_{x,y∼L} log P_D(y|x, y ≤ K) + E_{x∼p} log P_D(y ≤ K|x) + E_{x∼p_G} log P_D(K + 1|x),   (1)\n\nwhere p is the true data distribution. The probability distribution P_D is over K + 1 classes, where the first K classes are true classes and the (K + 1)-th class is the fake class. The objective function consists of three terms. The first term is to maximize the log conditional probability for labeled data, which is the standard cost as in the supervised learning setting. The second term is to maximize the log probability of the first K classes for unlabeled data. The third term is to maximize the log probability of the (K + 1)-th class for generated data. Note that the above objective function bears a similar merit to the original GAN formulation if we treat P(K + 1|x) to be the probability of fake samples, while the only difference is that we split the probability of true samples into K sub-classes.\nLet f(x) be a nonlinear vector-valued function, and w_k be the weight vector for class k. As a standard setting in previous work [16, 5], the discriminator D is defined as\n\nP_D(k|x) = exp(w_k^T f(x)) / Σ_{k'=1}^{K+1} exp(w_{k'}^T f(x)).\n\nSince this is a form of over-parameterization, w_{K+1} is fixed as a zero vector [16]. We next discuss the choices of different possible G's.\n\n3.1 Perfect Generator\n\nHere, by perfect generator we mean that the generator distribution p_G exactly matches the true data distribution p, i.e., p_G = p. We now show that when the generator is perfect, it does not improve the generalization over the supervised learning setting.\nProposition 1. 
If p_G = p, and D has infinite capacity, then for any optimal solution D = (w, f) of the following supervised objective,\n\nmax_D  E_{x,y∼L} log P_D(y|x, y ≤ K),   (2)\n\nthere exists D* = (w*, f*) such that D* maximizes Eq. (1) and that for all x, P_D(y|x, y ≤ K) = P_{D*}(y|x, y ≤ K).\nThe proof is provided in the supplementary material. Proposition 1 states that for any optimal solution D of the supervised objective, there exists an optimal solution D* of the (K + 1)-class objective such that D and D* share the same generalization error. In other words, using the (K + 1)-class objective does not prevent the model from experiencing any arbitrarily high generalization error that it could suffer from under the supervised objective. Moreover, since all the optimal solutions are equivalent w.r.t. the (K + 1)-class objective, it is the optimization algorithm that really decides which specific solution the model will reach, and thus what generalization performance it will achieve. This implies that when the generator is perfect, the (K + 1)-class objective by itself is not able to improve the generalization performance. In fact, in many applications, an almost infinite amount of unlabeled data is available, so learning a perfect generator for purely sampling purposes should not be useful. In this case, our theory suggests that not only does the generator not help, but unlabeled data is also not effectively utilized when the generator is perfect.\n\n3.2 Complement Generator\n\nThe function f maps data points in the input space to the feature space. Let p_k(f) be the density of the data points of class k in the feature space. Given a threshold ε_k, let F_k be a subset of the data support where p_k(f) > ε_k, i.e., F_k = {f : p_k(f) > ε_k}. We assume that given {ε_k}_{k=1}^K, the F_k's are disjoint with a margin. 
More formally, for any f_j ∈ F_j, f_k ∈ F_k, and j ≠ k, we assume that there exists a real number 0 < α < 1 such that αf_j + (1 − α)f_k ∉ F_j ∪ F_k. As long as the probability densities of different classes do not share any mode, i.e., ∀i ≠ j, argmax_f p_i(f) ∩ argmax_f p_j(f) = ∅, this assumption can always be satisfied by tuning the thresholds ε_k. With the assumption held, we will show that the model performance would be better if the thresholds could be set to smaller values (ideally zero). We also assume that each F_k contains at least one labeled data point.\nSuppose ∪_{k=1}^K F_k is bounded by a convex set B. If the support F_G of a generator G in the feature space is a relative complement set in B, i.e., F_G = B − ∪_{k=1}^K F_k, we call G a complement generator.\nThe reason why we utilize a bounded B to define the complement is presented in the supplementary material. Note that the definition of complement generator implies that G is a function of f. By treating G as a function of f, theoretically D can optimize the original objective function in Eq. (1).\nNow we present the assumption on the convergence conditions of the discriminator. Let U and G be the sets of unlabeled data and generated data.\nAssumption 1. Convergence conditions. When D converges on a finite training set {L, U, G}, D learns a (strongly) correct decision boundary for all training data points. 
More specifically, (1) for any (x, y) ∈ L, we have w_y^T f(x) > w_k^T f(x) for any other class k ≠ y; (2) for any x ∈ G, we have 0 > max_{k=1}^K w_k^T f(x); (3) for any x ∈ U, we have max_{k=1}^K w_k^T f(x) > 0.\n\nIn Assumption 1, conditions (1) and (2) assume classification correctness on labeled data and true-fake correctness on generated data respectively, which is directly induced by the objective function. Likewise, it is also reasonable to assume true-fake correctness on unlabeled data, i.e., log Σ_k exp(w_k^T f(x)) > 0 for x ∈ U. However, condition (3) goes beyond this and assumes max_k w_k^T f(x) > 0. We discuss this issue in detail in the supplementary material and argue that these assumptions are reasonable. Moreover, in Section 5, our approach addresses this issue explicitly by adding a conditional entropy term to the discriminator objective to enforce condition (3).\nLemma 1. Suppose for all k, the L2-norms of the weights w_k are bounded by ||w_k||_2 ≤ C. Suppose that there exists ε > 0 such that for any f_G ∈ F_G, there exists f'_G ∈ G such that ||f_G − f'_G||_2 ≤ ε. With the conditions in Assumption 1, for all k ≤ K, we have w_k^T f_G < Cε.\nCorollary 1. When unlimited generated data samples are available, with the conditions in Lemma 1, we have lim_{|G|→∞} w_k^T f_G ≤ 0.\nSee the supplementary material for the proof.\nProposition 2. Given the conditions in Corollary 1, for all classes k ≤ K and for all feature space points f_k ∈ F_k, we have w_k^T f_k > w_j^T f_k for any j ≠ k.\n\nProof. Without loss of generality, suppose j = arg max_{j≠k} w_j^T f_k. Now we prove it by contradiction. 
Suppose w_k^T f_k ≤ w_j^T f_k. Since the F_k's are disjoint with a margin, B is a convex set and F_G = B − ∪_k F_k, there exists 0 < α < 1 such that f_G = αf_k + (1 − α)f_j, with f_G ∈ F_G and f_j being the feature of a labeled data point in F_j. By Corollary 1, it follows that w_j^T f_G ≤ 0. Thus, w_j^T f_G = αw_j^T f_k + (1 − α)w_j^T f_j ≤ 0. By Assumption 1, w_j^T f_k > 0 and w_j^T f_j > 0, leading to contradiction. It follows that w_k^T f_k > w_j^T f_k for any j ≠ k.\n\nProposition 2 guarantees that when G is a complement generator, under mild assumptions, a near-optimal D learns correct decision boundaries in each high-density subset F_k (defined by ε_k) of the data support in the feature space. Intuitively, the generator generates complement samples, so the logits of the true classes are forced to be low in the complement. As a result, the discriminator obtains class boundaries in low-density areas. This builds a connection between our approach and manifold-based methods [2, 26], which also leverage the low-density boundary assumption.\nWith our theoretical analysis, we can now answer the questions raised in Section 1. First, the (K + 1)-class formulation is effective because the generated complement samples encourage the discriminator to place the class boundaries in low-density areas (Proposition 2). Second, good semi-supervised learning indeed requires a bad generator because a perfect generator is not able to improve the generalization performance (Proposition 1).\n\n4 Case Study on Synthetic Data\n\nIn the previous section, we have established the fact that a complement generator, instead of a perfect generator, is what makes a good semi-supervised learning algorithm. 
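The (K + 1)-class discriminator objective of Eq. (1), whose behavior the analysis above characterizes, can be sketched as follows. This is a hedged numpy illustration under the paper's convention w_{K+1} = 0, not the released code; the weights and features below are random placeholders.

```python
import numpy as np

def pd_probs(W, f):
    """P_D(k|x): softmax over K+1 classes, with the (K+1)-th logit
    fixed to 0 (i.e. w_{K+1} = 0, removing the over-parameterization)."""
    logits = f @ W.T                                        # (n, K)
    logits = np.concatenate([logits, np.zeros((f.shape[0], 1))], axis=1)
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)                 # (n, K+1)

def discriminator_objective(W, f_lab, y_lab, f_unl, f_gen):
    """Eq. (1): labeled log-likelihood + unlabeled 'true' log-probability
    + generated 'fake' (class K+1) log-probability, to be maximized."""
    K = W.shape[0]
    p_lab = pd_probs(W, f_lab)
    term1 = np.mean(np.log(p_lab[np.arange(len(y_lab)), y_lab]))
    p_unl = pd_probs(W, f_unl)
    term2 = np.mean(np.log(p_unl[:, :K].sum(axis=1)))       # P_D(y <= K | x)
    p_gen = pd_probs(W, f_gen)
    term3 = np.mean(np.log(p_gen[:, K]))                    # P_D(K+1 | x)
    return term1 + term2 + term3

# Toy usage with random placeholder features (K=3 classes, 4-dim features):
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
f_lab = rng.normal(size=(5, 4))
probs = pd_probs(W, f_lab)
```

Since each term is a mean of log-probabilities, the objective is always non-positive; maximizing it trades off the three correctness conditions of Assumption 1.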
Now, to get a more intuitive understanding, we conduct a case study based on two 2D synthetic datasets, where we can easily verify our theoretical analysis by visualizing the model behaviors. In addition, by analyzing how feature matching (FM) [16] works in 2D space, we identify some potential problems of it, which motivates our approach to be introduced in the next section. Specifically, the two synthetic datasets are four spins and two circles, as shown in Fig. 1.\n\nFigure 1: Labeled and unlabeled data are denoted by cross and point respectively, and different colors indicate classes.\n\nFigure 2: Left: Classification decision boundary, where the white line indicates the true-fake boundary; Right: True-fake decision boundary.\n\nFigure 3: Feature space at convergence.\n\nFigure 4: Left: Blue points are generated data, and the black shadow indicates unlabeled data. Middle and right can be interpreted as above.\n\nSoundness of complement generator. Firstly, to verify that the complement generator is a preferred choice, we construct the complement generator by uniformly sampling from a bounded 2D box that contains all unlabeled data, and removing those on the manifold. Based on the complement generator, the result on four spins is visualized in Fig. 2. As expected, both the classification and true-fake decision boundaries are almost perfect. More importantly, the classification decision boundary always lies in the fake data area (left panel), which well matches our theoretical analysis.\n\nVisualization of feature space. Next, to verify our analysis about the feature space, we choose the feature dimension to be 2, apply FM to the simpler dataset of two circles, and visualize the feature space in Fig. 3. As we can see, most of the generated features (blue points) reside in between the features of the two classes (green and orange crosses), although there exists some overlap. 
As a result, the discriminator can almost perfectly distinguish between true and generated samples, as indicated by the black decision boundary, satisfying our required Assumption 1. Meanwhile, the model obtains a perfect classification boundary (blue line), as our analysis suggests.\n\nPros and cons of feature matching. Finally, to further understand the strength and weakness of FM, we analyze the solution FM reaches on four spins, shown in Fig. 4. From the left panel, we can see that many of the generated samples actually fall into the data manifold, while the rest scatter around in the nearby surroundings of the data manifold. It suggests that by matching the first-order moment by SGD, FM is performing some kind of distribution matching, though in a rather weak manner. Loosely speaking, FM has the effect of generating samples close to the manifold. But due to its weak power in distribution matching, FM will inevitably generate samples outside of the manifold, especially when the data complexity increases. Consequently, the generator density p_G is usually lower than the true data density p within the manifold and higher outside. Hence, an optimal discriminator P_{D*}(K + 1 | x) = p(x)/(p(x) + p_G(x)) could still distinguish between true and generated samples in many cases. However, there are two types of mistakes the discriminator can still make:\n\n1. Higher density mistake inside manifold: Since the FM generator still assigns a significant amount of probability mass inside the support, wherever p_G > p > 0, an optimal discriminator will incorrectly predict samples in that region as \u201cfake\u201d. Actually, this problem has already shown up when we examine the feature space (Fig. 3).\n\n2. Collapsing with missing coverage outside manifold: As the feature matching objective for the generator only requires matching the first-order statistics, there exist many trivial solutions the generator can end up with. 
For example, it can simply collapse to the mean of unlabeled features, or a few surrounding modes as long as the feature mean matches. Actually, we do see such a collapsing phenomenon in high-dimensional experiments when FM is used (see Fig. 5a and Fig. 5c). As a result, a collapsed generator will fail to cover some gap areas between manifolds. Since the discriminator is only well-defined on the union of the data supports of p and p_G, the prediction result in such missing areas is under-determined and fully relies on the smoothness of the parametric model. In this case, significant mistakes can also occur.\n\n5 Approach\n\nAs discussed in previous sections, feature matching GANs suffer from the following drawbacks: 1) the first-order moment matching objective does not prevent the generator from collapsing (missing coverage); 2) feature matching can generate high-density samples inside the manifold; 3) the discriminator objective does not encourage realization of condition (3) in Assumption 1, as discussed in Section 3.2. Our approach aims to explicitly address the above drawbacks.\nFollowing prior work [16, 6], we employ a GAN-like implicit generator. We first sample a latent variable z from a uniform distribution U(0, 1) for each dimension, and then apply a deep convolutional network to transform z to a sample x.\n\n5.1 Generator Entropy\n\nFundamentally, the first drawback concerns the entropy of the distribution of generated features, H(p_G(f)). This connection is rather intuitive, as the collapsing issue is a clear sign of low entropy. Therefore, to avoid collapsing and increase coverage, we consider explicitly increasing the entropy.\nAlthough the idea sounds simple and straightforward, there are two practical challenges. Firstly, as implicit generative models, GANs only provide samples rather than an analytic density form. 
As a result, we cannot evaluate the entropy exactly, which rules out the possibility of naive optimization. More problematically, the entropy is defined in a high-dimensional feature space, which is changing dynamically throughout the training process. Consequently, it is difficult to estimate and optimize the generator entropy in the feature space in a stable and reliable way. Faced with these difficulties, we consider two practical solutions.\nThe first method is inspired by the fact that the input space is essentially static, where estimating and optimizing the counterpart quantities would be much more feasible. Hence, we instead increase the generator entropy in the input space, i.e., H(p_G(x)), using a technique derived from an information theoretical perspective that relies on variational inference (VI). Specifically, let Z be the latent variable space, and X be the input space. We introduce an additional encoder, q : X → Z, to define a variational upper bound of the negative entropy [3], −H(p_G(x)) ≤ −E_{x,z∼p_G} log q(z|x) = L_VI. Hence, minimizing the upper bound L_VI effectively increases the generator entropy. In our implementation, we formulate q as a diagonal Gaussian with bounded variance, i.e., q(z|x) = N(μ(x), σ²(x)), with 0 < σ(x) < θ, where μ(·) and σ(·) are neural networks, and θ is the threshold to prevent arbitrarily large variance.\nAlternatively, the second method aims at increasing the generator entropy in the feature space by optimizing an auxiliary objective. Concretely, we adapt the pull-away term (PT) [25] as the auxiliary cost,\n\nL_PT = (1 / (N(N − 1))) Σ_{i=1}^N Σ_{j≠i} ( f(x_i)^T f(x_j) / (||f(x_i)|| ||f(x_j)||) )²,\n\nwhere N is the size of a mini-batch and x are samples. Intuitively, the pull-away term tries to orthogonalize the features in each mini-batch by minimizing the squared cosine similarity. Hence, it has the effect of increasing the diversity of generated features and thus the generator entropy.\n\n5.2 Generating Low-Density Samples\n\nThe second drawback of feature matching GANs is that high-density samples can be generated in the feature space, which is not desirable according to our analysis. Similar to the argument in Section 5.1, it is infeasible to directly minimize the density of generated features. Instead, we enforce the generation of samples with low density in the input space. Specifically, given a threshold ε, we minimize the following term as part of our objective:\n\nE_{x∼p_G} log p(x) I[p(x) > ε],   (3)\n\nwhere I[·] is an indicator function. Using a threshold ε, we ensure that only high-density samples are penalized while low-density samples are unaffected. Intuitively, this objective pushes the generated samples to \u201cmove\u201d towards low-density regions defined by p(x). To model the probability distribution over images, we simply adapt the state-of-the-art density estimation model for natural images, namely the PixelCNN++ [17] model. The PixelCNN++ model is used to estimate the density p(x) in Eq. (3). The model is pretrained on the training set, and fixed during semi-supervised training.\n\n5.3 Generator Objective and Interpretation\n\nCombining our solutions to the first two drawbacks of feature matching GANs, we have the following objective function of the generator:\n\nmin_G  −H(p_G) + E_{x∼p_G} log p(x) I[p(x) > ε] + ||E_{x∼p_G} f(x) − E_{x∼U} f(x)||².   (4)\n\nThis objective is closely related to the idea of complement generator discussed in Section 3. 
To see that, let's first define a target complement distribution in the input space as follows:\n\np*(x) = (1/Z) · (1/p(x))  if p(x) > ε and x ∈ B_x;   p*(x) = C  if p(x) ≤ ε and x ∈ B_x,\n\nwhere Z is a normalizer, C is a constant, and B_x is the set defined by mapping B from the feature space to the input space. With the definition, the KL divergence (KLD) between p_G(x) and p*(x) is\n\nKL(p_G || p*) = −H(p_G) + E_{x∼p_G} log p(x) I[p(x) > ε] + E_{x∼p_G} ( I[p(x) > ε] log Z − I[p(x) ≤ ε] log C ).\n\nThe form of the KLD immediately reveals the aforementioned connection. Firstly, the KLD shares exactly the same two terms with the generator objective (4). Secondly, while p*(x) is only defined in B_x, there is no such hard constraint on p_G(x). However, the feature matching term in Eq. (4) can be seen as softly enforcing this constraint by bringing generated samples \u201cclose\u201d to the true data (cf. Section 4). Moreover, because the indicator function I[·] has zero gradient almost everywhere, the last term in the KLD would not contribute any informative gradient to the generator. In summary, optimizing our proposed objective (4) can be understood as minimizing the KL divergence between the generator distribution and a desired complement distribution, which connects our practical solution to our theoretical analysis.\n\n5.4 Conditional Entropy\n\nIn order for the complement generator to work, according to condition (3) in Assumption 1, the discriminator needs to have strong true-fake belief on unlabeled data, i.e., max_{k=1}^K w_k^T f(x) > 0. However, the objective function of the discriminator in [16] does not enforce a dominant class. Instead, it only needs Σ_{k=1}^K P_D(k|x) > P_D(K + 1|x) to obtain a correct decision boundary, while the probabilities P_D(k|x) for k ≤ K can possibly be uniformly distributed. 
To guarantee the strong true-fake belief in the optimal conditions, we add a conditional entropy term to the discriminator objective, which becomes\n\nmax_D  E_{x,y∼L} log p_D(y|x, y ≤ K) + E_{x∼U} log p_D(y ≤ K|x) + E_{x∼p_G} log p_D(K + 1|x) + E_{x∼U} Σ_{k=1}^K p_D(k|x) log p_D(k|x).   (5)\n\nBy optimizing Eq. (5), the discriminator is encouraged to satisfy condition (3) in Assumption 1. Note that the same conditional entropy term has been used in other semi-supervised learning methods [19, 13] as well, but here we motivate the minimization of conditional entropy based on our theoretical analysis of GAN-based semi-supervised learning.\nTo train the networks, we alternately update the generator and the discriminator to optimize Eq. (4) and Eq. (5) based on mini-batches. If an encoder is used to maximize H(p_G), the encoder and the generator are updated at the same time.\n\n6 Experiments\n\nWe mainly consider three widely used benchmark datasets, namely MNIST, SVHN, and CIFAR-10. As in previous work, we randomly sample 100, 1,000, and 4,000 labeled samples for MNIST, SVHN,\n\nMethods | MNIST (# errors) | SVHN (% errors) | CIFAR-10 (% errors)\nCatGAN [19] | 191 ± 10 | - | 19.58 ± 0.46\nSDGM [12] | 132 ± 7 | 16.61 ± 0.24 | -\nLadder network [15] | 106 ± 37 | - | 20.40 ± 0.47\nADGM [12] | 96 ± 2 | 22.86 | -\nFM [16] * | 93 ± 6.5 | 8.11 ± 1.3 | 18.63 ± 2.32\nALI [4] | - | 7.42 ± 0.65 | 17.99 ± 1.62\nVAT small [13] * | 136 | 6.83 | 14.87\nOur best model * | 79.5 ± 9.8 | 4.25 ± 0.03 | 14.41 ± 0.30\nTriple GAN [11] *‡ | 91 ± 58 | 5.77 ± 0.17 | 16.99 ± 0.36\nΠ model [9] †‡ | - | 5.43 ± 0.25 | 16.55 ± 0.29\nVAT+EntMin+Large [13] † | - | 4.28 | 13.15\n\nTable 1: Comparison with state-of-the-art methods on three benchmark datasets. 
Only methods without data augmentation are included. * indicates using the same (small) discriminator architecture, † indicates using a larger discriminator architecture, and ‡ means self-ensembling.\n\n(a) FM on SVHN  (b) Ours on SVHN  (c) FM on CIFAR  (d) Ours on CIFAR\n\nFigure 5: Comparing images generated by FM and our model. FM generates collapsed samples, while our model generates diverse \u201cbad\u201d samples.\n\nand CIFAR-10 respectively during training, and use the standard data split for testing. We use the 10-quantile log probability to define the threshold ε in Eq. (4). We add instance noise to the input of the discriminator [1, 18], and use spatial dropout [20] to obtain faster convergence. Except for these two modifications, we use the same neural network architecture as in [16]. For fair comparison, we also report the performance of our FM implementation with the aforementioned differences.\n\n6.1 Main Results\n\nWe compare the results of our best model with state-of-the-art methods on the benchmarks in Table 1. Our proposed methods consistently improve the performance upon feature matching. We achieve new state-of-the-art results on all the datasets when only the small discriminator architecture is considered. Our results are also state-of-the-art on MNIST and SVHN among all single-model results, even when compared with methods using self-ensembling and large discriminator architectures. Finally, note that because our method is actually orthogonal to VAT [13], combining VAT with our presented approach should yield further performance improvement in practice.\n\n6.2 Ablation Study\n\nWe report the results of the ablation study in Table 2. 
In the following, we analyze the effects of several components in our model, in light of the intrinsic characteristics of the different datasets.
First, the generator entropy terms (VI and PT) (Section 5.1) improve the performance on SVHN and CIFAR by up to 2.2 points in terms of error rate. Moreover, as shown in Fig. 5, our model significantly reduces the collapsing effects present in the samples generated by FM, which also indicates that maximizing the generator entropy is beneficial. On MNIST, probably due to its simplicity, no collapsing phenomenon was observed with vanilla FM training [16] or in our setting. Under such circumstances, maximizing the generator entropy seems unnecessary, and the estimation bias introduced by the approximation techniques can even hurt the performance.

Setting              Error
MNIST FM             85.0 ± 11.7
MNIST FM+VI          86.5 ± 10.6
MNIST FM+LD          79.5 ± 9.8
MNIST FM+LD+Ent      89.2 ± 10.5

Setting              Error
SVHN FM              6.83
SVHN FM+VI           5.29
SVHN FM+PT           4.63
SVHN FM+PT+Ent       4.25
SVHN FM+PT+LD+Ent    4.19

Setting              Error
CIFAR FM             16.14
CIFAR FM+VI          14.41
CIFAR FM+VI+Ent      15.82

Setting              Max log-p
MNIST FM             -297
MNIST FM+LD          -659
SVHN FM+PT+Ent       -5809
SVHN FM+PT+LD+Ent    -5919
SVHN 10-quant        -5622

Setting ε as q-th centile   q = 2        q = 10       q = 20       q = 100
Error on MNIST              77.7 ± 6.1   79.5 ± 9.8   80.1 ± 9.6   85.0 ± 11.7

Table 2: Ablation study. FM is feature matching. LD is the low-density enforcement term in Eq. (3). VI and PT are the two entropy-maximization methods described in Section 5.1. Ent means the conditional entropy term in Eq. (5). Max log-p is the maximum log probability of generated samples, evaluated by a PixelCNN++ model. 10-quant shows the 10-quantile of the true image log probability.
Error means the number of misclassified examples on MNIST, and error rate (%) on the others.

Second, the low-density (LD) term is useful when FM indeed generates samples in high-density areas. MNIST is a typical example of this case. When trained with FM, most of the generated handwritten digits are highly realistic and have high log probabilities under the density model (cf. max log-p in Table 2). Hence, when applied to MNIST, LD improves the performance by a clear margin. By contrast, few of the generated SVHN images are realistic (cf. Fig. 5a). Quantitatively, SVHN samples are assigned very low log probabilities (cf. Table 2). As expected, LD has a negligible effect on the performance for SVHN. Moreover, the "max log-p" column in Table 2 shows that while LD reduces the maximum log probability of the generated MNIST samples by a large margin, it does not yield a noticeable difference on SVHN. This further justifies our analysis. Based on the above observations, we conjecture that LD would not help on CIFAR, where sample quality is even lower; thus, we did not train a density model on CIFAR, given the limit on computational resources.
Third, adding the conditional entropy term has mixed effects on different datasets. While the conditional entropy (Ent) is an important factor in achieving the best performance on SVHN, it hurts the performance on MNIST and CIFAR. One possible explanation relates to the classic exploitation-exploration tradeoff, where minimizing Ent favors exploitation and minimizing the classification loss favors exploration. During the initial phase of training, the discriminator is relatively uncertain and thus the gradient of the Ent term might dominate. As a result, the discriminator learns to be more confident even on incorrect predictions, and thus gets trapped in local minima.
Lastly, we vary the value of the hyper-parameter ε in Eq. (4).
As shown at the bottom of Table 2, reducing ε clearly leads to better performance, which further justifies our analysis in Sections 3 and 4 that off-manifold samples are favorable.

6.3 Generated Samples

We compare the generated samples of FM and our approach in Fig. 5. The FM images in Fig. 5c are extracted from previous work [16]. While collapsing is widely observed in FM samples, our model generates diverse "bad" images, which is consistent with our analysis.

7 Conclusions

In this work, we present a semi-supervised learning framework that uses generated data to boost task performance. Under this framework, we characterize the properties of various generators and theoretically prove that a complementary (i.e., bad) generator improves generalization. Empirically, our proposed method improves the performance of image classification on several benchmark datasets.

Acknowledgement

This work was supported by the DARPA award D17AP00001, the Google focused award, and the Nvidia NVAIL award. The authors would also like to thank Han Zhao for his insightful feedback.

References

[1] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. In NIPS 2016 Workshop on Adversarial Training, 2017.

[2] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.

[3] Zihang Dai, Amjad Almahairi, Philip Bachman, Eduard Hovy, and Aaron Courville. Calibrating energy-based generative adversarial networks. arXiv preprint arXiv:1702.01691, 2017.

[4] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning.
arXiv preprint arXiv:1605.09782, 2016.

[5] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[7] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[8] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[9] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.

[10] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[11] Chongxuan Li, Kun Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. arXiv preprint arXiv:1703.02291, 2017.

[12] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.

[13] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976, 2017.

[14] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.

[15] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko.
Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.

[16] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.

[17] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.

[18] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.

[19] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.

[20] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.

[21] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Adversarial generator-encoder networks. arXiv preprint arXiv:1704.02304, 2017.

[22] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.

[23] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.

[24] Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov, and William W Cohen. Semi-supervised QA with generative domain-adaptive nets. arXiv preprint arXiv:1702.02206, 2017.

[25] Junbo Zhao, Michael Mathieu, and Yann LeCun.
Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.

[26] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.