{"title": "KDGAN: Knowledge Distillation with Generative Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 775, "page_last": 786, "abstract": "Knowledge distillation (KD) aims to train a lightweight classifier suitable to provide accurate inference with constrained resources in multi-label learning. Instead of directly consuming feature-label pairs, the classifier is trained by a teacher, i.e., a high-capacity model whose training may be resource-hungry. The accuracy of the classifier trained this way is usually suboptimal because it is difficult to learn the true data distribution from the teacher. An alternative method is to adversarially train the classifier against a discriminator in a two-player game akin to generative adversarial networks (GAN), which can ensure the classifier to learn the true data distribution at the equilibrium of this game. However, it may take excessively long time for such a two-player game to reach equilibrium due to high-variance gradient updates. To address these limitations, we propose a three-player game named KDGAN consisting of a classifier, a teacher, and a discriminator. The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses. By simultaneously optimizing the distillation and adversarial losses, the classifier will learn the true data distribution at the equilibrium. We approximate the discrete distribution learned by the classifier (or the teacher) with a concrete distribution. From the concrete distribution, we generate continuous samples to obtain low-variance gradient updates, which speed up the training. 
Extensive experiments using real datasets confirm the superiority of KDGAN in both accuracy and training speed.", "full_text": "KDGAN: Knowledge Distillation with\n\nGenerative Adversarial Networks\n\nXiaojie Wang\n\nUniversity of Melbourne\nxiaojiew94@gmail.com\n\nYu Sun\n\nTwitter Inc.\n\nysun@twitter.com\n\nRui Zhang\u2217\n\nUniversity of Melbourne\n\nrui.zhang@unimelb.edu.au\n\nJianzhong Qi\n\nUniversity of Melbourne\n\njianzhong.qi@unimelb.edu.au\n\nAbstract\n\nKnowledge distillation (KD) aims to train a lightweight classi\ufb01er suitable to provide\naccurate inference with constrained resources in multi-label learning. Instead of\ndirectly consuming feature-label pairs, the classi\ufb01er is trained by a teacher, i.e., a\nhigh-capacity model whose training may be resource-hungry. The accuracy of the\nclassi\ufb01er trained this way is usually suboptimal because it is dif\ufb01cult to learn the\ntrue data distribution from the teacher. An alternative method is to adversarially\ntrain the classi\ufb01er against a discriminator in a two-player game akin to generative\nadversarial networks (GAN), which can ensure the classi\ufb01er to learn the true data\ndistribution at the equilibrium of this game. However, it may take excessively long\ntime for such a two-player game to reach equilibrium due to high-variance gradient\nupdates. To address these limitations, we propose a three-player game named\nKDGAN consisting of a classi\ufb01er, a teacher, and a discriminator. The classi\ufb01er and\nthe teacher learn from each other via distillation losses and are adversarially trained\nagainst the discriminator via adversarial losses. By simultaneously optimizing the\ndistillation and adversarial losses, the classi\ufb01er will learn the true data distribution\nat the equilibrium. We approximate the discrete distribution learned by the classi\ufb01er\n(or the teacher) with a concrete distribution. 
From the concrete distribution, we\ngenerate continuous samples to obtain low-variance gradient updates, which speed\nup the training. Extensive experiments using real datasets con\ufb01rm the superiority\nof KDGAN in both accuracy and training speed.\n\n1\n\nIntroduction\n\nIn machine learning, it is common that more resources such as input features [47] or computational\nresources [23], which we refer to as privileged provision, are available at the stage of training a model\nthan those available at the stage of running the deployed model (i.e., the inference stage). Figure 1\nshows an example application of image tag recommendation, where more input features (called\nprivileged information [47]) are available at the training stage than those available at the inference\nstage. Speci\ufb01cally, the training stage has access to images as well as image titles and comments\n(textual information) as shown in Figure 1a, whereas the inference stage only has access to images\nthemselves as shown in Figure 1b. After a smart phone user uploads an image and is about to provide\ntags for the image, it is inconvenient to type tags on the phone and thinking about tags for the image\nalso takes time, so it is very useful to recommend tags based on the image as shown in Figure 1b.\nAnother example application is unlocking mobile phones by face recognition. 
We usually deploy face recognition models on mobile phones so that legitimate users can unlock the phones without depending on remote services or internet connections.\n\n\u2217Corresponding author\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n[Figure 1: Image tag recommendation where the additional text is only available for training. (a) Training: After a user uploads an image, additional text such as comments and titles besides the labeled tags is accumulated. (b) Inference: We recommend bay and sky right after an image is uploaded.]\n\nThe training stage may be done on a powerful server with significantly more computational resources than the inference stage, which is done on a mobile phone. Here, a key problem is how to use privileged provision, i.e., resources only accessible for training, to train a model with great inference performance [29].\nTypical approaches to the problem are based on knowledge distillation (KD) [7, 9, 23]. As shown by the left half of Figure 2, KD consists of a classifier and a teacher [29]. To operate for resource-constrained inference, the classifier does not use privileged provision. On the other hand, the teacher uses privileged provision by, e.g., having a larger model capacity or taking more features as input. Once trained, the teacher outputs a distribution over labels called soft labels [29] for each training instance. Then, the teacher trains the classifier to predict the soft labels via a distillation loss such as the L2 loss on logits [7]. This training process is often called \u201cdistilling\u201d the knowledge in the teacher into the classifier [23].
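As an illustrative sketch of such a distillation loss (a minimal NumPy example, not the paper's implementation; the toy logits below are assumptions), the teacher's logits yield soft labels, and the classifier can be penalized with an L2 loss on logits in the spirit of [7]:

```python
import numpy as np

def softmax(logits):
    # Stable softmax: the teacher's soft labels are a softmax over its logits.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def l2_distillation_loss(classifier_logits, teacher_logits):
    # L2 (squared-error) loss on logits: the classifier mimics the teacher.
    return 0.5 * np.mean((classifier_logits - teacher_logits) ** 2)

# Toy logits for one training instance (hypothetical values).
teacher_logits = np.array([[2.0, 0.5, -1.0]])
classifier_logits = np.array([[1.5, 0.0, -0.5]])

soft_labels = softmax(teacher_logits)  # teacher's distribution over labels
loss = l2_distillation_loss(classifier_logits, teacher_logits)  # 0.125 here
```

Minimizing this loss pulls the classifier's logits toward the teacher's, which is the "distilling" step described above.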
Since the teacher normally cannot perfectly model the true data distribution, it\nis dif\ufb01cult for the classi\ufb01er to learn the true data distribution from the teacher.\nGenerative adversarial networks (GAN) provide an alternative way to learn the true data distribution.\nInspired by Wang et al. [49], we \ufb01rst present a naive GAN (NaGAN) with two players. As shown by\nthe right part of Figure 2, NaGAN consists of a classi\ufb01er and a discriminator. The classi\ufb01er serves as\na generator that generates relevant labels given an instance while the discriminator aims to distinguish\nthe true labels from the generated ones. The classi\ufb01er learns from the discriminator to perfectly\nmodel the true data distribution at the equilibrium via adversarial losses. One limitation of NaGAN is\nthat a large number of training instances and epochs is normally required to reach equilibrium [15],\nwhich restricts its applicability to domains where collecting labeled data is expensive. The slow\ntraining speed is because in such a two-player framework, the gradients from the discriminator to\nupdate the classi\ufb01er often vanish or explode during the adversarial training [4]. It is challenging to\ntrain a classi\ufb01er to learn the true data distribution with limited training instances and epochs.\nTo address this challenge, we propose a three-player framework named KDGAN to distill knowledge\nwith generative adversarial networks. As shown in Figure 2, KDGAN consists of a classi\ufb01er, a\nteacher, and a discriminator. In addition to the distillation loss in KD and the adversarial losses\nin NaGAN mentioned above, we de\ufb01ne a distillation loss from the classi\ufb01er to the teacher and an\nadversarial loss between the teacher and the discriminator. Speci\ufb01cally, the classi\ufb01er and the teacher,\nserving as generators, aim to fool the discriminator by generating pseudo labels that resemble the true\nlabels. 
Meanwhile, the classi\ufb01er and the teacher try to reach an agreement on what pseudo labels to\ngenerate by distilling their knowledge into each other. By formulating the distillation and adversarial\nlosses as a minimax game, we enable the classi\ufb01er to learn the true data distribution at the equilibrium\n(see Section 3.2). Besides, the classi\ufb01er receives gradients from the teacher via the distillation loss\nand the discriminator via the adversarial loss. The gradients from the teacher often have low variance,\nwhich reduces the variance of gradients and thus speeds up the adversarial training (see Section 3.3).\nWe further consider reducing the variance of the gradients from the discriminator to accelerate the\ntraining of KDGAN. The gradients from the discriminator may have large variance when obtained\nthrough the widely used policy gradient methods [49, 52]. It is non-trivial to obtain low-variance\ngradients from the discriminator because the classi\ufb01er and the teacher generate discrete samples,\nwhich are not differentiable w.r.t. their parameters. We propose to relax the discrete distributions\nlearned by the classi\ufb01er and the teacher into concrete distributions [25, 31] with the Gumbel-Max\ntrick [20, 30]. We use the concrete distributions for generating continuous samples to enable end-\nto-end differentiability and suf\ufb01cient control over the variance of gradients. 
Given the continuous samples, we obtain low-variance gradients from the discriminator to accelerate the KDGAN training. To summarize, our contributions are as follows:\n\n\u2022 We propose a novel framework named KDGAN for multi-label learning, which trains a lightweight classifier suitable for resource-constrained inference using resources available only for training.\n\u2022 We reduce the number of training epochs required to converge by decreasing the variance of gradients, which is achieved by the design of KDGAN and the Gumbel-Max trick.\n\u2022 We conduct extensive experiments in two applications, image tag recommendation and deep model compression. The experiments validate the superiority of KDGAN over state-of-the-art methods.\n\n2 Related Work\n\nWe briefly review studies on knowledge distillation (KD) and generative adversarial networks (GAN). KD aims to transfer the knowledge in a powerful teacher to a lightweight classifier [9]. For example, Ba and Caruana [7] train a shallow classifier network to mimic a deep teacher network by matching logits via the L2 loss. Hinton et al. [23] generalize this work by training a classifier to predict soft labels provided by a teacher. Sau and Balasubramanian [39] further add random perturbations into soft labels to simulate learning from multiple teachers. Instead of using soft labels, Romero et al. [36] propose to use middle layers of a teacher to train a classifier. Unlike previous work on classification problems, Chen et al. [10] apply KD and hint learning to object detection problems. There also exists work that leverages KD to transfer knowledge between different domains [21], e.g., between high-quality and low-quality images [41]. Lopez-Paz et al. [29] unify KD with privileged information [35, 47, 48] as generalized distillation where a teacher is pretrained by taking as input privileged information.
Compared to KD, the proposed KDGAN framework introduces a\ndiscriminator to guarantee that the classi\ufb01er can learn the true data distribution at the equilibrium.\nGAN is initially proposed to generate continuous data by training a generator and a discriminator\nadversarially in a minimax game [17]. GAN has only recently been introduced to generate discrete\ndata [16, 54, 55] because discrete data makes it dif\ufb01cult to pass gradients from a discriminator\nbackward to update a generator. For example, sequence GAN (SeqGAN) [52] models the process\nof token sequence generation as a stochastic policy and adopts Monte Carlo search to update a\ngenerator. Different from these GANs with two players, Li et al. propose a GAN with three players\ncalled Triple-GAN [13]. Our KDGAN also consists of three players including two generators\nand a discriminator, but differs from Triple-GAN in that: (1) Both generators in KDGAN learn a\nconditional distribution over labels given features. However, the generators in Triple-GAN learn a\nconditional distribution over labels given features and a conditional distribution over features given\nlabels, respectively. (2) The samples from both generators in KDGAN are all discrete data while\nthe samples from the generators in Triple-GAN include both discrete and continuous data. These\ndifferences lead to different objective functions and training techniques, e.g., KDGAN can use the\nGumbel-Max trick [20, 30] to generate samples from both generators while Triple-GAN cannot do\nthis. There is also a rich body of studies on improving the training of GAN [5, 33, 56] such as feature\nmatching [38], which are orthogonal to our work and can be used to improve the training of KDGAN.\nWe explore the idea of integrating KD and GAN. A similar idea has been studied in [51] where a\ndiscriminator is introduced to train a classi\ufb01er. 
This previous study [51] differs from ours in that their discriminator trains the classifier to learn the data distribution produced by the teacher, while our discriminator trains the classifier to learn the true data distribution.\nWe apply the proposed KDGAN to address the problems of deep model compression and image tag recommendation. We can also apply KDGAN to address other problems where privileged provision is available [44]. For example, we can consider contextual signals in the intent tracking problem [42, 43] or user reviews in the movie recommendation problem [50] as privileged provision.\n\n3 Methods\n\nWe study the problem of training a lightweight classifier from a teacher that is trained with privileged provision (denoted by \u0001) to satisfy stringent inference requirements. The inference requirements may include (1) running in real time with limited computational resources, where privileged provision is computational resources [23]; (2) lacking a certain type of input features, where privileged provision is privileged information [47]. Following existing work [29], we use multi-label learning problems [12, 18, 53] as the target application scenarios of our methods for illustration purposes.\n\n[Figure 2 diagram: the KD, NaGAN, and KDGAN architectures connecting the classifier, teacher, and discriminator.]\n\nFigure 2: Comparison among KD, NaGAN, and KDGAN.
The classifier (C) and the teacher (T) learn discrete categorical distributions pc(y|x) and p\u0001t(y|x); y is a true label generated from the true data distribution pu(y|x); yc and yt are continuous samples generated from concrete distributions qc(y|x) and q\u0001t(y|x); sc and st are soft labels produced by C and T; LcDS and LtDS are distillation losses for C and T; LpAD and LnAD are adversarial losses for positive and negative feature-label pairs.\n\nSince privileged provision is only available at the training stage, the goal of the problem is to train a lightweight classifier that does not use privileged provision for effective inference.\nTo achieve this goal, we start with NaGAN, a naive adaptation of the two-player framework proposed by Wang et al. in information retrieval (Section 3.1). Similar to other two-player frameworks [49], the naive adaptation requires a large number of training instances and epochs [15], which is difficult to satisfy in practice [4]. To address the limitation, we propose a three-player framework named KDGAN that can speed up the training while preserving the equilibrium (Sections 3.2 and 3.3).\n\n3.1 NaGAN Formulation\n\nWe begin with NaGAN that combines a classifier C with a discriminator D in a minimax game. Since D is not meant for inference, it can leverage privileged provision. For example, D may have a larger model capacity than C or take as input more features than those available to C. In NaGAN, C generates pseudo labels y given features x following a categorical distribution pc(y|x), while D computes the probability p\u0001d(x, y) of a label y being from the true data distribution pu(y|x) given features x. With a slight abuse of notation, we also use x to refer to features including privileged information when the context is clear.
Following the value function of IRGAN [49], we define the value function V (c, d) for the minimax game in NaGAN as\n\nminc maxd V (c, d) = Ey\u223cpu[log p\u0001d(x, y)] + Ey\u223cpc[log(1 \u2212 p\u0001d(x, y))]. (1)\n\nLet h(x, y) and g(x, y) be the scoring functions for C and D. We define pc(y|x) and p\u0001d(x, y) as\n\npc(y|x) = softmax(h(x, y)) and p\u0001d(x, y) = sigmoid(g(x, y)). (2)\n\nThe scoring functions can be implemented in various ways, e.g., h(x, y) can be a multilayer perceptron [27]. We will detail the scoring functions for specific applications in Section 4. Such a two-player framework is trained by updating C and D alternatively [49]. The training will proceed until the equilibrium is reached, where C learns the true data distribution. At that point, D can do no better than random guesses at deciding whether a given label is generated by C or not [6].\nOur key observation is that the advantages and the disadvantages of KD and NaGAN are complementary: (1) KD usually requires a small number of training instances and epochs but cannot ensure the equilibrium where pc(y|x) = pu(y|x). (2) NaGAN ensures the equilibrium where pc(y|x) = pu(y|x) [49] but normally requires a large number of training instances and epochs. We aim to retain the advantages and avoid the disadvantages of both methods in a single framework.\n\n3.2 KDGAN Formulation\n\nWe formulate KDGAN as a minimax game with a classifier C, a teacher T, and a discriminator D. Similar to the classifier C, the teacher T generates pseudo labels based on a categorical distribution p\u0001t(y|x) = softmax(f (x, y)) where f (x, y) is also a scoring function. Both T and D use privileged provision, e.g., by having a large model capacity or taking privileged information as input.
In KDGAN, D aims to maximize the probability of correctly distinguishing the true and pseudo labels, whereas C and T aim to minimize the probability that D rejects their generated pseudo labels. Meanwhile, C learns from T by mimicking the learned distribution of T. To build a general framework, we also enable T to learn from C because, in reality, a teacher\u2019s ability can also be enhanced by interacting with students (see Figure 6 in Appendix D for empirical evidence that T benefits from learning from C). Such a mutual learning helps C and T reduce their probability of generating different pseudo labels.\n\nAlgorithm 1: Minibatch stochastic gradient descent training of KDGAN.\n1: Pretrain a classifier C, a teacher T, and a discriminator D with the training data {(x1, y1), ..., (xn, yn)}.\n2: for the number of training epochs do\n3:   for the number of training steps for the discriminator do\n4:     Sample labels {y1, ..., yk}, {yc1, ..., yck}, and {yt1, ..., ytk} from pu(y|x), qc(y|x), and q\u0001t(y|x).\n5:     Update D by ascending along its gradients (1/k) \u03a3i (\u2207d log p\u0001d(x, yi) + \u03b1\u2207d log(1 \u2212 p\u0001d(x, zci)) + (1 \u2212 \u03b1)\u2207d log(1 \u2212 p\u0001d(x, zti))).\n6:   for the number of training steps for the teacher do\n7:     Sample labels {yt1, ..., ytk} from q\u0001t(y|x) and update the teacher by descending along its gradients (1/k) \u03a3i (1 \u2212 \u03b1)\u2207t log q\u0001t(yti|x) log(1 \u2212 p\u0001d(x, zti)) + \u03b3\u2207tLtDS(p\u0001t(y|x), pc(y|x)).\n8:   for the number of training steps for the classifier do\n9:     Sample labels {yc1, ..., yck} from qc(y|x) and update C by descending along its gradients (1/k) \u03a3i \u03b1\u2207c log qc(yci|x) log(1 \u2212 p\u0001d(x, zci)) + \u03b2\u2207cLcDS(pc(y|x), p\u0001t(y|x)).
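The alternating schedule of Algorithm 1 can be skeletonized as follows (purely illustrative; the `update_*` callbacks and step counts are placeholders rather than the paper's TensorFlow code):

```python
def train_kdgan(num_epochs, d_steps, t_steps, c_steps,
                update_d, update_t, update_c):
    """Alternate discriminator, teacher, and classifier updates per epoch,
    mirroring the minibatch SGD schedule of Algorithm 1 (pretraining of the
    three players is assumed to have happened before this loop)."""
    log = []
    for epoch in range(num_epochs):
        for _ in range(d_steps):   # ascend the adversarial objective w.r.t. D
            log.append(update_d(epoch))
        for _ in range(t_steps):   # descend adversarial + distillation losses w.r.t. T
            log.append(update_t(epoch))
        for _ in range(c_steps):   # descend adversarial + distillation losses w.r.t. C
            log.append(update_c(epoch))
    return log

# Usage with stub updates that just record which player was trained.
log = train_kdgan(2, 1, 1, 1,
                  update_d=lambda e: ("d", e),
                  update_t=lambda e: ("t", e),
                  update_c=lambda e: ("c", e))
```

The real updates would apply the gradient expressions of Algorithm 1 to minibatches; the skeleton only fixes the D-then-T-then-C ordering within each epoch.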
Formally, we define the value function U (c, t, d) for the minimax game in KDGAN as\n\nminc,t maxd U (c, t, d) = Ey\u223cpu[log p\u0001d(x, y)] + \u03b1Ey\u223cpc[log(1 \u2212 p\u0001d(x, y))] + (1 \u2212 \u03b1)Ey\u223cp\u0001t[log(1 \u2212 p\u0001d(x, y))] + \u03b2LcDS(pc(y|x), p\u0001t(y|x)) + \u03b3LtDS(p\u0001t(y|x), pc(y|x)), (3)\n\nwhere \u03b1 \u2208 (0, 1), \u03b2 \u2208 (0, +\u221e), and \u03b3 \u2208 (0, +\u221e) are hyperparameters. We collectively refer to the expectation terms as the adversarial losses and refer to LcDS and LtDS as the distillation losses. The distillation losses can be defined in several ways [39], e.g., the L2 loss [7] or Kullback\u2013Leibler divergence [23]. Note that LcDS and LtDS are used to train the classifier and the teacher, respectively.\nTheoretical Analysis. We show that the classifier perfectly learns the true data distribution at the equilibrium of KDGAN. To see this, let p\u0001\u03b1(y|x) = \u03b1pc(y|x) + (1 \u2212 \u03b1)p\u0001t(y|x). It can be shown that the adversarial losses w.r.t. pc(y|x) and p\u0001t(y|x) are equal to an adversarial loss w.r.t. p\u0001\u03b1(y|x):\n\n\u03b1Ey\u223cpc[log(1 \u2212 p\u0001d(x, y))] + (1 \u2212 \u03b1)Ey\u223cp\u0001t[log(1 \u2212 p\u0001d(x, y))]\n= \u03b1 \u03a3y pc(y|x) log(1 \u2212 p\u0001d(x, y)) + (1 \u2212 \u03b1) \u03a3y p\u0001t(y|x) log(1 \u2212 p\u0001d(x, y))\n= \u03a3y (\u03b1pc(y|x) + (1 \u2212 \u03b1)p\u0001t(y|x)) log(1 \u2212 p\u0001d(x, y))\n= Ey\u223cp\u0001\u03b1[log(1 \u2212 p\u0001d(x, y))]. (4)\n\nTherefore, let LMD = \u03b2LcDS(pc(y|x), p\u0001t(y|x)) + \u03b3LtDS(p\u0001t(y|x), pc(y|x)) and LJS be the Jensen\u2013Shannon divergence; the value function U (c, t, d) of the minimax game can be rewritten as\n\nminc,t maxd Ey\u223cpu[log p\u0001d(x, y)] + Ey\u223cp\u0001\u03b1[log(1 \u2212 p\u0001d(x, y))] + LMD\n= minc,t 2LJS(pu(y|x)||p\u0001\u03b1(y|x)) + \u03b2LcDS(pc(y|x), p\u0001t(y|x)) + \u03b3LtDS(p\u0001t(y|x), pc(y|x)) \u2212 log(4). (5)\n\nHere, LJS reaches the minimum if and only if p\u0001\u03b1(y|x) = pu(y|x), and LcDS (or LtDS) reaches the minimum if and only if pc(y|x) = p\u0001t(y|x). Hence, the KDGAN equilibrium is reached if and only if pc(y|x) = p\u0001t(y|x) = pu(y|x), where the classifier learns the true data distribution. We summarize the above discussions in Lemma 4.1 (the necessary and sufficient conditions of maximizing the value function) and Theorem 4.2 (achieving the equilibrium), respectively (see Appendix A for proofs).\nLemma 4.1. For any fixed classifier and teacher, the value function U (c, t, d) is maximized if and only if the distribution of the discriminator is given by p\u0001d(x, y) = pu(y|x)/(pu(y|x) + p\u0001\u03b1(y|x)).\nTheorem 4.2. The equilibrium of the minimax game minc,t maxd U (c, t, d) is achieved if and only if pc(y|x) = p\u0001t(y|x) = pu(y|x).
At that point, U (c, t, d) reaches the value \u2212 log(4).\n\n3.3 KDGAN Training\n\nIn this section, we detail techniques for accelerating the training speed of KDGAN via reducing the number of training epochs needed. As discussed in earlier studies [8, 46], the training speed is closely related to the variance of gradients. Comparing with NaGAN, the KDGAN framework by design can reduce the variance of gradients. This is because the high variance of a random variable can be reduced by a low-variance random variable (detailed in Lemma 4.3) and, as we will discuss, T provides gradients of lower variance than D does. To reduce the variance of gradients from D and attain sufficient control over the variance, we further propose to obtain gradients from a continuous space by relaxing the discrete samples, i.e., pseudo labels, propagated between the classifier (or the teacher) and the discriminator into continuous samples with a reparameterization trick [25, 31].\nFirst, we show how KDGAN reduces the variance of gradients. As discussed above, C only receives gradients \u2207cV from D in NaGAN while it receives gradients \u2207cU from both D and T in KDGAN:\n\n\u2207cV = \u2207cLnAD, \u2207cU = \u03bb\u2207cLnAD + (1 \u2212 \u03bb)\u2207cLcDS, (6)\n\nwhere \u03bb \u2208 (0, 1), and \u2207cLnAD and \u2207cLcDS are gradients from D and T, respectively. Consistent with the findings in existing work [23, 39], we also observe that \u2207cLcDS usually has a lower variance than \u2207cLnAD (see Figure 7 in Appendix D for empirical evidence that the variance of \u2207cLcDS is smaller than that of \u2207cLnAD during the training process). Hence, it can be easily shown that the gradients w.r.t. C in KDGAN have a lower variance than that in NaGAN (refer to Lemma 4.3):\n\nVar(\u2207cLcDS) \u2264 Var(\u2207cLnAD) \u21d2 Var(\u2207cU) \u2264 Var(\u2207cV). (7)\n\nNext, we further reduce the variance of gradients with a reparameterization trick, in particular, the Gumbel-Max trick [20, 30]. The essence of the Gumbel-Max trick is to reparameterize generating discrete samples into a differentiable function of its parameters and an additional random variable of a Gumbel distribution. To perform the Gumbel-Max trick on generating discrete samples from the categorical distribution pc(y|x), a concrete distribution [25, 31] can be used. We use a concrete distribution qc(y|x) to generate continuous samples and use the continuous samples to compute the gradients \u2207cLnAD of the adversarial loss w.r.t. the classifier as\n\n\u2207cLnAD = \u2207cEy\u223cpc[log(1 \u2212 p\u0001d(x, y))] = Ey\u223cqc[\u2207c log qc(y|x) log(1 \u2212 p\u0001d(x, z))]. (8)\n\nHere, z = onehot(argmax y) is a discrete pseudo label where y \u223c qc(y|x). We define qc(y|x) as\n\nqc(y|x) = softmax((log pc(y|x) + g)/\u03c4), g \u223c Gumbel(0, 1). (9)\n\nHere, \u03c4 \u2208 (0, +\u221e) is a temperature parameter and Gumbel(0, 1) is the Gumbel distribution [31]. We leverage the temperature parameter \u03c4 to control the variance of gradients over the training. With a high temperature, the samples from the concrete distribution are smooth, which gives low-variance gradient estimates. Note that a disadvantage of the concrete distribution is that with a high temperature, it becomes a less accurate approximation to the original categorical distribution, which causes biased gradient estimates. We will discuss how to tune the temperature parameter in Section 4. In addition to improving the training of C, we also apply the same techniques to improve the training of T.
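A minimal sketch of sampling from the concrete relaxation in Eq. (9) (illustrative NumPy only; the toy distribution, seed, and temperature are assumptions): draw g from Gumbel(0, 1), scale by \u03c4, and take a softmax to obtain a continuous sample y, from which z = onehot(argmax y):

```python
import numpy as np

def concrete_sample(log_probs, tau, rng):
    # Gumbel-Max relaxation: g ~ Gumbel(0, 1), then softmax((log p + g) / tau).
    u = rng.uniform(size=log_probs.shape)  # u ~ Uniform(0, 1)
    g = -np.log(-np.log(u))                # g ~ Gumbel(0, 1) noise
    y = (log_probs + g) / tau              # temperature-scaled logits
    e = np.exp(y - y.max())                # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
p_c = np.array([0.7, 0.2, 0.1])            # toy categorical distribution pc(y|x)
y = concrete_sample(np.log(p_c), tau=0.5, rng=rng)  # continuous sample from qc(y|x)
z = np.eye(p_c.size)[y.argmax()]           # discrete pseudo label z = onehot(argmax y)
```

Because y is a differentiable function of the log-probabilities, gradients can flow through the sample; lowering \u03c4 makes y closer to one-hot but noisier, matching the bias/variance trade-off described above.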
We update D with the back-propagation algorithm [37] (detailed in Appendix B). The overall logic of the KDGAN training is summarized in Algorithm 1. The three players can be first pretrained separately and then trained alternatively via minibatch stochastic gradient descent.\n\n4 Experiments\n\nThe proposed KDGAN framework can be applied to a wide range of multi-label learning tasks where privileged provision is available. To show the applicability of KDGAN, we conduct experiments with the tasks of deep model compression (Section 4.1) and image tag recommendation (Section 4.2). Note that privileged provision refers to computational resources in deep model compression and to privileged information in image tag recommendation, respectively.\nWe implement KDGAN based on Tensorflow [1] and here we briefly describe our experimental setup3. We use two formulations of the distillation losses: the L2 loss [7] and the Kullback\u2013Leibler divergence [23]. The two formulations exhibit comparable results, and the results presented are based on the L2 loss [7]. Since both T and D can use privileged provision, we implement their scoring functions f (x, y) and g(x, y) using the same function s(x, y) but with different sets of parameters. We search for the optimal values for the hyperparameters \u03b1 in [0.0, 1.0], \u03b2 in [0.001, 1000], and \u03b3 in [0.0001, 100] based on validation performance. We find that a reasonable annealing schedule for the temperature parameter \u03c4 is to start with a large value (1.0) and exponentially decay it to a small value (0.1).
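The annealing schedule described above can be sketched as an exponential decay between the stated endpoints (the interpolation formula is an assumption; only the 1.0 and 0.1 endpoints come from the text):

```python
def annealed_temperature(epoch, num_epochs, tau_start=1.0, tau_end=0.1):
    # Exponentially decay tau from tau_start (epoch 0) to tau_end (last epoch).
    rate = (tau_end / tau_start) ** (1.0 / max(num_epochs - 1, 1))
    return tau_start * rate ** epoch

taus = [annealed_temperature(e, 200) for e in range(200)]  # 1.0 down to 0.1
```

Early epochs thus use a high temperature (smooth samples, low-variance but biased gradients), and later epochs use a low temperature (near-discrete samples, less bias).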
We leave the exploration of the optimal schedule for future work.\n\n2 The Gumbel distribution can be sampled by drawing u \u223c Uniform(0, 1) and computing g = \u2212 log(\u2212 log u).\n3 The code and the data are made available at https://github.com/xiaojiew1/KDGAN/.\n\nTable 1: Average accuracy over 10 runs in model compression (n is the number of training instances).\n\nMethod | MNIST n = 100 | MNIST n = 1,000 | MNIST n = 10,000 | CIFAR-10 n = 500 | CIFAR-10 n = 5,000 | CIFAR-10 n = 50,000\nCODIS | 74.02 \u00b1 0.13 | 95.77 \u00b1 0.10 | 98.89 \u00b1 0.08 | 54.17 \u00b1 0.20 | 77.82 \u00b1 0.14 | 85.12 \u00b1 0.11\nDISTN | 68.34 \u00b1 0.06 | 93.97 \u00b1 0.08 | 98.79 \u00b1 0.07 | 50.92 \u00b1 0.18 | 76.59 \u00b1 0.15 | 83.32 \u00b1 0.08\nNOISY | 66.53 \u00b1 0.18 | 93.45 \u00b1 0.11 | 98.58 \u00b1 0.11 | 50.18 \u00b1 0.28 | 75.42 \u00b1 0.19 | 82.99 \u00b1 0.12\nMIMIC | 67.35 \u00b1 0.15 | 93.78 \u00b1 0.13 | 98.65 \u00b1 0.05 | 51.74 \u00b1 0.23 | 75.66 \u00b1 0.17 | 84.33 \u00b1 0.10\nNaGAN | 64.90 \u00b1 0.31 | 93.60 \u00b1 0.22 | 98.95 \u00b1 0.19 | 46.29 \u00b1 0.32 | 76.11 \u00b1 0.24 | 85.34 \u00b1 0.27\nKDGAN | 77.95 \u00b1 0.05 | 96.42 \u00b1 0.05 | 99.25 \u00b1 0.02 | 57.56 \u00b1 0.13 | 79.36 \u00b1 0.04 | 86.50 \u00b1 0.04\n\nFigure 3: Training curves of the classifier in the proposed NaGAN and KDGAN. (a) Deep model compression over MNIST. (b) Image tag recommendation on YFCC100M.\n\n4.1 Deep Model Compression\n\nDeep model compression aims to reduce the storage and runtime complexity of deep models and to improve the deployability of such models on portable devices such as smart phones. Extensive computational resources available for training are considered privileged provision in this task.\nDataset and Setup. We use the widely adopted MNIST [27] and CIFAR-10 [26] datasets. The MNIST dataset has 60,000 grayscale images (50,000 for training and 10,000 for testing) with 10 different label classes. Following an earlier work [39], we do not preprocess the images on MNIST.
The CIFAR-10\ndataset has 60,000 colored images (50,000 for training and 10,000 for testing) with 10 different\nlabel classes. We preprocess the images by subtracting per-pixel mean, and we augment the training\ndata by mirrored images. We vary the number of training instances in [100, 10000] on MNIST and\nin [500, 50000] on CIFAR-10. The scoring functions h(x, y) and s(x, y) are implemented as an\nMLP (1.2M parameters) and a LeNet (3.1M parameters) on MNIST; while h(x, y) and s(x, y) are\nimplemented as a LeNet (0.5M parameters) and a ResNet (1.7M parameters) on CIFAR-10 (detailed\nin Appendix C). We evaluate various methods over 10 runs with different initialization of C and\nreport the mean accuracy and the standard deviation. Since the focus of this paper is to achieve a\nbetter accuracy for a given architecture of the classi\ufb01er, we defer the discussion on the classi\ufb01er\u2019s\nratio of compression and loss of accuracy w.r.t. the teacher to Table 3 in Appendix D.\nResults and Discussions. First, we compare the proposed NaGAN and KDGAN with KD-based\nmethods including MIMIC [7], DISTN [23], NOISY [39], and CODIS [2]. The results obtained by\nvarying the number of training images on MNIST and CIFAR-10 are summarized in Table 1. On both\ndatasets, KDGAN consistently outperforms the KD-based methods by a large margin. For example,\nKDGAN achieves as much as 5.31% performance gain with 100 training images on MNIST. We\nfurther compare NaGAN with the KD-based methods. We observe that NaGAN performs better\nwhen a large amount of training data are available (e.g., 50,000 training images on CIFAR-10) while\nKD-based methods perform better when a small number of training images are available (e.g., 500\ntraining images on CIFAR-10). This is consistent with our analysis in Section 3.1 that NaGAN can\nlearn the true data distribution better, although this requires a large amount of training data.\nThen, we compare NaGAN with KDGAN. 
As shown in Table 1, KDGAN achieves a larger performance gain over NaGAN when fewer training instances are available. This indicates that KDGAN requires fewer training instances than NaGAN to reach the same level of accuracy, which can be explained by the fact that KDGAN introduces T to provide soft labels for training C. The soft labels generally have high entropy and reveal much useful information about each training instance. Hence, the soft labels impose stronger constraints on the parameters of C than the true labels do, which reduces the number of training instances required to train C.

(a) Effect of varying α     (b) Effect of varying β     (c) Effect of varying γ

Figure 4: Effects of hyperparameters in KDGAN on MNIST for deep model compression.

We further investigate the training speed of NaGAN and KDGAN by the number of training epochs. Typical learning curves of C in NaGAN and KDGAN are shown in Figure 3a. Due to the page limit, we only show the results using 100 training images on MNIST. We find that KDGAN converges to a better accuracy within fewer training epochs (about 25) than NaGAN (about 135). After convergence, the training curve of KDGAN is also more stable than that of NaGAN. Moreover, we investigate the benefit that the Gumbel-Max trick provides to the KDGAN training. We perform the KDGAN training without the Gumbel-Max trick (referred to as KDGAN-WO-GM) and also plot its accuracy against training epochs in Figure 3a. Comparing KDGAN with KDGAN-WO-GM, we see that the Gumbel-Max trick speeds up the training process by around 45% in terms of training epochs. The Gumbel-Max trick also improves the accuracy from 0.7605 to 0.7795 (by around 2.5%).
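To make the Gumbel-Max trick and its continuous relaxation concrete, here is a minimal NumPy sketch (not the paper's implementation; function names and the temperature value are illustrative). Per footnote 2, a Gumbel sample is g = −log(−log u) with u ∼ Uniform(0, 1); taking argmax(log π + g) yields an exact categorical sample, while replacing the argmax with a tempered softmax gives a sample from the concrete distribution, which is differentiable and yields low-variance gradient updates.

```python
import numpy as np

def sample_gumbel(shape, rng):
    # Footnote 2: draw u ~ Uniform(0, 1) and compute g = -log(-log u)
    u = rng.uniform(1e-12, 1.0, size=shape)
    return -np.log(-np.log(u))

def gumbel_max_sample(log_probs, rng):
    # Exact (discrete) categorical sample: argmax of perturbed log-probabilities
    return int(np.argmax(log_probs + sample_gumbel(log_probs.shape, rng)))

def concrete_sample(log_probs, temperature, rng):
    # Continuous relaxation (concrete distribution): replace the argmax
    # with a tempered softmax, giving a differentiable point on the simplex
    y = (log_probs + sample_gumbel(log_probs.shape, rng)) / temperature
    y -= y.max()  # numerical stability
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)
log_probs = np.log(np.array([0.7, 0.2, 0.1]))
hard = gumbel_max_sample(log_probs, rng)     # an index in {0, 1, 2}
soft = concrete_sample(log_probs, 0.5, rng)  # a point on the probability simplex
```

As the temperature approaches zero, the concrete sample approaches the one-hot vector produced by the Gumbel-Max argmax.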
One possible reason is that the Gumbel-Max trick effectively reduces the variance of the gradients from the discriminator, as discussed in Section 3.3. We also observe this in our experiments, e.g., by comparing the gradient variance from the adversarial loss without the Gumbel-Max trick in Figure 7a against that with the Gumbel-Max trick in Figure 7b (see Appendix D for details).

Next, we study the reasons for the higher accuracy of KDGAN. Figure 4 shows how the accuracy of KDGAN varies with the hyperparameters on the MNIST dataset (note the logarithmic scale of the x-axis in Figures 4b and 4c). We find that α and β have a relatively small effect on the accuracy, which suggests that KDGAN is a robust framework. That said, setting β to a small value (0.0001) causes an accuracy drop of more than 2% when KDGAN is trained with 100 training instances. This shows that T is important for training C when the number of training instances is small. We further find that a large value of γ causes the accuracy to deteriorate rapidly. This is because the soft labels provided by C are usually noisy; emphasizing training T to predict these noisy labels decreases the accuracy of T, which in turn decreases the accuracy of C. We obtain similar results for the effects of the hyperparameters on the CIFAR-10 dataset.

4.2 Image Tag Recommendation

Image tag recommendation aims to recommend relevant tags (i.e., labels) when a user uploads an image to an image-hosting website such as Flickr4. As discussed before, we aim to recommend relevant tags right after a user uploads an image, so that the user can simply select from the recommended tags instead of typing them. Users may later add further text for an uploaded image, such as a title and a description. We use such additional text only at the training stage, as privileged information available to the teacher and the discriminator.
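The training/inference asymmetry of privileged information can be sketched as follows (a minimal NumPy sketch with hypothetical dimensions; the actual scorers, an MLP over image features for h(x, y) and an MLP over the element-wise product of image and text features for s(x, y), are detailed in the paper's Appendix C):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_score(x, w1, w2):
    # A one-hidden-layer MLP producing a score per tag (ReLU hidden layer)
    return np.maximum(x @ w1, 0.0) @ w2

# Hypothetical dimensions for illustration only
d_feat, d_hid, n_tags = 8, 16, 5

# Classifier h(x, y): scores tags from image features alone,
# so it needs no text at inference time
w1_h, w2_h = rng.normal(size=(d_feat, d_hid)), rng.normal(size=(d_hid, n_tags))

# Teacher/discriminator s(x, y): scores tags from the element-wise product of
# image and text features; the text is privileged, available during training only
w1_s, w2_s = rng.normal(size=(d_feat, d_hid)), rng.normal(size=(d_hid, n_tags))

img_feat = rng.normal(size=d_feat)  # e.g., from a pretrained VGGNet
txt_feat = rng.normal(size=d_feat)  # e.g., from an LSTM over title/description

scores_inference = mlp_score(img_feat, w1_h, w2_h)            # image only
scores_training = mlp_score(img_feat * txt_feat, w1_s, w2_s)  # with privileged text
```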
At the inference stage, our trained model (i.e., the classifier) takes only an image as input to make tag recommendations.

Dataset and Setup. We use the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset5 in the experiments [45]. To simulate the case where additional text about images is available for training, we randomly sample 20,000 images with titles or descriptions for training and another 2,000 images for testing. We create one dataset of images labeled with the 200 most popular tags and another dataset of images labeled with 200 randomly sampled tags. Following an earlier study [3], we use a VGGNet [40] pretrained on ImageNet [14] to extract image features and an LSTM [24] with pretrained word embeddings [34] to learn text features. We implement h(x, y) as an MLP taking the image features as input, and s(x, y) as an MLP taking the element-wise product of the image and text features as input (detailed in Appendix C). We use precision (P@N), F-score (F@N), mean average precision (MAP), and mean reciprocal ranking (MRR) to evaluate performance.

4 https://www.flickr.com/.
5 Yahoo Webscope Program.
http://webscope.sandbox.yahoo.com/.

Table 2: Performance of various methods on the YFCC100M dataset in tag recommendation.

                       Most Popular Tags                           Randomly Sampled Tags
Method   P@3    P@5    F@3    F@5    MAP    MRR     P@3    P@5    F@3    F@5    MAP    MRR
KNN      .2320  .1680  .2339  .1633  .5755  .5852   .1623  .1198  .1575  .1088  .3970  .4092
TPROP    .2420  .1636  .2811  .1883  .6177  .6270   .1949  .1372  .1810  .1252  .4512  .4636
TFEAT    .2560  .1752  .2871  .2002  .6417  .6503   .1999  .1420  .2195  .1495  .5149  .5309
REXMP    .2720  .1800  .3324  .2228  .7015  .7122   .2295  .1378  .2427  .1669  .5205  .5331
NaGAN    .2892  .1880  .3516  .2415  .7432  .7555   .2352  .1495  .2693  .1867  .5791  .5911
KDGAN    .3047  .1968  .3678  .2526  .7787  .7905   .2572  .1666  .2946  .2009  .6302  .6452

(a) Effect of varying α     (b) Effect of varying β     (c) Effect of varying γ

Figure 5: Effects of hyperparameters in KDGAN on YFCC100M for image tag recommendation.

Results and Discussions. First, we compare C in KDGAN with KNN [32], TPROP [19], TFEAT [11], and REXMP [28]. The overall results are presented in Table 2. We find that KDGAN achieves significant improvements over the other methods across all the measures. Although KDGAN does not explicitly model the semantic similarity between two labels as REXMP does, it still makes better recommendations than REXMP. The reason is that in KDGAN, T provides C with soft labels during training. The soft labels contain a rich similarity structure over tags, which cannot be modeled well by any pairwise tag similarity of the kind used in REXMP.
For example, an image labeled with the tag volleyball is supplied with a soft label assigning a probability of 10−2 to basketball, 10−4 to baseball, and 10−8 to dragonfly. How T generalizes is reflected in these relative probabilities over tags, which can guide C to generalize better.

Next, we compare the training curves of NaGAN, KDGAN-WO-GM, and KDGAN. We only plot the performance measured by P@3 in Figure 3b because the other measures exhibit similar training curves. We find that KDGAN learns a more accurate classifier within fewer training epochs (about 100) than NaGAN (about 220) and KDGAN-WO-GM (about 150). After convergence, KDGAN consistently outperforms the best baseline, REXMP.

Last, we investigate how the performance of KDGAN varies with the hyperparameters on the YFCC100M dataset. The results are summarized in Figure 5 and are consistent with our observations in the task of deep model compression.

5 Conclusion

We proposed a framework named KDGAN that distills knowledge with generative adversarial networks for multi-label learning with privileged provision. We defined the KDGAN framework as a minimax game in which a classifier, a teacher, and a discriminator are trained adversarially, and we proved that this game has an equilibrium at which the classifier perfectly models the true data distribution. We used the concrete distribution to control the variance of gradients during the adversarial training, obtaining low-variance gradient estimates that accelerate the training. We showed that KDGAN outperforms state-of-the-art methods in two important applications: image tag recommendation and deep model compression. We also showed that KDGAN learns a more accurate classifier at a faster speed than a naive GAN (NaGAN) does.
For future work, we will explore adaptive methods for determining model hyperparameters to achieve better training dynamics.

Acknowledgement

This work is supported by Australian Research Council Future Fellowship Project FT120100832 and Discovery Project DP180102050. We thank the anonymous reviewers for their feedback on the paper. We have incorporated responses to the reviewers' comments in the paper.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.
[2] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton. Large scale distributed neural network training through online distillation. In ICLR, 2018.
[3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.
[4] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
[5] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[6] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In ICML, 2017.
[7] J. Ba and R. Caruana. Do deep nets really need to be deep? In NeurIPS, 2014.
[8] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
[9] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In SIGKDD, 2006.
[10] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker. Learning efficient object detection models with knowledge distillation. In NeurIPS, 2017.
[11] L. Chen, D. Xu, I. W. Tsang, and J. Luo. Tag-based image retrieval improved by augmented features and group-based refinement. IEEE Transactions on Multimedia, 2012.
[12] W. Cheng, E. Hüllermeier, and K. J. Dembczynski. Label ranking methods based on the Plackett-Luce model. In ICML, 2010.
[13] L. Chongxuan, T. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. In NeurIPS, 2017.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[15] S. Feizi, C. Suh, F. Xia, and D. Tse. Understanding GANs: The LQG setting. arXiv preprint arXiv:1710.10793, 2017.
[16] Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. In NeurIPS, 2017.
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
[18] M. Grbovic, N. Djuric, S. Guo, and S. Vucetic. Supervised clustering of label ranking data using label preference information. Machine Learning, 2013.
[19] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV, 2009.
[20] E. Gumbel. Statistical theory of extreme values and some practical applications: A series of lectures. US Government Printing Office, Washington, 1954.
[21] S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[23] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NeurIPS Workshop, 2014.
[24] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[25] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. In ICLR, 2017.
[26] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[28] X. Li and C. G. Snoek. Classifying tag relevance with relevant positive and negative examples. In ACMMM, 2013.
[29] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. In ICLR, 2016.
[30] C. J. Maddison, D. Tarlow, and T. Minka. A* sampling. In NeurIPS, 2014.
[31] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
[32] A. Makadia, V. Pavlovic, and S. Kumar. Baselines for image annotation. IJCV, 2010.
[33] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017.
[34] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NeurIPS, 2013.
[35] D. Pechyony and V. Vapnik. On the theory of learning with privileged information. In NeurIPS, 2010.
[36] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[37] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[38] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NeurIPS, 2016.
[39] B. B. Sau and V. N. Balasubramanian. Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650, 2016.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[41] J.-C. Su and S. Maji. Cross quality distillation. arXiv preprint arXiv:1604.00433, 2016.
[42] Y. Sun, N. J. Yuan, Y. Wang, X. Xie, K. McDonald, and R. Zhang. Contextual intent tracking for personal assistants. In SIGKDD, 2016.
[43] Y. Sun, N. J. Yuan, X. Xie, K. McDonald, and R. Zhang. Collaborative nowcasting for contextual recommendation. In WWW, 2016.
[44] Y. Sun, N. J. Yuan, X. Xie, K. McDonald, and R. Zhang. Collaborative intent prediction with real-time contextual data. TOIS, 2017.
[45] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 2016.
[46] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017.
[47] V. Vapnik and R. Izmailov. Learning using privileged information: Similarity control and knowledge transfer. JMLR, 2015.
[48] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 2009.
[49] J. Wang, L. Yu, W. Zhang, Y. Gong, Y. Xu, B. Wang, P. Zhang, and D. Zhang. IRGAN: A minimax game for unifying generative and discriminative information retrieval models. In SIGIR, 2017.
[50] X. Wang, J. Qi, K. Ramamohanarao, Y. Sun, B. Li, and R. Zhang. A joint optimization approach for personalized recommendation diversification. In PAKDD, 2018.
[51] Z. Xu, Y.-C. Hsu, and J. Huang. Learning loss for knowledge distillation with conditional adversarial networks. arXiv preprint arXiv:1709.00513, 2017.
[52] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
[53] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms. TKDE, 2014.
[54] Y. Zhang, Z. Gan, and L. Carin. Generating text via adversarial training. In NeurIPS Workshop on Adversarial Training, 2016.
[55] Y. Zhang, Z. Gan, K. Fan, Z. Chen, R. Henao, D. Shen, and L. Carin. Adversarial feature matching for text generation. In ICML, 2017.
[56] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.