{"title": "MetaGAN: An Adversarial Approach to Few-Shot Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2365, "page_last": 2374, "abstract": "In this paper, we propose a conceptually simple and general framework called MetaGAN for few-shot learning problems. Most state-of-the-art few-shot classification models can be integrated with MetaGAN in a principled and straightforward way. By introducing an adversarial generator conditioned on tasks, we augment vanilla few-shot classification models with the ability to discriminate between real and fake data.  We argue that this GAN-based approach can help few-shot classifiers to learn sharper decision boundary, which could generalize better. We show that with our MetaGAN framework, we can extend supervised few-shot learning models to naturally cope with unsupervised data. Different from previous work in semi-supervised few-shot learning, our algorithms can deal with semi-supervision at both sample-level and task-level. We give theoretical justifications of the strength of MetaGAN, and validate the effectiveness of MetaGAN on challenging few-shot image classification benchmarks.", "full_text": "MetaGAN: An Adversarial Approach to Few-Shot\n\nLearning\n\nRuixiang Zhang\u2217\u2020\n\nMILA, Universit\u00e9 de Montr\u00e9al\n\nsodabeta7@gmail.com\n\nTong Che\u2217\n\nMILA, Universit\u00e9 de Montr\u00e9al\ntongcheprivate@gmail.com\n\nZoubin Ghahramani\nUniversity of Cambridge\n\nzoubin@cam.ac.uk\n\nYoshua Bengio\n\nMILA, Universit\u00e9 de Montr\u00e9al, CIFAR Senior Fellow\n\nyoshua.bengio@mila.quebec\n\nYangqiu Song\n\nHKUST\n\nyqsong@cse.ust.hk\n\nAbstract\n\nIn this paper, we propose a conceptually simple and general framework called\nMetaGAN for few-shot learning problems. Most state-of-the-art few-shot classi\ufb01-\ncation models can be integrated with MetaGAN in a principled and straightforward\nway. 
By introducing an adversarial generator conditioned on tasks, we augment\nvanilla few-shot classification models with the ability to discriminate between real\nand fake data. We argue that this GAN-based approach can help few-shot classifiers\nto learn a sharper decision boundary, which could generalize better. We show\nthat with our MetaGAN framework, we can extend supervised few-shot learning\nmodels to naturally cope with unlabeled data. Different from previous work in\nsemi-supervised few-shot learning, our algorithms can deal with semi-supervision\nat both sample-level and task-level. We give theoretical justifications of the strength\nof MetaGAN, and validate the effectiveness of MetaGAN on challenging few-shot\nimage classification benchmarks.\n\n1 INTRODUCTION\n\nDeep neural networks have achieved great success in many artificial intelligence tasks. However, they\ntend to struggle when data is scarce or when they need to adapt to new tasks within a small number of\nsteps. On the other hand, humans are able to learn new concepts quickly, given just a few examples.\nThe usual explanation for this performance gap between human and artificial learners is\nthat humans can effectively utilize prior experiences and knowledge when learning a new task, while\nartificial learners usually overfit severely without the necessary prior knowledge.\nMeta-learning [Thrun, 1998, Hochreiter et al., 2001] addresses this problem by training a particular\nadaptation strategy on a distribution of similar tasks, trying to extract transferable patterns useful\nfor many tasks. Recently, many different meta-learning or few-shot learning algorithms have been\nproposed. 
These algorithms may take the form of learning a shared metric [Sung et al., 2018, Snell\net al., 2017], a shared initialization of network parameters [Finn et al., 2017], shared optimization\nalgorithms [Ravi and Larochelle, 2017, Munkhdalai et al., 2017, Munkhdalai and Yu, 2017], or\ngeneric inference networks [Santoro et al., 2016, Mishra et al., 2018]. In the context of few-shot\nclassification, these algorithms try to learn a good strategy to form a correct decision boundary\nbetween different classes from only a few samples of data in each class.\n\n\u2217Equal contribution.\n\u2020Work done at HKUST\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this work we present MetaGAN as a general and flexible framework for few-shot learning. Most\nstate-of-the-art few-shot learning models can be integrated into MetaGAN seamlessly. While most\nfew-shot learning models consider how to effectively utilize a few labeled data in a supervised learning\nway, semi-supervised few-shot learning, recently studied in [Ren et al., 2018], addresses the case\nwhere unlabeled data are available. In this paper, we show that both supervised few-shot learning and\nsemi-supervised few-shot learning can be unified naturally with our proposed MetaGAN framework.\nWe can further extend the sample-level semi-supervised learning proposed in [Ren et al., 2018] to\nthe task level. For sample-level semi-supervised few-shot learning, we allow some training samples\nto be unlabeled within a task. These training samples can either come from the same classes as the\nlabeled samples, or come from different \"distractor\" classes. For task-level semi-supervised few-shot\nlearning, we also allow purely unsupervised tasks, in which the support and query samples are all\nunlabeled. Task-level semi-supervised few-shot learning can be very natural in practice. 
For example,\nwe can have robots with cameras collecting data in different places. It is safe to assume that the data\ncollected by one robot in a short time range come from a specific distribution, so classifying these\nimages can be viewed as one task. But these tasks are completely unlabeled, both in the support and\nin the query sets. The MetaGAN algorithm is able to learn to infer the shape and boundaries of data\nmanifolds of the task-specific data distribution from both labeled and unlabeled examples.\nWe provide both intuitive and formal theoretical justifications for the key idea behind MetaGAN. The\nmain difficulty in few-shot learning is how to form generalizable decision boundaries from a small\nnumber of training samples. We argue that adversarial training can help few-shot learning models\nby making it easier to learn better decision boundaries between different classes. Although training\ndata is usually very limited for each task, we show how fake data generated by an imperfect\ngenerator in MetaGAN can help the classifier identify much tighter decision boundaries (real-fake\ndecision boundaries) and thus can help boost the performance of few-shot learning.\nWe demonstrate the effectiveness of MetaGAN on popular few-shot image classification benchmarks\nin both supervised and semi-supervised settings. We choose two representative few-shot learning\nmodels, MAML [Finn et al., 2017] representing models that learn to adapt using gradients, and\nRelation Network [Sung et al., 2018] representing models that learn distance metrics, and combine\nthem with MetaGAN. 
\u00b3 We show that MetaGAN can consistently improve the performance of popular\nfew-shot classifiers in all of these scenarios.\n\n2 BACKGROUND\n\n2.1 FEW-SHOT LEARNING\nWe formally define few-shot learning problems as follows: Given a distribution of tasks P (T ), a\nsample task T from P (T ) is given by a joint distribution P^T_{X\u00d7Y}(x, y), where the task is to predict\ny given x. We have a set of training sample tasks {T_i}_{i=1}^N. Each training sample task T is a tuple\nT = (ST , QT ), where the support set is denoted as ST = SsT \u222a SuT , and the query set is denoted\nas QT = QsT \u222a QuT . The supervised support set SsT = {(x1, y1), (x2, y2),\u00b7\u00b7\u00b7 (xN\u00d7K, yN\u00d7K)}\ncontains K labeled samples from each of the N classes (this is usually known as K-shot N-way\nclassification). The optional unlabeled support set SuT = {x1, x2,\u00b7\u00b7\u00b7 xM} contains unlabeled\nsamples from the same set of N classes, which can also be empty in purely supervised cases.\nQsT = {(x1, y1), (x2, y2),\u00b7\u00b7\u00b7 (xT , yT )} is the supervised query dataset. QuT = {x1, x2,\u00b7\u00b7\u00b7 xP}\nis the optional unlabeled query dataset. The objective of the model is to minimize the loss of its\npredictions on a query set, given the support set as input.\n\n2.2 ADVERSARIAL TRAINING\n\nThe generative adversarial networks [Goodfellow et al., 2014] framework is one of the most popular\napproaches to generative modeling. It adversarially trains two neural networks, a generator\nand a discriminator. Adversarial training has seen a vast range of applications in recent years, such\nas semi-supervised learning [Dai et al., 2017, Salimans et al., 2016], unsupervised representation\nlearning [Chen et al., 2016], imitation learning [Ho and Ermon, 2016], etc. However, few works have\nsuccessfully combined adversarial training with few-shot learning. 
[Antoniou et al., 2018] proposed\nto train a class conditioned GAN (DAGAN) to perform data augmentation. This is related to our\nproposal but is different in two aspects. 1) Their GAN model is trained separately from the classifier,\nonly to provide additional data. 2) They treat generated data as real training data of the conditioned\nclass. There are two drawbacks of this approach. First, GANs still have trouble generating realistic\nsamples in complex datasets such as ImageNet, so treating the generated images as real data in these\ndatasets is questionable. Second, DAGAN can very easily run into mode collapse. In many cases it\neasily collapses to an identity function: it simply reconstructs the input image. Our approach does\nnot require the generator to be perfect. On the contrary, similar to the semi-supervised learning case [Dai\net al., 2017], it can even benefit from an imperfect generator.\n\n\u00b3However, it is worth noting that MetaGAN can also be easily combined with other models, such as\nprototypical networks or SNAIL.\n\n3 OUR APPROACH\n\nMetaGAN is a conceptually simple and general framework for few-shot learning problems. Given\na decent K-shot N-way classifier, similar to [Salimans et al., 2016] we introduce a conditional\ngenerative model with the objective to generate samples which are not distinguishable from true data\nsampled from a specific task. We increase the dimension of the classifier output from N to N + 1, to\nmodel the probability that input data is fake. We train the discriminator (classifier) and generator in\nan adversarial setup.\nThe key idea behind MetaGAN is that imperfect generators in GAN models can provide fake data\nbetween the manifolds of different real data classes, thus providing additional training signals to\nthe classifier as well as making the decision boundaries much sharper. 
We first describe our basic\nmodel formally in section 3.1, then introduce details of different instances of MetaGAN in the following\nsections.\n\n3.1 BASIC ALGORITHM\n\nWe first introduce the basic formulation of MetaGAN here. For a few-shot N-way classification\nproblem P (T ) and dataset {T_i}_{i=1}^M, assume we have one of the state-of-the-art few-shot classifiers\npD(x;T ) = (p1(x), p2(x),\u00b7\u00b7\u00b7 pN (x)). Note that D is conditioned on a specific task T . In practice,\nthis conditioning can be either via fast adaptation [Finn et al., 2017] or feeding the support\nset as input [Snell et al., 2017, Mishra et al., 2018, Sung et al., 2018]. We augment the classifier\nwith an additional output, as done in semi-supervised learning with GANs [Salimans et al.,\n2016]: pD(x;T ) = (p1(x), p2(x),\u00b7\u00b7\u00b7 pN (x), pN +1(x)). We also train a task-conditioned generator\nG(z,T ) with generating distribution p_G^T(x) that tries to generate data for the specific task T . Then\nfor the training episode of task T we maximize the following combination of the N-way classification\nobjective and the real/fake classification objective for the discriminator:\n\n\\mathcal{L}_D^{\\mathcal{T}} = \\mathcal{L}_{supervised} + \\mathcal{L}_{unsupervised}, (1)\n\\mathcal{L}_{supervised} = \\mathbb{E}_{x,y \\sim Q^s_{\\mathcal{T}}} \\log p_D(y | x, y \\le N), (2)\n\\mathcal{L}_{unsupervised} = \\mathbb{E}_{x \\sim Q^u_{\\mathcal{T}}} \\log p_D(y \\le N | x) + \\mathbb{E}_{x \\sim p_G^{\\mathcal{T}}} \\log p_D(N + 1 | x). (3)\n\nFor the generator, we minimize the non-saturating generator loss\n\n\\mathcal{L}_G^{\\mathcal{T}}(D) = -\\mathbb{E}_{x \\sim p_G^{\\mathcal{T}}} [\\log p_D(y \\le N | x)]. (4)\n\nThen the overall objective for training MetaGAN is\n\n\\mathcal{L}_D = \\max_D \\mathbb{E}_{\\mathcal{T} \\sim P(\\mathcal{T})} \\mathcal{L}_D^{\\mathcal{T}}, (5)\n\\mathcal{L}_G = \\min_G \\mathbb{E}_{\\mathcal{T} \\sim P(\\mathcal{T})} \\mathcal{L}_G^{\\mathcal{T}}. (6)\n\n3.2 DISCRIMINATOR\n\nMetaGAN generally doesn\u2019t impose restrictions on the design of the discriminator. It can be adapted from\nalmost any state-of-the-art few-shot learner. 
We adopt two popular choices of few-shot classification\nmodels as our discriminator, MAML [Finn et al., 2017] and Relation Networks [Sung et al., 2018],\nrepresenting models that learn to fine-tune quickly and models that learn a shared embedding and metric,\nrespectively.\n\n3.2.1 METAGAN WITH MAML\n\nMAML trains a transferable initialization that is able to quickly adapt to any specific task with one\nstep of gradient descent. Formally the discriminator D(\u03b8_d) is parametrized by parameters \u03b8_d. For a\nspecific task T \u223c P (T ), we update the parameters to \u03b8'_d = \u03b8_d \u2212 \u03b1 \u2207_{\u03b8_d} \u2113_D^T according to the loss in eq. 7:\n\n\\ell_D^{\\mathcal{T}} = -\\mathbb{E}_{x,y \\sim S^s_{\\mathcal{T}}} \\log p_D(y | x, y \\le N) - \\mathbb{E}_{x \\sim S^u_{\\mathcal{T}}} \\log p_D(y \\le N | x) - \\mathbb{E}_{x \\sim p_G^{\\mathcal{T}}} \\log p_D(N + 1 | x). (7)\n\nThen we minimize the expected loss on the query set with the adapted discriminator D(\u03b8'_d) across tasks T to\ntrain the discriminator\u2019s initial parameters \u03b8_d, and we train the generator using the adapted discriminator\nD(\u03b8'_d). Finally our whole model combining MetaGAN with MAML can be trained using the loss\nintroduced in eq. 5 and eq. 6, as shown below:\n\n\\mathcal{L}_D = \\max_D \\mathbb{E}_{\\mathcal{T} \\sim P(\\mathcal{T})} \\mathcal{L}_D^{\\mathcal{T}}(\\theta'_d), (8)\n\\mathcal{L}_G = \\min_G \\mathbb{E}_{\\mathcal{T} \\sim P(\\mathcal{T})} \\mathcal{L}_G^{\\mathcal{T}}(D(\\theta'_d)). (9)\n\nWe put the detailed algorithms for training MetaGAN with the MAML model in the supplemental\nmaterial.\n\n3.2.2 METAGAN WITH RELATION NETWORK\n\nThe Relation Network (RN) is a few-shot learning model aiming to do classification via learning\na deep distance metric between images. 
MetaGAN can integrate with RN in a principled and\nstraightforward way.\nFor a specific task T \u223c P (T ), following [Sung et al., 2018] let ri,j = g\u03c8(C(f\u03c6(xi), f\u03c6(xj))), xi \u2208\nSsT , xj \u2208 QsT be the relevance score between query set image xj and support set image xi, where\ng\u03c8 is the relation module, f\u03c6 is the feature embedding network and C is the concatenation operator.\nDifferent from [Sung et al., 2018], we don\u2019t restrict ri,j to be in the range of 0 to 1; rather, we use ri,j as\nlogits in softmax classification\n\np_D(y = k | x_j) = \\frac{\\exp(r_{k,j})}{1 + \\sum_{i=1}^{N} \\exp(r_{i,j})}. (10)\n\nWe adopt the simple trick proposed in [Salimans et al., 2016] of setting the logit of the fake class to\n0, which corresponds to the constant 1 appearing in the denominator, to model pD(N + 1|x), which\nis the probability that input data is fake. Thus we can train our model, MetaGAN with RN, directly\nusing loss eq. 5 and eq. 6.\n\n3.3 GENERATOR\n\nWe use a conditional generative model to generate fake data that is close to the real data manifold in\none specific task T . To do so, we first compress the information in the task\u2019s support dataset with\na dataset encoder E into vector hT , which contains sufficient statistics for the data distribution of\ntask T . Then hT is concatenated with random noise input z to be provided as input to the generator\nnetwork. Inspired by the statistic network proposed in [Edwards and Storkey, 2017], our dataset\nencoder is composed of two modules:\nInstance-Encoder Module The Instance-Encoder is a neural network that learns a feature representation\nfor each individual data example in the dataset SsT . It maps each data example xi \u2208 SsT to\nfeature space ei = Instance-Encoder(xi).\nFeature-Aggregation Module The Feature-Aggregation module takes each embedded feature vector\nei as input and produces the representation vector hT for the whole task training set. 
Feasible\naggregation methods include average pooling, max pooling and other element-wise aggregation\noperators. We use average pooling following [Edwards and Storkey, 2017] in our MetaGAN model.\nBy integrating an Instance-Encoder module and a Feature-Aggregation Module, the instance-encoder\nis encouraged to learn a representation such that averaging different samples in the learned feature\nspace makes sense. Also, feature-aggregation makes it harder for the generator to simply reconstruct\nits inputs, a behavior that can lead to mode dropping [Che et al., 2017].\n\n3.4 LEARNING SETTINGS\n\nIn this section we show that both supervised few-shot learning and semi-supervised few-shot learning\ncan be unified in the MetaGAN framework.\nSupervised Few-Shot Learning Supervised learning is the most common learning setting of few-shot\nclassification models. For a task T \u223c P (T ), since the unlabeled sets SuT and QuT are not available,\nwe use the labeled sets SsT and QsT to replace them respectively in loss eq. 1 and eq. 7.\nSample-Level Semi-Supervised Few-Shot Learning Sample-level semi-supervised learning follows\nthe same setup as [Ren et al., 2018], where unlabeled data examples are available in each task.\nWhile our model is flexible enough to deal with different sets of unlabeled examples in the support set\nand the query set, for a task T \u223c P (T ) we only use a single unlabeled set of examples UT to follow\nthe same training scheme in [Ren et al., 2018], for a better comparison with our baseline models.\nSpecifically, for MetaGAN with MAML, we set SuT = SsT and QuT = UT . For MetaGAN with RN,\nwe set SuT = \u2205 and QuT = UT in loss eq. 1 and eq. 7.\nTask-Level Semi-Supervised Few-Shot Learning For task-level semi-supervised learning, the\ntraining dataset {T_i}_{i=1}^M consists of labeled tasks and unlabeled tasks. For labeled tasks we simply\nfollow the supervised learning setting described above. 
For unlabeled tasks, we omit the supervised\nloss term by setting QsT = \u2205 and SsT = \u2205 in loss eq. 1 and eq. 7.\nAs proposed in [Salimans et al., 2016] we adopt the \"feature matching loss\" as the generator loss LG\nin both sample-level and task-level semi-supervised few-shot learning.\n\n4 WHY DOES METAGAN WORK?\n\nIn this section, we introduce the intuition behind MetaGAN as well as theoretical justifications for it, which motivate\nvarious improvements we made to the model.\nIn a few-shot classification problem, the model tries to optimize a decision boundary for each task\nwith just a few samples in each class. Clearly this problem is impossible if no information can\nbe learned from other tasks, as there are so many possible decision boundaries to separate the few\nsamples apart and most of them will not generalize. Meta-learning tries to learn a shared strategy\nacross different tasks to form decision boundaries from few samples, in the hope that this strategy is\nable to generalize to new tasks.\nAlthough this is reasonable, there can be some problems. For example, some objects look more\nsimilar than others. It may be easier to form a decision boundary between a cat and a car than between\na cat and a dog. If the training data does not contain tasks that try to separate a cat and a dog, it may\nbe difficult to extract the correct features to separate these two classes of objects. On\nthe other hand, expecting to see all kinds of class combinations during training leads to a\ncombinatorial explosion problem.\nThis is where our proposed MetaGAN formulation helps. Just as for the case of doing semi-supervised\nlearning with GANs, we don\u2019t expect our generator to generate data that is exactly on the true data\nmanifold. Instead, it is better that the generator is able to generate data a bit off the data manifold\nof each class, cf. fig. 1. 
This forces our discriminator to learn a much sharper decision boundary.\nInstead of only learning to separate cats and dogs, the discriminator of MetaGAN is forced to learn\nnot only what real cats and dogs look like, but also to recognize fake data generated slightly off the\ncat and dog manifolds. The discriminator thus has to extract features strong enough to decide the\nboundary of the real data manifold, which helps to separate different classes. Moreover, the\nseparation between real/fake classes is independent of the class combinations selected during the\nfew-shot learning process.\nFollowing the ideas behind the theoretical justifications studied in the semi-supervised learning\nsetting, we provide similar justifications in the few-shot learning problem. We include the formal\nstatement of the assumptions in the supplemental material.\nFirst, as in [Dai et al., 2017], for a specific task T , we assume that the classifier relies on a feature\nextractor fT to perform classification. We also make the assumption that G(\u00b7;T ) is a \"separating\ncomplement generator\" (which we define in the supplemental material) for each task T . Intuitively\nthis means that the generator G(z;T ) satisfies two conditions: 1) the generator distribution p_G^T has a\nhigh density region that is disjoint with the data manifold of all classes; 2) this high density region\nof p_G^T can separate manifolds of different classes.\n\nFigure 1: Left: decision boundary without MetaGAN. Right: decision boundary with MetaGAN. We\nuse red curves to denote the decision boundary. The blue area in the figure represents class A, the green area\nrepresents class B, and the gray area represents the fake class. We use + to denote real samples and \u2212 to\ndenote generated fake samples.\n\nThen by following arguments similar to those in [Dai et al., 2017], we can prove the following:\nTheorem 1 Let GT be a separating complement generator in each task T sampled from P (T ).\nDenote ST the support set and FT the generated fake dataset. We assume our learned meta-learner\nis able to learn a classifier DT which obtains a strong correct decision boundary on the augmented\nsupport set (ST , FT ). Then if |FT | \u2192 +\u221e, DT can almost surely correctly classify all real\nsamples from the data distribution pT (x) of the task.\n\nThe theorem says that if we have a generator that is neither too good nor too bad, but can generate\ndata around the real class manifold and has a high density region that can help separate\ndifferent classes, then the generated data together with a few real data can help us determine the\ncorrect decision boundary.\n\n5 EXPERIMENTS\n\n5.1 DATASETS\n\nOmniglot is a dataset consisting of handwritten character images from 50 languages. There are 1623\nclasses of characters with 20 examples within each class. Following the training and evaluation\nprotocol used in [Vinyals et al., 2016], we downsampled all images to 28 \u00d7 28 and randomly split\nthe dataset into 1200 classes for training and 423 classes for testing. The same data augmentation\ntechniques proposed by [Santoro et al., 2016] are utilized, randomly rotating each image by a multiple\nof 90 degrees to form new classes.\nMini-Imagenet is a modified subset of the well-known ILSVRC-12 dataset, consisting of 84 \u00d7 84\ncolored images from 100 classes with 600 random samples in each class. 
We follow the same class\nsplit as in [Ravi and Larochelle, 2017], which takes 64 classes for training, 16 classes for validation and\n20 classes for testing.\n\n5.2 SUPERVISED FEW-SHOT LEARNING\n\nOn the Omniglot dataset, MetaGAN with MAML shares the same discriminator network architecture\nand most of the model hyper-parameter setup with vanilla convolutional MAML [Finn et al., 2017]. We set\nthe meta batch-size to 16 for 5-way classification and 8 for 20-way classification to fit the memory\nlimit of the GPU. For MetaGAN with RN, we batch 15 query images for each class for both 1-shot\n5-way and 5-shot 5-way classification, and we batch 5 query images for each class for 1-shot 20-way\nand 5-shot 20-way tasks. We set the meta batch-size of the MetaGAN with RN model to 1 in all our\nexperiments.\nOn the Mini-Imagenet dataset, we train our MetaGAN with the MAML model using the first-order\napproximation method with 1 gradient step as proposed in [Finn et al., 2017], due to\ncomputational cost.\nFor the conditional generator we adopt a ResNet-like architecture inspired by [Gulrajani et al., 2017]\nin both models; see more details of the architecture of the generator in the supplemental material.\n\nModel | 5-way Acc., 1-shot | 5-way Acc., 5-shot | 20-way Acc., 1-shot | 20-way Acc., 5-shot\nNeural Statistician | 98.1 | 99.5 | 93.2 | 98.1\nPrototypical Nets | 98.8 | 99.7 | 96.0 | 98.9\nMAML | 98.7 \u00b1 0.4 | 99.9 \u00b1 0.1 | 95.8 \u00b1 0.3 | 98.9 \u00b1 0.2\nOurs: MetaGAN + MAML | 99.1 \u00b1 0.3 | 99.7 \u00b1 0.21 | 96.4 \u00b1 0.27 | 98.9 \u00b1 0.18\nRelation Net | 99.6 \u00b1 0.2 | 99.8 \u00b1 0.1 | 97.6 \u00b1 0.2 | 99.1 \u00b1 0.1\nOurs: MetaGAN + RN | 99.67 \u00b1 0.18 | 99.86 \u00b1 0.11 | 97.64 \u00b1 0.17 | 99.21 \u00b1 0.1\n\nTable 1: Few-shot classification results on Omniglot.\n\nModel | 5-way Acc., 1-shot | 5-way Acc., 5-shot\nPrototypical Nets | 49.42 \u00b1 0.78 | 68.20 \u00b1 0.66\nMAML (5 gradient steps) | 48.70 \u00b1 1.84 | 63.11 \u00b1 0.92\nMAML (5 gradient steps, first order) | 48.07 \u00b1 1.75 | 63.15 \u00b1 0.91\nMAML (1 gradient step, first order) | 43.64 \u00b1 1.91 | 58.72 \u00b1 1.20\nOurs: MetaGAN + MAML (1 step, first order) | 46.13 \u00b1 1.78 | 60.71 \u00b1 0.89\nRelation Net | 50.44 \u00b1 0.82 | 65.32 \u00b1 0.7\nOurs: MetaGAN + RN | 52.71 \u00b1 0.64 | 68.63 \u00b1 0.67\n\nTable 2: Few-shot classification results on Mini-Imagenet.\n\nWe use the Adam [Kingma and Ba, 2014] optimizer with an initial learning rate of 0.001, \u03b21 = 0.5\nand \u03b22 = 0.9 to train both generator and discriminator networks. For Omniglot we decay the\nlearning rate starting from 10K batch updates, and cut it in half for every 10K following updates.\nFor Mini-Imagenet we decay the learning rate starting from 30K batch updates, and cut it in half for\nevery 10K updates.\nWe present our results of 5-way and 20-way few-shot classification for the Omniglot dataset in table\n1, and show results for the Mini-Imagenet dataset in table 2. We see that our proposed MetaGAN\nconsistently improves over baseline classifiers, and is comparable to or outperforms state-of-the-art\nperformance on the challenging Mini-Imagenet benchmark.\n\n5.3 SAMPLE-LEVEL SEMI-SUPERVISED FEW-SHOT LEARNING\n\nAs introduced in section 3.4, we evaluate the effectiveness of our proposed MetaGAN in the sample-level\nsemi-supervised few-shot learning setting, following a training and evaluation scheme\nwithout \"distractors\" similar to that proposed in [Ren et al., 2018] (we point out the differences in the\nscheme below). For the Omniglot dataset we sample 10% of the images of each class to form the\nlabeled set, and take all remaining data as the unlabeled set. 
For Mini-Imagenet we sample 40% of the\nimages of each class as the labeled set, and sample 5 images of each class for each training episode.\nNote that our model only leverages unlabeled samples during the training phase, while the refining\nmodel proposed in [Ren et al., 2018] uses unlabeled samples in both training (5 samples for each\nclass) and evaluation phases (20 samples for each class). This makes our model acquire strictly\nless information during evaluation, compared to [Ren et al., 2018]. The classifier trained with\nour proposed MetaGAN formulation is encouraged to form better decision boundaries by utilizing\nunlabeled and fake data, and does not need unlabeled samples during testing, unlike\nthe kmeans-based refining model [Ren et al., 2018], which strongly relies on the unlabeled data\nfor testing.\n\nModel | Omniglot, 1-shot 5-way | Mini-Imagenet, 1-shot 5-way | Mini-Imagenet, 5-shot 5-way\nPrototypical Nets (Supervised) | 94.62 \u00b1 0.09 | 43.61 \u00b1 0.27 | 59.08 \u00b1 0.22\nSemi-Supervised Inference (PN) | 97.45 \u00b1 0.05 | 48.98 \u00b1 0.34 | 63.77 \u00b1 0.20\nSoft k-Means | 97.25 \u00b1 0.10 | 50.09 \u00b1 0.45 | 64.59 \u00b1 0.28\nSoft k-Means+Cluster | 97.68 \u00b1 0.07 | 49.03 \u00b1 0.24 | 63.08 \u00b1 0.18\nMasked Soft k-Means | 97.52 \u00b1 0.07 | 50.41 \u00b1 0.31 | 64.39 \u00b1 0.24\nOurs: Relation Nets (Supervised) | 94.81 \u00b1 0.08 | 44.24 \u00b1 0.24 | 58.72 \u00b1 0.31\nOurs: MetaGAN + RN | 97.58 \u00b1 0.07 | 50.35 \u00b1 0.23 | 64.43 \u00b1 0.27\n\nTable 3: Sample-level Semi-Supervised Few-shot classification results on Omniglot and Mini-Imagenet.\n\nModel | Omniglot, 1-shot 5-way | Mini-Imagenet, 1-shot 5-way\nPrototypical Net (Supervised) | 93.66 \u00b1 0.09 | 42.28 \u00b1 0.32\nRelation Net (Supervised) | 93.82 \u00b1 0.07 | 43.87 \u00b1 0.20\nOurs: MetaGAN + RN | 97.12 \u00b1 0.08 | 47.43 \u00b1 0.27\n\nTable 4: Task-level Semi-Supervised 1-shot classification results on Omniglot and Mini-Imagenet.\n\nWe 
display the sample-level semi-supervised few-shot classification results on Omniglot\nand Mini-Imagenet in table 3. Though our model cannot be compared with the kmeans refining\nmodel directly as discussed above, we obtain results comparable to the state of the art on both 1-shot and\n5-shot tasks, while significantly improving over the purely supervised baseline models.\n\n5.4 TASK-LEVEL SEMI-SUPERVISED FEW-SHOT LEARNING\n\nWe proposed a new learning setting for the few-shot learning problem in section 3.4: task-level\nsemi-supervised few-shot learning. In this learning setting, existing few-shot learning models [Ravi\nand Larochelle, 2017, Sung et al., 2018, Ren et al., 2018] are unable to effectively leverage purely\nunsupervised tasks, which consist of only unlabeled samples in both support set and query set.\nTo demonstrate that our proposed MetaGAN model can successfully learn from unsupervised tasks,\nwe create new splits of the Omniglot and Mini-Imagenet datasets. For the Omniglot dataset we randomly\nsample 10% of classes from the training set as a labeled set of classes, and the remaining 90% of\nclasses as an unlabeled set of classes. For the Mini-Imagenet dataset we randomly sample 40% of classes as\nlabeled classes and the remaining 60% are unlabeled. The validation set and test set of each dataset\nremain unchanged, using all classes to evaluate the performance of models. During training time, we\nsample supervised tasks only from the labeled set of classes, and sample unsupervised tasks from the\nunlabeled set of classes. We alternate between sampled supervised tasks and sampled unsupervised\ntasks for training the MetaGAN model, while we only use sampled supervised tasks to train the\nbaseline model.\nWe show the task-level semi-supervised few-shot classification results on Omniglot and\nMini-Imagenet in table 4. 
By integrating the baseline model into the MetaGAN framework, the\nmodel effectively learned to utilize the unsupervised tasks to help the classification task, showing\nthat MetaGAN can learn transferable knowledge from totally unsupervised tasks.\n\n6 CONCLUSION\n\nWe propose MetaGAN, a simple and generic framework to boost the performance of few-shot learning\nmodels. Our approach is based on the idea that fake samples produced by the generator can help\nclassifiers learn a sharper decision boundary between different classes from a few samples.\nWe make an analogy between few-shot learning and semi-supervised learning: both of them have\nonly a few labeled data and both can benefit from an imperfect generator. Then we modified the\ntechniques used for semi-supervised learning with GANs to work in the few-shot learning scenario.\nWe give intuitive as well as theoretical justifications of the proposed approach.\nWe demonstrated the strength of our algorithm on a series of few-shot learning and semi-supervised\nfew-shot learning tasks. For future work, we plan to extend MetaGAN to the few-shot imitation\nlearning setting.\n\nACKNOWLEDGEMENT\n\nWe thank Intel Corporation for supporting our deep learning related research.\n\nReferences\nAntreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial\nnetworks. 2018. URL https://openreview.net/forum?id=S1Auv-WRZ.\n\nTong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative\nadversarial networks. In International Conference on Learning Representations, 2017.\n\nXi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel.\nInfogan: Interpretable representation learning by information maximizing generative adversarial\nnets. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in\nNeural Information Processing Systems 29, pages 2172\u20132180. 
Curran Associates, Inc., 2016.\n\nZihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Ruslan R Salakhutdinov. Good\nsemi-supervised learning that requires a bad gan. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,\nR. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing\nSystems 30, pages 6510\u20136520. Curran Associates, Inc., 2017. URL\nhttp://papers.nips.cc/paper/7229-good-semi-supervised-learning-that-requires-a-bad-gan.pdf.\n\nHarrison Edwards and Amos Storkey. Towards a Neural Statistician. 5th International Conference\non Learning Representations (ICLR 2017), 2017.\n\nChelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of\ndeep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International\nConference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages\n1126\u20131135, International Convention Centre, Sydney, Australia, 06\u201311 Aug 2017. PMLR. URL\nhttp://proceedings.mlr.press/v70/finn17a.html.\n\nIan Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,\nAaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling,\nC. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information\nProcessing Systems 27, pages 2672\u20132680. Curran Associates, Inc., 2014. URL\nhttp://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.\n\nIshaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville.\nImproved training of wasserstein gans. In Advances in Neural Information Processing Systems,\npages 5769\u20135779, 2017.\n\nJonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In D. D. Lee,\nM. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural\nInformation Processing Systems 29, pages 4565\u20134573. Curran Associates, Inc., 2016. 
URL http://papers.nips.cc/paper/6391-generative-adversarial-imitation-learning.pdf.\n\nSepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient\ndescent. In Proceedings of the International Conference on Artificial Neural Networks, ICANN\n\u201901, pages 87\u201394, London, UK, 2001. Springer-Verlag. ISBN 3-540-42486-5. URL\nhttp://dl.acm.org/citation.cfm?id=646258.684281.\n\nDiederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\nNikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive\nmeta-learner. In International Conference on Learning Representations, 2018. URL\nhttps://openreview.net/forum?id=B1DmUzWAW.\n\nTsendsuren Munkhdalai and Hong Yu. Meta networks. In Doina Precup and Yee Whye Teh,\neditors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of\nProceedings of Machine Learning Research, pages 2554\u20132563, International Convention Centre,\nSydney, Australia, 06\u201311 Aug 2017. PMLR. URL\nhttp://proceedings.mlr.press/v70/munkhdalai17a.html.\n\nTsendsuren Munkhdalai, Xingdi Yuan, Soroush Mehri, Tong Wang, and Adam Trischler. Learning\nrapid-temporal adaptations. CoRR, abs/1712.09926, 2017.\n\nSachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International\nConference on Learning Representations (ICLR), 2017.\n\nMengye Ren, Sachin Ravi, Eleni Triantafillou, Jake Snell, Kevin Swersky, Josh B. Tenenbaum, Hugo\nLarochelle, and Richard S. Zemel. Meta-learning for semi-supervised few-shot classification. In\nInternational Conference on Learning Representations, 2018. URL\nhttps://openreview.net/forum?id=HJcSzz-CZ.\n\nTim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.\nImproved techniques for training gans. In D. D. Lee, M. Sugiyama, U. V. Luxburg,\nI. Guyon, and R. 
Garnett, editors, Advances in Neural Information Processing Systems 29,\npages 2234\u20132242. Curran Associates, Inc., 2016. URL\nhttp://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf.\n\nAdam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap.\nMeta-learning with memory-augmented neural networks. In Maria Florina Balcan and Kilian Q.\nWeinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48\nof Proceedings of Machine Learning Research, pages 1842\u20131850, New York, New York, USA,\n20\u201322 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/santoro16.html.\n\nJake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning.\nIn I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,\nand R. Garnett, editors, Advances in Neural Information Processing Systems 30,\npages 4077\u20134087. Curran Associates, Inc., 2017. URL\nhttp://papers.nips.cc/paper/6996-prototypical-networks-for-few-shot-learning.pdf.\n\nFlood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales.\nLearning to compare: Relation network for few-shot learning. In Proceedings of the IEEE\nConference on Computer Vision and Pattern Recognition, 2018.\n\nSebastian Thrun. Learning to learn. chapter Lifelong Learning Algorithms, pages 181\u2013209. Kluwer\nAcademic Publishers, Norwell, MA, USA, 1998. ISBN 0-7923-8047-9. URL\nhttp://dl.acm.org/citation.cfm?id=296635.296651.\n\nOriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra.\nMatching networks for one shot learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg,\nI. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29,\npages 3630\u20133638. Curran Associates, Inc., 2016. 
URL http://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning.pdf.\n", "award": [], "sourceid": 1207, "authors": [{"given_name": "Ruixiang", "family_name": "ZHANG", "institution": "MILA"}, {"given_name": "Tong", "family_name": "Che", "institution": "MILA"}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": "Uber and University of Cambridge"}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": "U. Montreal"}, {"given_name": "Yangqiu", "family_name": "Song", "institution": "Hong Kong University of Science and Technology"}]}