{"title": "Low-shot Learning via Covariance-Preserving Adversarial Augmentation Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 975, "page_last": 985, "abstract": "Deep neural networks suffer from over-fitting and catastrophic forgetting when trained with small data. One natural remedy for this problem is data augmentation, which has been recently shown to be effective. However, previous works either assume that intra-class variances can always be generalized to new classes, or employ naive generation methods to hallucinate finite examples without modeling their latent distributions. In this work, we propose Covariance-Preserving Adversarial Augmentation Networks to overcome existing limits of low-shot learning. Specifically, a novel Generative Adversarial Network is designed to model the latent distribution of each novel class given its related base counterparts. Since direct estimation on novel classes can be inductively biased, we explicitly preserve covariance information as the ``variability'' of base examples during the generation process. Empirical results show that our model can generate realistic yet diverse examples, leading to substantial improvements on the ImageNet benchmark over the state of the art.", "full_text": "Low-shot Learning via Covariance-Preserving\n\nAdversarial Augmentation Networks\n\nHang Gao1, Zheng Shou1, Alireza Zareian1, Hanwang Zhang2, Shih-Fu Chang1\n\n1Columbia University, 2Nanyang Technological University\n\n{hg2469, zs2262, az2407, sc250}@columbia.edu\n\nhanwangzhang@ntu.edu.sg\n\nAbstract\n\nDeep neural networks suffer from over-\ufb01tting and catastrophic forgetting when\ntrained with small data. One natural remedy for this problem is data augmentation,\nwhich has been recently shown to be effective. 
However, previous works either assume that intra-class variances can always be generalized to new classes, or employ naive generation methods to hallucinate finite examples without modeling their latent distributions. In this work, we propose Covariance-Preserving Adversarial Augmentation Networks to overcome existing limits of low-shot learning. Specifically, a novel Generative Adversarial Network is designed to model the latent distribution of each novel class given its related base counterparts. Since direct estimation of novel classes can be inductively biased, we explicitly preserve covariance information as the "variability" of base examples during the generation process. Empirical results show that our model can generate realistic yet diverse examples, leading to substantial improvements on the ImageNet benchmark over the state of the art.

1 Introduction

The ability to learn new concepts from very few examples is a hallmark of human intelligence. Though constantly pushing limits forward in various visual tasks, current deep learning approaches struggle when abundant training data is impractical to gather. A straightforward way to learn new concepts is to fine-tune a model pre-trained on base categories, using limited data from another set of novel categories. However, this usually leads to catastrophic forgetting [1]: fine-tuning makes the model over-fit on novel classes and become agnostic to the majority of base classes [2, 3], deteriorating overall performance.

One way to address this problem is to augment data for novel classes. Since generating images could be both unnecessary [4] and impractical [5] on large datasets, feature augmentation [6, 7] is preferable in this scenario. 
Building upon learned representations [8, 9, 10], two recent variants of generative models show promising capability in learning variation modes from base classes to imagine the missing patterns of novel classes. Hariharan et al. proposed Feature Hallucination (FH) [11], which learns a finite set of transformation mappings between examples in each base category and directly applies them to seed novel points for extra data. However, since the mappings are enumerable (even if numerous), this model suffers from poor generalization. To address this issue, Wang et al. [12] proposed Feature Imagination (FI), a meta-learning based generation framework that trains an agent to synthesize extra data for a specific task. They circumvented the demand for the latent distribution of novel classes by end-to-end optimization, but the generation results usually collapse into certain modes. Finally, it should be noted that both works erroneously assume that intra-class variances of base classes are shareable with any novel class. For example, the visual variability of the concept lemon cannot be generalized to an unrelated category such as raccoon.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this work, we propose a new approach to low-shot learning that enables better feature augmentation beyond current limits. Our approach is novel in two aspects: modeling and training strategy. We propose Covariance-Preserving Adversarial Augmentation Networks (CP-AAN), a new class of Generative Adversarial Networks (GAN) [14, 15] for feature augmentation. 
We take inspiration from unpaired image-to-image translation [16, 17] and formulate our feature augmentation problem as an imbalanced set-to-set translation problem, where the conditional distribution of examples of each novel class can be conceptually expressed as a mixture of related base classes. We first extract all related base-novel class pairs by an intuitive yet effective approach called Neighborhood Batch Sampling. Then, our model aims to learn the latent distribution of each novel class given its base counterparts. Since the direct estimation of novel classes can be inductively biased during this process, we explicitly preserve the covariance of base examples during the generation process.

Figure 1: Conceptual illustration of our method. Given an example from a novel class, we translate examples from related base classes into the target class for augmentation. Image by [13].

We systematically evaluate our approach by considering a series of objective functions. Our model achieves state-of-the-art performance on the challenging ImageNet benchmark [18]. With ablation studies, we also demonstrate the effectiveness of each component of our method.

2 Related Works

Low-shot Learning For quick adaptation when very few novel examples are available, the community has often used a meta-agent [19] to further tune base classifiers [8, 9, 10]. Intuitive yet often ignored, feature augmentation was recently brought into the field by Hariharan et al. [11] to ease the data-scarce scenario. Compared to traditional meta-learning based approaches, they reported noticeable improvements not only in the conventional setting (i.e., testing on novel examples only), but also in the more challenging generalized setting (i.e., testing on all classes). 
Yet the drawback is that both the original work and its variants [12] fail to synthesize diverse examples because of ill-constrained generation processes. Our approach falls in this line of research while seeking more principled guidance from base examples in a selective, class-specific manner.

Generative Adversarial Network for Set-to-set Translation GANs [14] map each latent code from an easily sampled prior to a realistic sample of a complex target distribution. Zhu et al. [16] have achieved astounding results on image-to-image translation without any paired training samples. In our case, diverse feature augmentation is feasible through conditional translation given a pair of related novel and base classes. Yet two main challenges remain: first, not all examples are semantically translatable; second, given extremely scarce data for novel classes, we are unable to estimate their latent distributions (see Figure 4). In this work, we thoroughly investigate conditional GAN variants inspired by previous works [5, 15, 17, 20] to enable low-shot generation. Furthermore, we introduce a novel batch sampling technique for learning salient set-to-set mappings using unpaired data with categorical conditions.

Generation from Limited Observations Estimating a latent distribution from a handful of observations is biased and inaccurate [21, 22]. Bayesian approaches model the latent distributions of a variety of classes as a hierarchical Gaussian mixture [23], or alternatively model generation as a sequential decision-making process [24]. For GANs, Gaussian mixture noise has also been incorporated for latent code sampling [25]. Recent works [26, 27] on integral probability metrics provide theoretical guidance towards high-order feature matching. 
In this paper, building upon the assumption that related classes should have similar intra-class variance, we introduce a new loss term for preserving covariance during the translation process.

Figure 2: Imbalanced set-to-set translation and our motivations. Examples of three base classes are visualized in a semantic space learned by Prototypical Networks [9], along with their centroids as class prototypes. (a) Problem statement: given a novel example x, our goal is to translate base examples into the novel class, to reconstruct an estimation of the novel class distribution. (b) Hariharan et al. [11]: Feature Hallucination randomly applies transformation mappings, between sampled base pairs in the same class, to the seed novel example for extra data. (c) Our intuitions: instead, we only refer to semantically similar base-novel class pairs and model the distribution of data for novel classes by preserving base intra-class variances.

3 Imbalanced Set-to-set Translation

In this section, we formulate our low-shot feature augmentation problem under an imbalanced set-to-set translation framework. Concretely, we are given two labeled datasets represented in the same D-dimensional semantic space: (1) a base set B = {(x_b, y_b) | x_b ∈ R^D, y_b ∈ Y_b} consisting of abundant samples, and (2) a novel set N = {(x_n, y_n) | x_n ∈ R^D, y_n ∈ Y_n} with only a handful of observations. Their discrete label spaces are assumed to be non-overlapping, i.e., Y_b ∩ Y_n = ∅. Our goal is to learn a mapping function G_n : B → N in order to translate examples of the base classes into novel categories. After the generation process, a final classifier is trained using both the original examples of the base classes and all (mostly synthesized) examples of the novel classes.

Existing works [11, 12] suffer from the use of arbitrary, and thus possibly unrelated, base classes for feature augmentation. 
Moreover, their performances are degraded by naive generation methods that do not model the latent distribution of each novel class. Our insight, conversely, is to sample extra features from continuous latent distributions rather than certain modes from enumerations, by learning a GAN model (see Figure 2).

Specifically, we address two challenges that impede good translation under imbalanced scenarios: (1) through which base-novel class pairs we can translate; and, more fundamentally, (2) through what objectives for GAN training we can estimate the latent distribution of novel classes with limited observations. We start by proposing a straightforward batch sampling technique to address the first problem. Then we suggest a simple extension of existing methods and study its weakness, which motivates the development of our final approach. For clarity, we introduce a toy dataset for imbalanced set-to-set translation in Figure 3 as a conceptual demonstration of the proposed method compared to baselines.

3.1 Neighborhood Batch Sampling

It is widely acknowledged [28, 8, 9] that a metric-learned high-dimensional space encodes relational semantics between examples. Therefore, to define which base classes are translatable to a novel class, we can rank them by their distance in a semantic space. For simplicity, we formulate our approach on top of Prototypical Networks [9], learned by the nearest neighbor classifier on the semantic space measured by the Euclidean distance. We represent each class y as a cluster and encode its categorical information by the cluster prototype l_y ∈ R^D:

l_y = ( Σ_i x_i · 1[y_i = y] ) / ( Σ_i 1[y_i = y] )    (1)

It should be noted that by "prototype" we mean the centroid of examples of a class. 
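Concretely, Eq. (1) is just a per-class mean in the embedding space; a minimal NumPy sketch (array shapes and helper names are ours, not the paper's):

```python
import numpy as np

def class_prototypes(x, y):
    """Compute l_y, the centroid of all examples of each class (Eq. 1).

    x: (N, D) array of embeddings; y: (N,) integer labels.
    Returns a dict mapping each label to its D-dimensional prototype.
    """
    return {c: x[y == c].mean(axis=0) for c in np.unique(y)}

# Toy check: two classes in 2-D.
x = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [1.0, 3.0]])
y = np.array([0, 0, 1, 1])
protos = class_prototypes(x, y)
# prototype of class 0 is the mean of its two points: [1.0, 0.0]
```

In the paper the same averaging is applied over all examples of a class, rather than per-episode subsets.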
It should not be confused with the centroid of randomly sampled examples that is computed in each episode to train the original Prototypical Networks.

We introduce a translation mapping R : Y_n → P(Y_b), where P(Y_b) is the power set of the collection of all base classes. This defines a many-to-many relationship between novel and base classes, and is used to translate data from selected base classes to each novel class. To this end, given a novel class y_n, we compute its similarity scores α with all base classes y_b using a softmax over Euclidean distances between prototypes,

α(y_b, y_n) = exp( −‖l_{y_b} − l_{y_n}‖²_2 ) / Σ_{y'_b ∈ Y_b} exp( −‖l_{y'_b} − l_{y_n}‖²_2 )    (2)

This results in a soft mapping (NBS-S) between base and novel classes, in which each novel class is paired with all base classes with soft scores. In practice, translating from all base classes is unnecessary and computationally expensive. Alternatively, we consider a hard version of R based on k-nearest neighbor search, where the top k base classes are selected and treated as equal (α(y_b, y_n) = 1/k). This hard mapping (NBS-H) saves memory, but introduces an extra hyper-parameter.

Figure 3: Generation results on our toy dataset. (a): raw distribution of the "spiral" dataset, which consists of two base classes (top, bottom) and two novel classes (left, right). Novel classes are colored with lower saturation to indicate that they are not available for training; instead, only 4 examples (black crosses) are available. Generated samples are colored with higher saturation than the real data. (b): c-GAN; note that we also show the results of translating synthesized novel samples back to the original base classes with another decoupled c-GAN, for visual consistency with the other variants. (c): cCyc-GAN; (d): cDeLi-GAN; (e): cCov-GAN. Results are best viewed in color with zoom.

3.2 Adversarial Objective

After constraining our translation process to selected class pairs, we develop a baseline based on Conditional GAN (c-GAN) [15]. To this end, a discriminator D_n is trained to classify real examples as the corresponding N = |Y_n| novel classes, and to classify synthesized examples as an auxiliary "fake" class [5]. The generator G_n takes an example from the base classes R(y_n) that are paired with y_n via NBS, and aims to fool the discriminator into classifying the generated example as y_n instead of "fake". More specifically, the adversarial objective can be written as:

L_adv(G_n, D_n, B, N) = E_{y_n ~ Y_n} [ E_{x_n ~ N_{y_n}} [ log D_n(y_n | x_n) ] + E_{(x_b, y_b) ~ B_{R(y_n)}} [ α(y_b, y_n) log D_n(N + 1 | G_n(y_n; x_b, y_b)) ] ]    (3)

where N_{y_n} consists of all novel examples labeled with y_n in N, while B_{R(y_n)} consists of all base examples labeled by one of the classes in R(y_n).

We train c-GAN by solving the minimax game of the adversarial loss. In this scenario, there is no explicit way to incorporate the intra-class variance of base classes into the generation of new novel examples. Also, any mapping that collapses synthesized features into existing observations yields an optimal solution [14]. These facts lead to unfavorable generation results, as shown in Figure 3b. We next explore different ways to explicitly force the generator to learn the latent conditional distributions.

Figure 4: The importance of covariance in low-shot settings. Suppose we have access to a single cat image during training. (a): conventional models can easily fail since there are infinite candidate distributions that cannot be discriminated; (b): related classes should have similar intra-class variances.
Thus, we preserve covariance information during translation, to transfer knowledge from base classes to novel ones.

3.3 Cycle-consistency Objective

A natural idea for preventing modes from being dropped is to apply the cycle-consistency constraint, whose effectiveness has been proven on image-to-image translation tasks [16]. Besides providing extra supervision, it eliminates the demand for paired data, which is impossible to acquire in the low-shot learning setting. We extend this method to our conditional scenario and derive cCyc-GAN. Specifically, we learn two generators: G_n, which is our main target, and G_b : N → B as an auxiliary mapping that reinforces G_n. We train the generators such that the translation cycle recovers the original embedding in either a forward cycle N → B → N or a backward cycle B → N → B. Our cycle-consistency objective can then be derived as

L_cyc(G_n, G_b) = E_{y_n ~ Y_n} [ E_{x_n ~ N_{y_n}, (x_b, y_b) ~ B_{R(y_n)}, z ~ Z} α(y_b, y_n) ( ‖x_n − G_n(y_n; G_b(y_b; x_n, y_n, z), y_b)‖²_2 + ‖x_b − G_b(y_b; G_n(y_n; x_b, y_b), y_n, z)‖²_2 ) ]    (5)

where a Z-dimensional noise vector z sampled from a distribution Z is injected into G_b's input, since the novel examples x_n lack variability given the very limited amount of data. Z is the standard normal distribution N(0, 1) for our cCyc-GAN model.

While G_n is hard to train due to the extremely small data volume, G_b has more data to learn from, and can thus indirectly guide G_n through its gradient. During our experiments, we found that cycle-consistency is indispensable for stabilizing the training procedure. Swaminathan et al. [25] observe that incorporating extra noise from a mixture of Gaussian distributions could result in more diverse results. 
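Sampling from such a Gaussian mixture is a two-step reparameterization: pick a component, then perturb its mean. A minimal sketch following the initialization described for this variant (function names are ours; we treat the sampled σ as a per-dimension scale, whose sign is irrelevant since ε is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mixture(C, Z):
    """Initialize a C-component diagonal Gaussian mixture over Z-dim noise:
    mu ~ U(-1, 1), sigma ~ N(0, 0.2)."""
    mu = rng.uniform(-1.0, 1.0, size=(C, Z))
    sigma = rng.normal(0.0, 0.2, size=(C, Z))
    return mu, sigma

def sample_noise(mu, sigma, batch_size):
    """Pick a component uniformly for each draw, then reparameterize
    z = mu_i + sigma_i * eps with eps ~ N(0, I)."""
    C, Z = mu.shape
    idx = rng.integers(0, C, size=batch_size)
    eps = rng.normal(size=(batch_size, Z))
    return mu[idx] + sigma[idx] * eps

mu, sigma = init_mixture(C=50, Z=100)
z = sample_noise(mu, sigma, batch_size=8)  # shape (8, 100)
```

In the full model this z is fed to G_b alongside the conditioning information.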
Hence, we also report a variant called cDeLi-GAN, which uses the same objective as cCyc-GAN but samples the noise vector z from a mixture of C different Gaussian distributions, i.e., Z is defined (in distribution) by the mixture density

p(z) = (1/C) Σ_{i=1}^{C} f(z | µ_i, Σ_i),  where  f(z | µ, Σ) = exp( −(1/2)(z − µ)^T Σ^{−1} (z − µ) ) / √( (2π)^Z |Σ| )    (7)

We follow the initialization setup of previous work [25]. Each µ is sampled from a uniform distribution U(−1, 1). For each Σ, we first sample a vector σ from a Gaussian distribution N(0, 0.2), then simply set Σ = diag(σ).

Generation results of the two aforementioned methods are shown in Figures 3c and 3d. Both methods improve the diversity of generation compared to the naive c-GAN, yet they either under- or over-estimate the intra-class variance.

3.4 Covariance-preserving Objective

While cycle-consistency alone can transfer certain degrees of intra-class variance from base classes, we find it rather weak and unreliable, since there are still infinite candidate distributions that cannot be discriminated based on limited observations (see Figure 4).

Building upon the assumption that similar classes share similar intra-class variance, one straightforward idea is to penalize the change of "variability" during translation. Hierarchical Bayesian models [23] prescribe each class as a multivariate Gaussian, where intra-class variability is embedded in a covariance matrix. We generalize this idea and try to maintain covariance in the translation process, although we model the class distribution by a GAN instead of any prescribed distribution.

To compute the difference between two covariance matrices [26], one typical way is to measure the worst-case distance between them using the Ky Fan m-norm, i.e., the sum of singular values of the m-truncated SVD, which we denote as ‖[·]_m‖_*. 
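The Ky Fan m-norm, and a subgradient of it, can be sketched directly from a truncated SVD (NumPy sketch; helper names are ours):

```python
import numpy as np

def ky_fan_m(X, m):
    """Ky Fan m-norm: sum of the m largest singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[:m].sum()

def ky_fan_m_subgrad(X, m):
    """A subgradient of the Ky Fan m-norm at X: U_m @ V_m^T, built from
    the top-m left and right singular vectors."""
    U, s, Vt = np.linalg.svd(X)
    return U[:, :m] @ Vt[:m, :]

A = np.diag([3.0, 2.0, 1.0])
# singular values of A are 3, 2, 1, so its Ky Fan 2-norm is 5
```

For a diagonal matrix like `A`, the subgradient reduces to a diagonal 0/1 mask selecting the top-m directions.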
To this end, we define the pseudo-prototype l̂_{y_n} of each novel class y_n as the centroid of all synthetic samples x̂_n = G_n(y_n; x_b, y_b) translated from related base classes. The covariance distance d_cov(y_b, y_n) between a base-novel class pair can then be formulated as

d_cov(y_b, y_n) = ‖[ Σ_x(P_{y_b}) − Σ_G(P_{y_n}) ]_m‖_*,    (8)

where

Σ_x(P_y) = ( Σ_i (x_i − l_{y_i})(x_i − l_{y_i})^T · 1[y_i = y] ) / ( Σ_i 1[y_i = y] ),
Σ_G(P_y) = ( Σ_j (x̂_j − l̂_{y_j})(x̂_j − l̂_{y_j})^T · 1[y_j = y] ) / ( Σ_j 1[y_j = y] ).

Consequently, our covariance-preserving objective can be written as the expectation of the weighted covariance distance using NBS-S,

L_cov(G_n) = E_{y_n ~ Y_n} [ E_{y_b ~ R(y_n)} [ α(y_b, y_n) d_cov(y_b, y_n) ] ]    (9)

Note that, for a matrix X, ‖[X]_m‖_* is non-differentiable with respect to X; thus, in practice, we calculate its subgradient instead. Specifically, we first compute the unitary matrices U, V by m-truncated SVD [29], and then back-propagate U V^T for sequential parameter updates. A proof of correctness is provided in the supplementary material.

Finally, we propose our covariance-preserving conditional cycle-GAN, cCov-GAN, as:

G*_n = arg min_{G_n, G_b} max_{D_n, D_b} [ L_adv(G_n, D_n, B, N) + L_adv(G_b, D_b, N, B) + λ_cyc L_cyc(G_n, G_b) + λ_cov L_cov(G_n) ]    (10)

As illustrated in Figure 3e, preserving covariance information from relevant base classes to a novel class can improve low-shot generation quality. 
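The covariance distance of Eq. (8) can be sketched end-to-end in a few lines (NumPy; helper names are ours, and for brevity we center each set on its own mean, standing in for the prototype and pseudo-prototype):

```python
import numpy as np

def class_covariance(x, proto):
    """Empirical covariance of examples x (N, D) around the class
    prototype proto (the inner terms of Eq. 8)."""
    d = x - proto
    return d.T @ d / len(x)

def cov_distance(x_base, x_synth, m):
    """d_cov: Ky Fan m-norm (sum of top-m singular values) of the gap
    between the base covariance and the synthesized-novel covariance."""
    cb = class_covariance(x_base, x_base.mean(axis=0))
    cg = class_covariance(x_synth, x_synth.mean(axis=0))
    s = np.linalg.svd(cb - cg, compute_uv=False)
    return s[:m].sum()

rng = np.random.default_rng(0)
x_base = rng.normal(size=(200, 8))
# identical point sets give exactly zero covariance distance
assert cov_distance(x_base, x_base.copy(), m=4) < 1e-9
```

Minimizing this quantity over the generator's outputs is what pulls the synthesized "variability" toward that of the related base class.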
We attribute this empirical result to the interplay of adversarial learning, cycle consistency, and covariance preservation, which respectively lead to realistic generation, semantic consistency, and diversity.

3.5 Training

Following recent works on meta-learning [30, 11, 12], we design a two-stage training procedure. During the "meta-training" phase, we train our generative model with base examples only, mimicking the low-shot scenario it will encounter later. After that, in the "meta-testing" phase, we are given novel classes as well as their low-shot examples. We use the trained G_n to augment each class until it has the average capacity of the base classes. Then we train a classifier as one would normally do in a supervised setting, using both real and synthesized data. For this final classifier, we apply the same one as in the original representation learning stage: for example, we use the nearest neighbor classifier for embeddings from Prototypical Networks, and a standard linear classifier for those from ResNets.

We follow the episodic procedure used by [12] during meta-training. In each episode, we sample N_b "meta-novel" classes from B, and use the rest of B as "meta-base" classes. Then we sample K_b examples from each meta-novel class as meta-novel examples. We compute the prototype of each class and the similarity scores between each "meta-novel" and "meta-base" class. To sample a batch of size B, we first include all "meta-novel" examples, and sample B − N_b · K_b examples uniformly from the "meta-base" classes retrieved by the translation mapping R. Next, we push our samples through the generators and discriminators to compute the loss. Finally, we update the weights for the current episode and start the next one.

4 Experiments

This section is organized as follows. 
In Section 4.1, we conduct low-shot learning experiments on the challenging ImageNet benchmark. In Section 4.2, we further discuss ablations, both quantitatively and qualitatively, to better understand the performance gain. We demonstrate our model's capacity to generate diverse and reliable examples, and its effectiveness in low-shot classification.

Dataset We evaluate our method on the real-world benchmark proposed by Hariharan et al. [11]. This is a challenging task because it requires us to learn a large variety of ImageNet [18] categories given only a few exemplars for each novel class. To this end, our model must be able to capture the visual diversity of a wide range of categories and transfer knowledge between them without confusing unrelated classes. Following [11], we split the 1000 ImageNet classes into four disjoint class sets Y_b^val, Y_n^val, Y_b^test, Y_n^test, which consist of 193, 300, 196, and 311 classes respectively. All of our parameter tuning is done on the validation splits, while final results are reported on the held-out test splits.

Evaluation We repeat sampling novel examples five times for the held-out novel sets and report mean top-5 accuracy in both conventional low-shot learning (LSL, testing on novel classes only) and its generalized setting (GLSL, testing on all categories including base classes).

Baselines We compare our results to the exact numbers reported by Feature Hallucination [11] and Feature Imagination [12]. We also compare to other non-generative methods, including classical Siamese Networks [31], Prototypical Networks [9], Matching Networks [8], and MAML [32], as well as the more recent Prototypical Matching Networks [12] and Attentive Weight Generators [33]. 
For stricter comparison, we provide two extra baselines to exclude the bias induced by different embedding methods: P-FH builds on Feature Hallucination by substituting its non-episodic representation with learned prototypical features; another baseline (first row in Table 1), on the contrary, replaces prototypical features with raw ResNet-10 embeddings. The results for MAML and SN are reported using their published codebases online.

Implementation details Our implementation is based on PyTorch [34]. Since deeper networks would unsurprisingly result in better performance, we confine all experiments to a ResNet-10 backbone¹ with a 512-d output layer. We fine-tune the backbone following the procedure described in [11]. For all generators, we use three-layer MLPs with all hidden layers' dimensions fixed at 512, as well as their output for synthesized features. Our discriminators are accordingly designed as three-layer MLPs that predict probabilities over the target classes plus an extra fake category. We use leaky ReLU with slope 0.1 and no batch normalization. Our GAN models are trained for 100,000 episodes with ADAM [35], with the initial learning rate fixed at 0.0001 and annealed by 0.5 every 20,000 episodes. We fix the hyper-parameter m = 10 for computing the truncated SVD. For the loss term contributions, we set λ_cyc = 5 and λ_cov = 0.5 in all final objectives. We choose Z = 100 as the dimension of the noise vectors for G_b's input, and C = 50 for the Gaussian mixture. We inject prototype embeddings instead of one-hot vectors as categorical information for all networks (prototypes for novel classes are computed using the low-shot examples only). We empirically set the batch size B = 1000, with N_b = 20 and K_b = 10 for all training, regardless of the number of shots at test time. 
This is more efficient but possibly less accurate than [9], who trained separate models for each testing scenario so that the number of shots at train and test always match. All hyper-parameters are cross-validated on the validation set using a coarse grid search.

¹Released on https://github.com/facebookresearch/low-shot-shrink-hallucinate

4.1 Main Results

For comparison, we include numbers reported in previous works under the same experimental settings. Note that the results for MAML and SN are reported using their published codebases online. We decompose each method into stage-wise operations to break the performance gain down to the detailed choices made in each stage.

Table 1: Low-shot classification top-5 accuracy (%) of all compared methods under the LSL and GLSL settings on the ImageNet dataset. All results are averaged over five trials; standard deviations are omitted since all are on the order of 0.1%. The best and second best methods under each setting are marked in according formats.

| Method | Representation | Generation | LSL K=1 | 2 | 5 | 10 | 20 | GLSL K=1 | 2 | 5 | 10 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | ResNet-10 [36] | - | 38.5 | 51.2 | 64.7 | 71.6 | 76.3 | 40.6 | 49.8 | 64.3 | 72.1 | 76.7 |
| SN [31] | - | - | 38.9 | - | 64.6 | - | 76.4 | 48.7 | - | 68.3 | - | 73.8 |
| MAML [32] | - | - | 39.2 | - | 64.2 | - | 76.8 | 49.5 | - | 69.6 | - | 74.2 |
| PN [9] | - | - | 39.4 | 52.2 | 66.6 | 72.0 | 76.5 | 49.3 | 61.0 | 69.6 | 72.8 | 74.7 |
| MN [8] | - | - | 43.6 | 54.0 | 66.0 | 72.5 | 76.9 | 54.4 | 61.0 | 69.0 | 73.7 | 76.5 |
| PMN [12] | - | - | 43.3 | 55.7 | 68.4 | 74.0 | 77.0 | 55.8 | 63.1 | 71.1 | 75.0 | 77.1 |
| AWG [33] | - | - | 46.0 | 57.5 | 69.2 | 74.8 | 78.1 | 58.2 | 65.2 | 72.7 | 76.5 | 78.7 |
| FH [11] | ResNet-10 | LR w/ A. | 40.7 | 50.8 | 62.0 | 69.3 | 76.4 | 52.2 | 59.7 | 68.6 | 73.3 | 76.9 |
| P-FH | PN | LR w/ A. | 41.5 | 52.2 | 63.5 | 71.8 | 76.4 | 53.6 | 61.7 | 69.0 | 73.5 | 75.9 |
| FI [12] | PN | meta-learned LR | 45.0 | 55.9 | 67.3 | 73.0 | 76.5 | 56.9 | 63.2 | 70.6 | 74.5 | 76.5 |
| FI [12] | PMN | meta-learned LR | 45.8 | 57.8 | 69.0 | 74.3 | 77.4 | 57.6 | 64.7 | 71.9 | 75.2 | 77.5 |
| CP-AAN (Ours) | ResNet-10 | cCov-GAN | 47.1 | 57.9 | 68.9 | 76.0 | 79.3 | 52.1 | 60.3 | 69.2 | 72.4 | 76.8 |
| CP-AAN (Ours) | PN | c-GAN | 38.6 | 51.8 | 64.9 | 71.9 | 76.2 | 49.4 | 61.5 | 69.7 | 73.0 | 75.1 |
| CP-AAN (Ours) | PN | cCyc-GAN | 42.5 | 54.6 | 66.7 | 74.3 | 76.8 | 57.6 | 65.1 | 72.2 | 73.9 | 76.0 |
| CP-AAN (Ours) | PN | cDeLi-GAN | 46.0 | 58.1 | 68.8 | 74.6 | 77.4 | 58.0 | 65.1 | 72.4 | 74.8 | 76.9 |
| CP-AAN (Ours) | PN | cCov-GAN | 48.4 | 59.3 | 70.2 | 76.5 | 79.3 | 58.5 | 65.8 | 73.5 | 76.0 | 78.1 |

LR w/ A.: Logistic Regressor with Analogies.

We provide four models constructed with different GAN choices, as justified in Section 3. All of our introduced CP-AAN approaches are trained with NBS-S, which is further investigated with ablation in the next subsection. Results are shown in Table 1. Our best method consistently achieves significant improvements over previous augmentation-based approaches for different values of K under both the LSL and GLSL settings, achieving almost 2% performance gain compared to baselines. We also notice that, apart from the overall improvement, our best model achieves its largest boost (~9%) at the lowest shot over the naive baseline, and 2.6% over Feature Imagination (FI) [12] under the LSL setting, even though we use a simpler embedding technique (PN compared to their PMN). 
We believe such performance gains can be attributed to our advanced generation method: at low shots, FI applies the discrete transformations its generator has previously learned, while we can sample from a smooth distribution combining the covariance information of all related base classes.

Note that in the LSL setting, all generative methods assume we still have access to the original base examples when learning the final classifiers, while non-generative baselines usually do not have this constraint.

4.2 Discussions

In this subsection, we carefully examine our design choices for the final version of our CP-AAN. We start by unpacking the performance gain over the standard batch sampling procedure, and proceed by showing both quantitative and qualitative evaluations of generation quality.

Ablation on NBS To validate the effectiveness of the NBS strategy over standard batch sampling for feature augmentation, we conduct an ablation study to show our absolute performance gain in Figure 5a. In general, we empirically demonstrate that applying NBS improves the performance of low-shot recognition. We also show that the performance of NBS-H is sensitive to the hyper-parameter k in the k-nearest neighbor search. Therefore, the soft assignment is preferable if computational resources allow.

Quantitative Generation Quality We next quantitatively evaluate the generation quality of the variants introduced in Section 3 and of previous works, as shown in Figure 5b. Note that for FH, we used their published codebase online; for FI, we implemented the network and trained it with the procedure described in the original paper. We measure the diversity of generation via the mean average pairwise Euclidean distance of generated examples within each novel class. We adopt the same augmentation strategies as used for the ImageNet experiments. For reference, the mean average Euclidean distance over real examples is 0.163. In summary, the results are consistent with our expectation and support 
In summary, the results are consistent with our expectations and support our design choices.

Figure 5: Ablation analysis. (a): Unpacked performance gain for each NBS strategy; (b): accuracy vs. diversity; (c, d): Feature Hallucination and Feature Imagination lack diverse modes; (e): our best method can synthesize both diverse and realistic embeddings. Results are best viewed in color with zoom.

Feature Hallucination and Feature Imagination show less diversity than the real data. The naive c-GAN even under-performs those baselines due to mode collapse. Cycle-consistency and Gaussian mixture noise do help generation in both accuracy and diversity; however, they either under- or over-estimate the diversity. Our covariance-preserving objective leads to the best hallucination quality, since its generated distribution most closely resembles the real data diversity. Another insight from Figure 5b is that, not surprisingly, under-estimating data diversity is more detrimental to classification accuracy than over-estimating it.

Qualitative Generation Quality
Figures 5c, 5d, and 5e show t-SNE [37] visualizations of the data generated by Feature Hallucination, Feature Imagination, and our best model in the prototypical feature space. We fix the number of examples per novel class at K = 5 in all cases and plot the real distribution of each class as a translucent point cloud. The 5 real examples are plotted as crosses and synthesized examples are denoted by stars. Evidently, naive generators could only synthesize novel examples that are largely pulled together.
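A 2-D projection of this kind can be reproduced with scikit-learn's TSNE; the class count, feature dimensionality, and random features below are illustrative stand-ins for the prototypical embeddings, not the paper's actual data.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for prototypical-space features: 3 novel classes with
# 60 examples each (a mix of real seeds and synthesized points).
feats = []
for _ in range(3):
    center = rng.normal(scale=5.0, size=64)          # class mode
    feats.append(center + rng.normal(size=(60, 64)))  # spread around it
X = np.vstack(feats)  # shape (180, 64)

# Project to 2-D; perplexity must stay below the sample count.
emb = TSNE(n_components=2, perplexity=20, init="pca",
           random_state=0).fit_transform(X)
print(emb.shape)  # (180, 2)
```

Coloring `emb` by class and overlaying seeds as crosses and generated points as stars would reproduce the style of Figure 5c-5e; as noted above, t-SNE can exaggerate clustering, so it complements rather than replaces the quantitative diversity measure.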
Although t-SNE may visually drag similar high-dimensional points towards one mode, our model shows more diverse generation results that are better aligned with the latent distribution, improving overall recognition performance by spreading the seed examples in meaningful directions.

5 Conclusion

In this paper, we have presented a novel approach to low-shot learning that augments data for novel classes by training a cyclic GAN model, while shaping intra-class variability through similar base classes. We introduced and compared several GAN variants in a logical progression and demonstrated the increasing performance of each variant. Our proposed model significantly outperforms the state of the art on the challenging ImageNet benchmark in various settings. Quantitative and qualitative evaluations show the effectiveness of our method in generating realistic and diverse data for low-shot learning, given very few examples.

Acknowledgments This work was supported by the U.S. DARPA AIDA Program No. FA8750-18-2-0014. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

[1] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

[2] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.
Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[3] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. arXiv preprint arXiv:1708.06977, 2017.

[4] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. arXiv preprint arXiv:1712.00981, 2017.

[5] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[6] Relja Arandjelović and Andrew Zisserman. Three things everyone should know to improve object retrieval. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2911–2918. IEEE, 2012.

[7] Ying-Cong Chen, Xiatian Zhu, Wei-Shi Zheng, and Jian-Huang Lai. Person re-identification by camera correlation aware feature augmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2):392–408, 2018.

[8] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[9] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4080–4090, 2017.

[10] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. arXiv preprint arXiv:1711.06025, 2017.

[11] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. arXiv preprint arXiv:1606.02819, 2016.

[12] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data.
arXiv preprint arXiv:1801.05401, 2018.

[13] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation networks. arXiv preprint arXiv:1804.04732, 2018.

[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[15] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[16] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

[17] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.

[18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[19] David Lei, Michael A Hitt, and Richard Bettis. Dynamic core competences through meta-learning and strategic context. Journal of Management, 22(4):549–569, 1996.

[20] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.

[21] James Tobin. Estimation of relationships for limited dependent variables. Econometrica: Journal of the Econometric Society, pages 24–36, 1958.

[22] Emmanuel Candes, Terence Tao, et al. The Dantzig selector: Statistical estimation when p is much larger than n.
The Annals of Statistics, 35(6):2313–2351, 2007.

[23] Ruslan Salakhutdinov, Joshua Tenenbaum, and Antonio Torralba. One-shot learning with a hierarchical nonparametric Bayesian model. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 195–206, 2012.

[24] Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106, 2016.

[25] Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and V Babu Radhakrishnan. DeLiGAN: Generative adversarial networks for diverse and limited data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, 2017.

[26] Youssef Mroueh, Tom Sercu, and Vaibhava Goel. McGAN: Mean and covariance feature matching GAN. arXiv preprint arXiv:1702.08398, 2017.

[27] Youssef Mroueh and Tom Sercu. Fisher GAN. In Advances in Neural Information Processing Systems, pages 2510–2520, 2017.

[28] Jacob Goldberger, Geoffrey E Hinton, Sam T Roweis, and Ruslan R Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems, pages 513–520, 2005.

[29] Peiliang Xu. Truncated SVD methods for discrete linear ill-posed problems. Geophysical Journal International, 135(2):505–514, 1998.

[30] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016.

[31] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.

[32] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[33] Spyros Gidaris and Nikos Komodakis.
Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.

[34] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[35] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[37] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.