{"title": "Unsupervised Meta-Learning for Few-Shot Image Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 10132, "page_last": 10142, "abstract": "Few-shot or one-shot learning of classifiers requires a significant inductive bias towards the type of task to be learned. One way to acquire this is by meta-learning on tasks similar to the target task. In this paper, we propose UMTRA, an algorithm that performs unsupervised, model-agnostic meta-learning for classification tasks.\n The meta-learning step of UMTRA is performed on a flat collection of unlabeled images. While we assume that these images can be grouped into a diverse set of classes and are relevant to the target task, no explicit information about the classes or any labels are needed. UMTRA uses random sampling and augmentation to create synthetic training tasks for the meta-learning phase. Labels are only needed at the final target task learning step, and as few as one labeled sample per class can suffice.\n On the Omniglot and Mini-Imagenet few-shot learning benchmarks, UMTRA outperforms every tested approach based on unsupervised learning of representations, while alternating for the best performance with the recent CACTUs algorithm. Compared to supervised model-agnostic meta-learning approaches, UMTRA trades off some classification accuracy for a reduction in the required labels of several orders of magnitude.", "full_text": "Unsupervised Meta-Learning for Few-Shot Image Classification\n\nSiavash Khodadadeh, Ladislau B\u00f6l\u00f6ni\nDept. of Computer Science\nUniversity of Central Florida\nsiavash.khodadadeh@knights.ucf.edu, lboloni@cs.ucf.edu\n\nMubarak Shah\nCenter for Research in Computer Vision\nUniversity of Central Florida\nshah@crcv.ucf.edu\n\nAbstract\n\nFew-shot or one-shot learning of classifiers requires a significant inductive bias towards the type of task to be learned. 
One way to acquire this is by meta-learning on tasks similar to the target task. In this paper, we propose UMTRA, an algorithm that performs unsupervised, model-agnostic meta-learning for classification tasks. The meta-learning step of UMTRA is performed on a flat collection of unlabeled images. While we assume that these images can be grouped into a diverse set of classes and are relevant to the target task, no explicit information about the classes or any labels are needed. UMTRA uses random sampling and augmentation to create synthetic training tasks for the meta-learning phase. Labels are only needed at the final target task learning step, and as few as one labeled sample per class can suffice. On the Omniglot and Mini-Imagenet few-shot learning benchmarks, UMTRA outperforms every tested approach based on unsupervised learning of representations, while alternating for the best performance with the recent CACTUs algorithm. Compared to supervised model-agnostic meta-learning approaches, UMTRA trades off some classification accuracy for a reduction in the required labels of several orders of magnitude.\n\n1 Introduction\n\nMeta-learning or \u201clearning-to-learn\u201d approaches have been proposed in the neural networks literature since the 1980s [29, 4]. The general idea is to prepare the network through several learning tasks T1 . . . Tn in a meta-learning phase, such that when presented with the target task Tn+1, the network will be ready to learn it as efficiently as possible.\nRecently proposed model-agnostic meta-learning approaches [11, 23] can be applied to any differentiable network. When used for classification, the target learning phase consists of several gradient descent steps on a backpropagated supervised classification loss. Unfortunately, these approaches require the learning tasks Ti to have the same supervised learning format as the target task. 
Acquiring\nlabeled data for a large number of tasks is not only a problem of cost and convenience but also puts\nconceptual limits on the type of problems that can be solved through meta-learning. If we need to\nhave labeled training data for tasks T1 . . .Tn in order to learn task Tn+1, this limits us to task types\nthat are variations of tasks known and solved (at least by humans).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fSupervised MAML\n\nUMTRA\n\nFigure 1: The process of creation of the training and validation data of the meta-training task T .\n(top) Supervised MAML: We start from a dataset where the samples are labeled with their class.\nThe training data is created by sampling N distinct classes CLi, and choosing a random sample xi\nfrom each. The validation data is created by choosing a different sample x(cid:48)\ni from the same class.\n(bottom) UMTRA: We start from a dataset of unlabeled data. The training data is created by randomly\nchoosing N samples xi from the dataset. The validation data is created by applying the augmentation\nfunction A to each sample from the training data. For both MAML and UMTRA, arti\ufb01cial temporary\nlabels 1, 2 . . . N are used.\n\nIn this paper, we propose an algorithm called Unsupervised Meta-learning with Tasks constructed by\nRandom sampling and Augmentation (UMTRA) that performs meta-learning of one-shot or few-shot\nclassi\ufb01ers in an unsupervised manner on an unlabeled dataset. Instead of starting from a collection of\nlabeled tasks, {. . .Ti . . .}, UMTRA starts with a collection of unlabeled data U = {. . . xi . . .}. We\nhave only a set of relatively easy-to-satisfy requirements towards U: Its objects have to be drawn\nfrom the same distribution as the objects classi\ufb01ed in the target task and it must have a set of classes\nsigni\ufb01cantly larger than the number of classes of the \ufb01nal classi\ufb01er. 
Starting from this unlabeled\ndataset, UMTRA uses statistical diversity properties and domain speci\ufb01c augmentations to generate\nthe training and validation data for a collection of synthetic tasks, {. . .T (cid:48)\ni . . .}. These tasks are\nthen used in the meta-learning process based on a modi\ufb01ed classi\ufb01cation variant of the MAML\nalgorithm [11]. Figure 1 summarizes the differences between the original supervised MAML model\nand the process of generating synthetic tasks from unsupervised data in UMTRA.\nThe contributions of this paper can be summarized as follows:\n\nfew-shot classi\ufb01cation by generating synthetic meta-learning data with arti\ufb01cial labels.\n\n\u2022 We describe a novel algorithm that allows unsupervised, model-agnostic meta-learning for\n\u2022 From a theoretical point of view, we demonstrate a relationship between generalization error\nand the loss backpropagated from the validation set in MAML. Our intuition is that we can\ngenerate unsupervised validation tasks which can perform effectively if we are able to span\nthe space of the classes by generating useful samples with augmentation.\n\u2022 On all the Omniglot and Mini-Imagenet few-shot learning benchmarks, UMTRA outper-\nforms every tested approach based on unsupervised learning of representations. 
It also achieves a significant percentage of the accuracy of the supervised MAML approach, while requiring vastly fewer labels. For instance, for 5-way 5-shot classification on the Omniglot dataset UMTRA obtains a 95.43% accuracy with only 25 labels, while supervised MAML obtains 98.83% with 24025. Compared with recent unsupervised meta-learning approaches building on top of stock MAML, UMTRA alternates for the best performance with the CACTUs algorithm.\n\n2 Related Work\n\nFew-shot or one-shot learning of classifiers has significant practical applications. Unfortunately, the few-shot learning model is not a good fit for the traditional training approaches of deep neural networks, which work best with large amounts of data. In recent years, significant research has targeted approaches that allow deep neural networks to work in few-shot learning settings. One possibility is to perform transfer learning, but it was found that the accuracy decreases if the target task diverges from the trained task. One solution to mitigate this is through the use of an adversarial loss [18].\nA large class of approaches aims to enable few-shot learning by meta-learning, the general idea being that the meta-learning prepares the network to learn from the small amount of training data available in the few-shot learning setting. 
Note that meta-learning can be also used in other computer vision\napplications, such as fast adaptation for tracking in video [25]. The mechanisms through which\nmeta-learning is implemented can be loosely classi\ufb01ed in two groups. One class of approaches use a\ncustom network architecture for encoding the information acquired during the meta-learning phase,\nfor instance in fast weights [3], neural plasticity values [21], custom update rules [20], the state of\ntemporal convolutions [22] or in the memory of an LSTM [27]. The advantage of this approach is that\nit allows us to \ufb01ne-tune the architecture for the ef\ufb01cient encoding of the meta-learning information. A\ndisadvantage, however, is that it constrains the type of network architectures we can use; innovations\nin network architectures do not automatically transfer into the meta-learning approach. In a custom\nnetwork architecture meta-learning model, the target learning phase is not the customary network\nlearning, as it needs to take advantage of the custom encoding.\nA second, model-agnostic class of approaches aim to be usable for any differentiable network\narchitecture. Examples of these algorithms are MAML [11] or Reptile [23], where the aim is to\nencode the meta-learning in the weights of the network, such that the network performs the target\nlearning phase with ef\ufb01cient gradients. Approaches that customize the learning rates [19] during\nmeta-training can also be grouped in this class. For this type of approaches, the target learning phase\nuses the well-established learning algorithms that would be used if learning from scratch (albeit it\nmight use speci\ufb01c hyperparameter settings, such as higher learning rates). We need to point out,\nhowever, that the meta-learning phase uses custom algorithms in these approaches as well (although\nthey might use the standard learning algorithm in the inner loop, such as in the case of MAML). 
A\nrecent work similar in spirit to ours is the CACTUs unsupervised meta-learning model described\nin [14].\nIn this paper, we perform unsupervised meta-learning. Our approach generates tasks from unlabeled\ndata which will help it to understand the structures of the relevant supervised tasks in the future. One\nshould note that these relevant supervised tasks in the future do not have any intersection with the\ntasks which are used during the meta-learning. For instance, Wu et al. perform unsupervised learning\nby recognizing a certain internal structure between dataset classes [32]. By learning this structure, the\napproach can be extended to semi-supervised learning. In addition, Pathak et al. propose a method\nwhich learns object features in an interesting unsupervised way by detecting movement patterns of\nsegmented objects [26]. These approaches are orthogonal to ours. We do not make assumptions\nthat the unsupervised data shares classes with the target learning (in fact, we explicitly forbid it).\nFinally, [13] de\ufb01ne unsupervised meta-learning in reinforcement learning context. The authors study\nhow to generate tasks with synthetic reward functions (without supervision) such that when the policy\nnetwork is meta trained on them, they can learn real tasks with manually de\ufb01ned reward functions\n(with supervision) much more quickly and with fewer samples.\n\n3\n\n\f3 The UMTRA algorithm\n\n3.1 Preliminaries\nWe consider the task of classifying samples x drawn from a domain X into classes yi \u2208 Y =\n{C1, . . . , CN}. The classes are encoded as one-hot vectors of dimensionality N. We are interested\nin learning a classi\ufb01er f\u03b8 that outputs a probability distribution over the classes. It is common to\nenvision f as a deep neural network parameterized by \u03b8, although this is not the only possible choice.\nWe package a certain supervised learning task, T , of type (N, K), that is with N classes of K training\nsamples each, as follows. 
The training data will have the form (xi, yi), where i = 1 . . . N \u00d7 K,\nxi \u2208 X and yi \u2208 Y , with exactly K samples for each value of yi. In the recent meta-learning\nliterature, it is often assumed that the task T has K samples of each class for training and (separately),\nK samples for validation (xv\nIn supervised meta-learning, we have access to a collection of tasks T1 . . .Tn drawn from a speci\ufb01c\ndistribution, with both supervised training and validation data. The meta-learning phase uses this\ncollection of tasks, while the target learning uses a new task T with supervised learning data but no\nvalidation data.\n\nj , yv\n\nj ).\n\n3.2 Model\n\nUnsupervised meta-learning retains the goal of meta-learning by preparing a learning system for the\nrapid learning of the target task T . However, instead of the collection of tasks T1 . . .Tn and their\nassociated labeled training data, we only have an unlabeled dataset U = {. . . xi . . .}, with samples\ndrawn from the same distribution as the target task. We assume that every element of this dataset\nis associated with a natural class C1 . . . Cc, \u2200xi \u2203j such that xi \u2208 Cj. We will assume that N (cid:28) c,\nthat is, the number of natural classes in the unsupervised dataset is much higher than the number of\nclasses in the target task. These requirements are much easier to satisfy than the construction of the\ntasks for supervised meta-learning - for instance, simply stripping the labels from datasets such as\nOmniglot and Mini-ImageNet satis\ufb01es them.\nThe pseudo-code of the UMTRA algorithm is described in Algorithm 1. In the following, we describe\nthe various parts of the algorithm in detail. In order to be able to run the UMTRA algorithm on\nunsupervised data, we need to create tasks Ti from the unsupervised data that can serve the same role\nas the meta-learning tasks serve in the full MAML algorithm. 
For such a task, we need to create both the training data D and the validation data D'.\nCreating the training data: In the original form of the MAML algorithm, the training data of the task T must have the form (x, y), and we need N \u00d7 K of them. The exact labels used during the meta-training step are not relevant, as they are discarded during the meta-training phase. They can thus be replaced with artificial labels, by setting y \u2208 {1, . . . , N}. It is, however, important that the labels maintain class distinctions: if two data points have the same label, they should also have the same artificial label, while if they have different labels, they should have different artificial labels.\nThe first difference between UMTRA and MAML is that during the meta-training phase, we always perform one-shot learning, with K = 1. Note that during the target learning phase we can still set values of K different from 1. The training data is created as the set Di = {(x1, 1), . . . , (xN, N)}, with xi sampled randomly from U.\nLet us see how this training data construction satisfies the class distinction conditions. The first condition is satisfied because there is only one sample for each label. The second condition is satisfied statistically by the fact that N \u226a c, where c is the total number of classes in the dataset. If the number of samples is significantly smaller than the number of classes, it is likely that all the samples will be drawn from different classes. If we assume that the samples are equally distributed among the classes (e.g. m samples for each class), the probability that all samples are in a different class is equal to\n\nP = [(c \u00b7 m) \u00b7 ((c \u2212 1) \u00b7 m) \u00b7\u00b7\u00b7 ((c \u2212 N + 1) \u00b7 m)] / [(c \u00b7 m) \u00b7 (c \u00b7 m \u2212 1) \u00b7\u00b7\u00b7 (c \u00b7 m \u2212 N + 1)] = [c! \u00b7 m^N \u00b7 (c \u00b7 m \u2212 N)!] / [(c \u2212 N)! \u00b7 (c \u00b7 m)!]   (1)\n\nTo illustrate this, the probability for 5-way classification on the Omniglot dataset, where each of the 1200 characters is a separate class (c = 1200, N = 5), is 99.21%. For Mini-ImageNet (c = 64), the probability is 85.23%, while for the full ImageNet it would be about 99%.\n\nAlgorithm 1: Unsupervised Meta-learning with Tasks constructed by Random sampling and Augmentation (UMTRA)\nrequire: N: class-count, NMB: meta-batch size, NU: no. of updates\nrequire: U = {. . . xi . . .}: unlabeled dataset\nrequire: \u03b1, \u03b2: step size hyperparameters\nrequire: A: augmentation function\n1 randomly initialize \u03b8;\n2 while not done do\n3   for i in 1 . . . NMB do\n4     Sample N data points x1 . . . xN from U;\n5     Ti \u2190 {x1, . . . , xN};\n6     Generate training set Di = {(x1, 1), . . . , (xN, N)};\n7     \u03b8'i = \u03b8;\n8     for j in 1 . . . NU do\n9       Evaluate \u2207\u03b8'i LTi(f\u03b8'i) on Di;\n10      Compute adapted parameters with gradient descent: \u03b8'i = \u03b8'i \u2212 \u03b1\u2207\u03b8'i LTi(f\u03b8'i);\n11    end\n12  end\n13  foreach Ti do\n14    Generate validation set for the meta-update D'i = {(A(x1), 1), . . . , (A(xN), N)};\n15  end\n16  Update \u03b8 \u2190 \u03b8 \u2212 \u03b2\u2207\u03b8 \u2211Ti LTi(f\u03b8'i) using each D'i;\n17 end\n\nCreating the validation data: For the MAML approach, the validation data of the meta-training tasks is actually training data in the outer loop. It is thus required that we create a validation dataset D'i = {(x'1, 1), . . . , (x'N, N)} for each task Ti. Thus we need to create appropriate validation data for the synthetic task. A minimum requirement for the validation data is to be correctly labeled in the given context. 
This means that the synthetic numerical label should map in both cases to the same class in the unlabeled dataset: \u2203 C such that xi, x'i \u2208 C.\nIn the original MAML model, these x'i values are labeled examples that are part of the supervised dataset. In our case, picking such x'i values is non-trivial, as we don't have access to the actual class. Instead, we propose to create such a sample by augmenting the sample used in the training data using an augmentation function x'i = A(xi), which is a hyperparameter of the UMTRA algorithm. A requirement towards the augmentation function is to maintain class membership: x \u2208 C \u21d2 A(x) \u2208 C. We should aim to construct the augmentation function to verify this property for the given dataset U, based on what we know about the domain described by the dataset. However, as we do not have access to the classes, such a verification is not practically possible on a concrete dataset.\nAnother choice for the augmentation function A is to apply some kind of domain-specific change to the images or videos. Examples of these include setting some of the pixel values to zero in the image (Figure 2, left), or translating the pixels of the training image by some amount (e.g. between -6 and 6).\nThe overall process of generating the training data from the unlabeled dataset in UMTRA and the differences from the supervised MAML approach are illustrated in Figure 1.\n\n3.3 Some theoretical considerations\n\nWhile a full formal model of the learning ability of the UMTRA algorithm is beyond the scope of this paper, we can investigate some aspects of its behavior that shed light on why the algorithm works, and why augmentation improves its performance. Let us denote our network by a parameterized function f\u03b8. As we want to learn a few-shot classification task T, we are searching for the corresponding function fT, to which we do not have access. 
To learn this function, we use the training dataset DT = {(xi, yi)}, i = 1 . . . n \u00d7 k. For this particular task, we update our parameters (to \u03b8') to fit this task's training dataset. In other words, we want f\u03b8' to be a good approximation of fT.\n\nFigure 2: Augmentation techniques on Omniglot (left) and Mini-Imagenet (right). Top row: Original images in training data. Bottom: augmented images for the validation set, transformed with an augmentation function A. Auto Augment [8] applies augmentations from a learned policy based on combinations of translation, rotation, or shearing.\n\nFinding \u03b8' such that \u03b8' = argmin\u03b8 \u2211(xi,yi)\u2208DT L(yi, f\u03b8(xi)) is ill-defined, because it has more than one solution. In meta-learning, we search for the \u03b8' value that gives us the minimum generalization error, the measure of how accurately an algorithm is able to predict outcome values for unseen data [1]. We can estimate the generalization error based on sampled data points from the same task. Without loss of generality, let us consider a sampled data point (x0, y0). We can estimate the generalization error on this point as L(y0, f\u03b8'(x0)). In the case of mean squared error, and by accepting an irreducible error \u03b5 \u223c N(0, \u03c3), we can decompose the expected generalization error as follows [16, 12]:\n\nE[L(y0, f\u03b8'(x0))] = (E[f\u03b8'(x0)] \u2212 fT(x0))^2 + E[(f\u03b8'(x0))^2] \u2212 E[f\u03b8'(x0)]^2 + \u03c3^2   (2)\n\nIn this equation, when (x0, y0) \u2209 DT we have E[(f\u03b8'(x0))^2] \u2212 E[f\u03b8'(x0)]^2 = 0, which means that the estimation of the generalization error on these samples will be as unbiased as possible (only biased by \u03c3^2). On the other hand, if (x0, y0) \u2208 DT, the estimation of the error is going to be highly biased. 
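This bias can be illustrated with a small Monte-Carlo simulation. The toy least-squares task below is an illustrative assumption rather than anything from the paper; it shows that the squared error measured at a training point systematically underestimates the error at a fresh point of the same task.

```python
import numpy as np

# Toy illustration (not from the paper) of the bias of the training-point
# error estimate: fit y = w*x by least squares on a few noisy points, then
# compare the squared error at a training point vs. at an unseen point.

rng = np.random.default_rng(0)

def trial(n=5, sigma=0.5):
    xs = rng.uniform(-1, 1, n)
    ys = 2.0 * xs + rng.normal(0, sigma, n)   # task: f_T(x) = 2x plus noise
    w = np.sum(xs * ys) / np.sum(xs * xs)     # least-squares fit of y = w*x
    err_train = (ys[0] - w * xs[0]) ** 2      # error estimated at a training point
    x0 = rng.uniform(-1, 1)
    y0 = 2.0 * x0 + rng.normal(0, sigma)
    err_fresh = (y0 - w * x0) ** 2            # error estimated at an unseen point
    return err_train, err_fresh

errs = np.array([trial() for _ in range(20000)])
print(errs.mean(axis=0))  # the training-point error is visibly smaller on average
```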
We conjecture that similar results will be observed for other loss functions as well, with the estimate of the loss function being more biased if the samples come from the training data rather than from outside it. As the outer loop of MAML estimates the generalization error on a validation set for each task in a batch of tasks, it is important to keep the validation set separate from the training set, as this estimate will eventually be applied to the starter network.\nIn contrast, if we pick our validation set as points in DT, our algorithm is going to learn to minimize a biased estimation of the generalization error. Our experiments also show that if we choose the same data for training and testing (A(x) = x), we end up with an accuracy almost the same as training from scratch. UMTRA, however, tries to improve the estimation of the generalization error with augmentation techniques. Our experiments show that by applying UMTRA with a good choice of augmentation function, we can achieve results comparable with supervised meta-learning algorithms. In the supplementary material, we show that UMTRA is able to adapt to a new task very quickly, with just a few iterations. Last but not least, in comparison with the CACTUs algorithm, which applies advanced clustering algorithms such as DeepCluster [6], ACAI [5], and BiGAN [10] to generate the training and validation sets for each task, our method does not require clustering.\n\n4 Experiments\n\n4.1 UMTRA on the Omniglot dataset\n\nOmniglot [17] is a dataset of handwritten characters frequently used to compare few-shot learning algorithms. It comprises 1623 characters from 50 different alphabets. Every character in Omniglot has 20 different instances, each written by a different person. 
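With these counts (c = 1200 characters used for meta-training, m = 20 instances each), the class-collision probability of Eq. (1) is easy to check numerically. The sketch below uses the product form of the equation and reproduces the percentages quoted in Section 3.2; the function name is ours, for illustration only.

```python
from math import prod

def p_all_distinct(c, m, n):
    """Probability that n samples drawn (without replacement) from c classes
    with m samples each all land in different classes; product form of Eq. (1)."""
    return prod((c - j) * m / (c * m - j) for j in range(n))

print(round(100 * p_all_distinct(1200, 20, 5), 2))  # Omniglot: 99.21
print(round(100 * p_all_distinct(64, 600, 5), 2))   # Mini-ImageNet: 85.23
```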
To allow comparisons with other published results, in our experiments we follow the experimental protocol described in [28]: 1200 characters were used for training, 100 characters for validation, and 323 characters for testing.\nUMTRA, like the supervised MAML algorithm, is model-agnostic; that is, it does not impose conditions on the actual network architecture used in the learning. This does not, of course, mean that the algorithm performs identically for every network structure and dataset.\n\n(Figure 2 panel labels: Same, Flip, Grayscale, Rotate, Translation + Zeroing Pixels, Auto Augment)\n\nTable 1: The influence of the augmentation function on the accuracy of UMTRA for 5-way one-shot classification (left: Omniglot dataset, right: Mini-Imagenet dataset). For all cases, we use meta-batch size NMB = 4 and number of updates NU = 5, except the ones with best hyperparameters.\n\nOmniglot -- Augmentation Function A | Accuracy\nTraining from scratch | 52.50\nA = 1 (identity) | 52.93\nA = randomly zeroed pixels | 56.23\nA = randomly zeroed pixels (with best hyperparameters) | 67.00\nA = randomly zeroed pixels + random shift (with best hyperparameters) | 83.80\nSupervised MAML | 98.7\n\nMini-Imagenet -- Augmentation Function A | Accuracy\nTraining from scratch | 24.17\nA = 1 (identity) | 26.49\nA = shift + random flip | 30.16\nA = shift + random flip + randomly change to grayscale | 32.80\nA = shift + random flip + random rotation + color distortions | 35.09\nA = Auto Augment [8] | 39.93\nSupervised MAML | 46.81\n\nTable 2: The effect of the hyperparameters meta-batch size, NMB, and number of updates, NU, on accuracy. Omniglot 5-way one-shot.\n\nNMB:     1      2      4      8      16     25\nNU = 1:  67.08  79.04  80.72  81.60  82.72  83.80\nNU = 5:  76.08  76.68  77.20  79.56  81.12  83.32\nNU = 10: 79.20  79.24  80.92  80.68  83.52  83.26\n\nIn order to separate the performance of the architecture and the meta-learner, we run our experiments using an architecture originally proposed in [31]. 
This classi\ufb01er uses four 3 x 3 convolutional modules with 64 \ufb01lters each,\nfollowed by batch normalization [15], a ReLU nonlinearity and 2 x 2 max-pooling. On the resulting\nfeature embedding, the classi\ufb01er is implemented as a fully connected layer followed by a softmax\nlayer.\nUMTRA has a relatively large hyperparameter space that includes the augmentation function. As\npointed out in a recent study involving performance comparisons in semi-supervised systems [24],\nexcessive tuning of hyperparameters can easily lead to an overestimation of the performance of\nan approach compared to simpler approaches. Thus, for the comparison in the remainder of this\npaper, we keep a relatively small budget for hyperparameter search: beyond basic sanity checks, we\nonly tested 5-10 hyperparameter combinations per dataset, without specializing them to the N or\nK parameters of the target task. Table 1, left, shows several choices for the augmentation function\nfor the 5-way one-shot classi\ufb01cation on Omniglot. Based on this table, in comparing with other\napproaches, we use an augmentation function consisting of randomly zeroed pixels and random shift.\nIn our experiments, we realized two of the most important hyperparameters in meta-learning are\nmeta-batch size, NM B, and number of updates, NU . In table 2, we study the effects of these\nhyperparameters on the accuracy of the network for the randomly zeroed pixels and random shift\naugmentation. Based on this experiment, we decide to \ufb01x the meta-batch size to 25 and number of\nupdates to 1.\nIn order to \ufb01nd out the relationship between the level of the augmentation and accuracy, we apply\ndifferent levels of augmentation on images. If the generated samples are different from current\nobservation but within the same class manifold, UMTRA performs well. 
The results of this experiment are shown in Table 3.\n\nTable 3: The effect of the augmentation level on UMTRA's accuracy on the Omniglot dataset. In all of the experiments we use random pixel zeroing with meta-batch size NMB = 25 and number of updates NU = 1.\n\nTranslation Range (Pixels): 0     0-3   3-6   0-6   6-9   9-12  0-9\nAccuracy %:                 67.0  82.8  80.4  83.8  79.8  77    80.4\n\nThe second consideration is what sort of baseline we should use when evaluating our approach on a few-shot learning task. Clearly, supervised meta-learning approaches such as the original MAML [11] are expected to outperform our approach, as they use a labeled training set. A simple baseline is to use the same network architecture trained from scratch with only the final few-shot labeled set. If our algorithm takes advantage of the unsupervised training set U, as expected, it should outperform this baseline.\nA more competitive comparison can be made against networks that are first trained to obtain a favorable embedding using unsupervised learning on U, with the resulting embedding used on the few-shot learning task. These baselines are not meta-learning approaches; however, we can train them with the same target task training set as UMTRA. Similar to [14], we compare the following unsupervised pre-training approaches: ACAI [5], BiGAN [10], DeepCluster [6] and InfoGAN [7]. These up-to-date approaches cover a wide range of the recent advances in the area of unsupervised feature learning. Finally, we also compare against the CACTUs unsupervised meta-learning algorithm proposed in [14], combined with MAML and ProtoNets [30]. As a note, another unsupervised meta-learning approach related to UMTRA and CACTUs is AAL [2]. However, as [2] doesn't compare against stock MAML, the results are not directly comparable.\n\nTable 4: Accuracy in % of N-way K-shot (N, K) learning methods on the Omniglot and Mini-Imagenet datasets. The ACAI / DC label means ACAI clustering on Omniglot and DeepCluster on Mini-Imagenet. The source of non-UMTRA values is [14].\n\nClustering | Algorithm | Omniglot (5,1) (5,5) (20,1) (20,5) | Mini-Imagenet (5,1) (5,5) (5,20) (5,50)\nN/A | Training from scratch | 52.50 74.78 24.91 47.62 | 27.59 38.48 51.53 59.63\nBiGAN | knn-nearest neighbors | 49.55 68.06 27.37 46.70 | 25.56 31.10 37.31 43.60\nBiGAN | linear classifier | 48.28 68.72 27.80 45.82 | 27.08 33.91 44.00 50.41\nBiGAN | MLP with dropout | 40.54 62.56 19.92 40.71 | 22.91 29.06 40.06 48.36\nBiGAN | cluster matching | 43.96 58.62 21.54 31.06 | 24.63 29.49 33.89 36.13\nBiGAN | CACTUs-MAML | 58.18 78.66 35.56 58.62 | 36.24 51.28 61.33 66.91\nBiGAN | CACTUs-ProtoNets | 54.74 71.69 33.40 50.62 | 36.62 50.16 59.56 63.27\nACAI / DC | knn-nearest neighbors | 57.46 81.16 39.73 66.38 | 28.90 42.25 56.44 63.90\nACAI / DC | linear classifier | 61.08 81.82 43.20 66.33 | 29.44 39.79 56.19 65.28\nACAI / DC | MLP with dropout | 51.95 77.20 30.65 58.62 | 29.03 39.67 52.71 60.95\nACAI / DC | cluster matching | 54.94 71.09 32.19 45.93 | 22.20 23.50 24.97 26.87\nACAI / DC | CACTUs-MAML | 68.84 87.78 48.09 73.36 | 39.90 53.97 63.84 69.64\nACAI / DC | CACTUs-ProtoNets | 66.27 83.58 47.75 68.12 | 39.18 53.36 61.54 63.55\nN/A | UMTRA (ours) | 83.80 95.43 74.25 92.12 | 39.93 50.73 61.11 67.15\nN/A | MAML (Supervised) | 94.46 98.83 84.60 96.29 | 46.81 62.13 71.03 75.54\nN/A | ProtoNets (Supervised) | 98.35 99.58 95.31 98.81 | 46.56 62.29 70.05 72.04\n\nTable 4 (Omniglot columns) shows the results of these experiments. For the UMTRA approach we trained for 6000 meta-iterations for the 5-way, and 36,000 meta-iterations for the 20-way classifications. 
Our approach, with the proposed hyperparameter settings, outperforms by large margins both training from scratch and the approaches based on unsupervised representation learning. UMTRA also outperforms, with a smaller margin, the CACTUs approach on all metrics, and in combination with both MAML and ProtoNets.

As expected, the supervised meta-learning baselines perform better than UMTRA. To put this value in perspective, we need to take into consideration the vast difference in the number of labels needed for these approaches. In 5-way one-shot classification, UMTRA obtains an 83.80% accuracy with only 5 labels, while supervised MAML obtains 94.46% but requires 24,005 labels (the 24,000 labels of the meta-training set plus the 5 target labels). For 5-way 5-shot classification, UMTRA obtains a 95.43% accuracy with only 25 labels, while supervised MAML obtains 98.83% with 24,025 labels.

4.2 UMTRA on the Mini-Imagenet dataset

The Mini-Imagenet dataset was introduced by [27] as a subset of the ImageNet dataset [9], suitable as a benchmark for few-shot learning algorithms. The dataset is limited to 100 classes, each with 600 images. We divide our dataset into train, validation and test subsets according to the experimental protocol proposed by [31]. The classifier network is similar to the one used in [11].

Since Mini-Imagenet is a dataset with larger images and more complex classes compared to Omniglot, we need to choose augmentation functions suitable to the model. We investigated several simple choices involving random flips, shifts, rotations, and color changes. In addition to these hand-crafted algorithms, we also investigated the learned auto-augmentation method proposed in [8]. Table 1, right, shows the accuracy results for the tested augmentation functions.
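As a concrete illustration, hand-crafted candidates of this kind (flips, shifts, color changes) can be written as simple NumPy array transforms. This is a minimal sketch for an H×W×C float image in [0, 1]; the function names and parameter ranges are our own assumptions, not the exact settings used in the experiments:

```python
import numpy as np

def random_flip(img, rng):
    # Horizontal flip with probability 0.5.
    return img[:, ::-1] if rng.random() < 0.5 else img

def random_shift(img, rng, max_px=6):
    # Random translation by up to max_px pixels in each direction;
    # wrap-around via np.roll keeps the sketch short.
    dy, dx = rng.integers(-max_px, max_px + 1, size=2)
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def random_color_jitter(img, rng, strength=0.2):
    # Per-channel brightness scaling, clipped back into [0, 1].
    scale = 1.0 + rng.uniform(-strength, strength, size=img.shape[-1])
    return np.clip(img * scale, 0.0, 1.0)

def augment(img, rng):
    # Compose the candidates into a single augmentation function.
    return random_color_jitter(random_shift(random_flip(img, rng), rng), rng)
```

In UMTRA, an augmentation function of this form is applied to each sampled image to produce the validation half of a synthetic task.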
We found that auto-augmentation provided the best results; thus this approach was used in the remainder of the experiments.

The last four columns of Table 4 list the experimental results for few-shot classification learning on the Mini-Imagenet dataset. Similar to the Omniglot dataset, UMTRA performs better than learning from scratch and all the approaches that use unsupervised representation learning. It performs weaker than supervised meta-learning approaches that use labeled data. Compared to the various combinations involving the CACTUs unsupervised meta-learning algorithm, UMTRA performs better on 5-way one-shot classification, while it is outperformed by the CACTUs-MAML with DeepCluster combination for 5-, 20- and 50-shot classification.

A possible question is whether the improvements we see are due to the meta-learning process or to the augmentation enriching the few-shot dataset. To investigate this, we performed several experiments on Omniglot and Mini-Imagenet by training the target tasks from scratch on the augmented target dataset. For 5-way, 1-shot learning on Omniglot the accuracy was: training from scratch 52.5%, training from scratch with augmentation 55.8%, UMTRA 83.8%. For Mini-Imagenet the numbers were: from scratch without augmentation 27.6%, from scratch with augmentation 28.8%, UMTRA 39.93%. We conclude that while augmentation does provide a (minor) improvement on the target training by itself, the majority of the improvement shown by UMTRA is due to the meta-learning process.

The results on Omniglot and Mini-Imagenet allow us to draw the preliminary conclusion that unsupervised meta-learning approaches like UMTRA and CACTUs, which generate meta-tasks Ti from the unsupervised training data, tend to outperform other approaches for a given unsupervised training set U. UMTRA and CACTUs use different, orthogonal approaches for building the tasks Ti.
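UMTRA's side of this task construction can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions (uniform sampling without replacement from the unlabeled pool, a caller-supplied augmentation function), not the authors' released implementation:

```python
import numpy as np

def build_umtra_task(unlabeled, augment, n_way=5, rng=None):
    """Create one synthetic N-way 1-shot task from a flat unlabeled set.

    With many underlying classes, n_way uniformly sampled images are
    likely to belong to n_way distinct classes, so each sample can act
    as its own pseudo-class.
    """
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(unlabeled), size=n_way, replace=False)
    train_x = unlabeled[idx]                         # one "shot" per pseudo-class
    train_y = np.arange(n_way)                       # pseudo-labels 0..N-1
    val_x = np.stack([augment(x) for x in train_x])  # augmented copies
    return (train_x, train_y), (val_x, train_y.copy())

def prob_all_distinct(n_classes, n_way):
    # Probability that n_way uniform draws over n_classes balanced
    # classes hit n_way distinct classes -- the statistical likelihood
    # the sampling step relies on.
    p = 1.0
    for i in range(n_way):
        p *= (n_classes - i) / n_classes
    return p
```

For example, with on the order of a thousand available classes, `prob_all_distinct(1100, 5)` is about 0.99, so almost every sampled 5-way task contains five genuinely different classes.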
UMTRA uses the statistical likelihood of picking different classes for the training data of Ti in the case of K = 1 and a large number of classes, and an augmentation function for the validation data. CACTUs relies on an unsupervised clustering algorithm to provide a statistical likelihood of difference and sameness in the training and validation data of Ti. Except in the case of UMTRA with A = 1, both approaches require domain-specific knowledge. The choice of the right augmentation function for UMTRA, the right clustering approach for CACTUs, and the other hyperparameters (for both approaches) have a strong impact on the performance.

5 Conclusions

In this paper, we described the UMTRA algorithm for few-shot and one-shot learning of classifiers. UMTRA performs meta-learning on an unlabeled dataset in an unsupervised fashion, without putting any constraint on the classifier network architecture. Experimental studies over the few-shot learning image benchmarks Omniglot and Mini-Imagenet show that UMTRA outperforms learning-from-scratch approaches and approaches based on unsupervised representation learning. It alternated in obtaining the best result with the recently proposed CACTUs algorithm, which takes a different approach to unsupervised meta-learning by applying clustering on an unlabeled dataset. The statistical sampling and augmentation performed by UMTRA can be seen as a cheaper alternative to the dataset-wide clustering performed by CACTUs. The results also open the possibility that these approaches might be orthogonal, and in combination might yield an even better performance. For all experiments, UMTRA performed worse than the equivalent supervised meta-learning approach, but required 3-4 orders of magnitude less labeled data.
The supplemental material shows that UMTRA is not limited to image classification but can be applied to other tasks as well, such as video classification.

Acknowledgements: This research is based upon work supported in part by the National Science Foundation under Grant numbers IIS-1409823 and IIS-1741431 and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The views, findings, opinions, and conclusions or recommendations contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the NSF, ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

References

[1] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin. Learning from data, volume 4. AMLBook, New York, NY, USA, 2012.

[2] A. Antoniou and A. Storkey. Assume, augment and learn: Unsupervised few-shot meta-learning via random labels and data augmentation. arXiv preprint arXiv:1902.09884, 2019.

[3] J. Ba, G. E. Hinton, V. Mnih, J. Z. Leibo, and C. Ionescu. Using fast weights to attend to the recent past. In Proc. of Advances in Neural Information Processing Systems (NIPS), pages 4331–4339, 2016.

[4] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d'informatique et de recherche opérationnelle, 1990.

[5] D. Berthelot, C. Raffel, A. Roy, and I. Goodfellow. Understanding and improving interpolation in autoencoders via an adversarial regularizer. arXiv preprint arXiv:1807.07543, 2018.

[6] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In Proc.
of the European Conf. on Computer Vision (ECCV), pages 132–149, 2018.

[7] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Proc. of Advances in Neural Information Processing Systems (NIPS), pages 2172–2180, 2016.

[8] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.

[10] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

[11] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. Proc. of Int'l Conf. on Machine Learning (ICML), 2017.

[12] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, 2001.

[13] A. Gupta, B. Eysenbach, C. Finn, and S. Levine. Unsupervised meta-learning for reinforcement learning. arXiv preprint arXiv:1806.04640, 2018.

[14] K. Hsu, S. Levine, and C. Finn. Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334, 2018.

[15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[16] G. James, D. Witten, T. Hastie, and R. Tibshirani. An introduction to statistical learning, volume 112. Springer, 2013.

[17] B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum. One shot learning of simple visual concepts. In Proc. of the Annual Meeting of the Cognitive Science Society, volume 33, 2011.

[18] Z. Luo, Y.
Zou, J. Hoffman, and L. F. Fei-Fei. Label efficient learning of transferable representations across domains and tasks. In Proc. of Advances in Neural Information Processing Systems (NIPS), pages 165–177, 2017.

[19] F. Meier, D. Kappler, and S. Schaal. Online learning of a memory for learning rates. In Proc. of IEEE Int'l Conf. on Robotics and Automation (ICRA), pages 2425–2432, 2018.

[20] L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein. Learning unsupervised learning rules. arXiv preprint arXiv:1804.00222, 2018.

[21] T. Miconi, J. Clune, and K. O. Stanley. Differentiable plasticity: Training plastic neural networks with backpropagation. arXiv preprint arXiv:1804.02464, 2018.

[22] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141, 2018.

[23] A. Nichol and J. Schulman. Reptile: A scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.

[24] A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 3235–3246, 2018.

[25] E. Park and A. C. Berg. Meta-tracker: Fast and robust online adaptation for visual object trackers. In Proc. of the European Conf. on Computer Vision (ECCV), pages 569–585, 2018.

[26] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2701–2710, 2017.

[27] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. Proc. of Int'l Conf. on Learning Representations (ICLR), 2016.

[28] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In Proc. of Int'l Conf.
on Machine Learning (ICML), pages 1842–1850, 2016.

[29] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

[30] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Proc. of Advances in Neural Information Processing Systems (NIPS), pages 4077–4087, 2017.

[31] O. Vinyals, C. Blundell, T. Lillicrap, and D. Wierstra. Matching networks for one shot learning. In Proc. of Advances in Neural Information Processing Systems (NIPS), pages 3630–3638, 2016.

[32] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3733–3742, 2018.