{"title": "Matching Networks for One Shot Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3630, "page_last": 3638, "abstract": "Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 82.2% to 87.8% and from 88% accuracy to 95% accuracy on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.", "full_text": "Matching Networks for One Shot Learning\n\nOriol Vinyals\n\nGoogle DeepMind\n\nvinyals@google.com\n\nCharles Blundell\nGoogle DeepMind\n\ncblundell@google.com\n\nTimothy Lillicrap\nGoogle DeepMind\n\ncountzero@google.com\n\nKoray Kavukcuoglu\nGoogle DeepMind\n\nkorayk@google.com\n\nDaan Wierstra\nGoogle DeepMind\n\nwierstra@google.com\n\nAbstract\n\nLearning from a few examples remains a key challenge in machine learning.\nDespite recent advances in important domains such as vision and language, the\nstandard supervised deep learning paradigm does not offer a satisfactory solution\nfor learning new concepts rapidly from little data. In this work, we employ ideas\nfrom metric learning based on deep neural features and from recent advances\nthat augment neural networks with external memories. 
Our framework learns a\nnetwork that maps a small labelled support set and an unlabelled example to its\nlabel, obviating the need for fine-tuning to adapt to new class types. We then define\none-shot learning problems on vision (using Omniglot, ImageNet) and language\ntasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to\n93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches.\nWe also demonstrate the usefulness of the same model on language modeling by\nintroducing a one-shot task on the Penn Treebank.\n\n1 Introduction\n\nHumans learn new concepts with very little supervision \u2013 e.g. a child can generalize the concept\nof \u201cgiraffe\u201d from a single picture in a book \u2013 yet our best deep learning systems need hundreds or\nthousands of examples. This motivates the setting we are interested in: \u201cone-shot\u201d learning, which\nconsists of learning a class from a single labelled example.\nDeep learning has made major advances in areas such as speech [7], vision [13] and language [16],\nbut is notorious for requiring large datasets. Data augmentation and regularization techniques alleviate\noverfitting in low data regimes, but do not solve it. Furthermore, learning is still slow and based on\nlarge datasets, requiring many weight updates using stochastic gradient descent. This, in our view, is\nmostly due to the parametric aspect of the model, in which training examples need to be slowly learnt\nby the model into its parameters.\nIn contrast, many non-parametric models allow novel examples to be rapidly assimilated, whilst not\nsuffering from catastrophic forgetting. Some models in this family (e.g., nearest neighbors) do not\nrequire any training but performance depends on the chosen metric [1]. 
Previous work on metric\nlearning in non-parametric setups [18] has been influential on our model, and we aim to incorporate\nthe best characteristics from both parametric and non-parametric models \u2013 namely, rapid acquisition\nof new examples while providing excellent generalisation from common examples.\nThe novelty of our work is twofold: at the modeling level, and in the training procedure. First, we propose\nMatching Nets, a neural network which uses recent advances in attention and memory that enable\nrapid learning. Secondly, our training procedure is based on a simple machine learning principle:\ntest and train conditions must match. Thus to train our network to do rapid learning, we train it by\nshowing only a few examples per class, switching the task from minibatch to minibatch, much like\nhow it will be tested when presented with a few examples of a new task.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Matching Networks architecture\n\nBesides defining a model and training criterion amenable to one-shot learning,\nwe contribute two new tasks that can be used to benchmark other approaches on\nboth ImageNet and small scale language modeling. We hope that our results will encourage others to\nwork on this challenging problem.\nWe organized the paper by first defining and explaining our model whilst linking its several components\nto related work. Then in the following section we briefly elaborate on some of the related work\nto the task and our model. In Section 4 we describe both our general setup and the experiments we\nperformed, demonstrating strong results on one-shot learning on a variety of tasks and setups.\n\n2 Model\n\nOur non-parametric approach to solving one-shot learning is based on two components which we\ndescribe in the following subsections. 
First, our model architecture follows recent advances in neural\nnetworks augmented with memory (as discussed in Section 3). Given a (small) support set S, our\nmodel defines a function c_S (or classifier) for each S, i.e. a mapping S \u2192 c_S(.). Second, we employ\na training strategy which is tailored for one-shot learning from the support set S.\n\n2.1 Model Architecture\n\nIn recent years, many groups have investigated ways to augment neural network architectures with\nexternal memories and other components that make them more \u201ccomputer-like\u201d. We draw inspiration\nfrom models such as sequence to sequence (seq2seq) with attention [2], memory networks [29] and\npointer networks [27].\nIn all these models, a neural attention mechanism, often fully differentiable, is defined to access (or\nread) a memory matrix which stores useful information to solve the task at hand. Typical uses of\nthis include machine translation, speech recognition, or question answering. More generally, these\narchitectures model P(B|A) where A and/or B can be a sequence (like in seq2seq models), or, more\ninterestingly for us, a set [26].\nOur contribution is to cast the problem of one-shot learning within the set-to-set framework [26].\nThe key point is that when trained, Matching Networks are able to produce sensible test labels for\nunobserved classes without any changes to the network. More precisely, we wish to map from a\n(small) support set of k examples of input-label pairs S = {(x_i, y_i)}_{i=1}^{k} to a classifier c_S(\u02c6x) which,\ngiven a test example \u02c6x, defines a probability distribution over outputs \u02c6y. Here, \u02c6x could be an image,\nand \u02c6y a distribution over possible visual classes. We define the mapping S \u2192 c_S(\u02c6x) to be P(\u02c6y|\u02c6x, S)\nwhere P is parameterised by a neural network. 
Thus, when given a new support set of examples S'\nfrom which to one-shot learn, we simply use the parametric neural network defined by P to make\npredictions about the appropriate label distribution \u02c6y for each test example \u02c6x: P(\u02c6y|\u02c6x, S').\nOur model in its simplest form computes a probability over \u02c6y as follows:\n\nP(\u02c6y|\u02c6x, S) = \u03a3_{i=1}^{k} a(\u02c6x, x_i) y_i    (1)\n\nwhere x_i, y_i are the inputs and corresponding label distributions from the support set S =\n{(x_i, y_i)}_{i=1}^{k}, and a is an attention mechanism which we discuss below. Note that eq. 1 essentially\ndescribes the output for a new class as a linear combination of the labels in the support set.\nWhen the attention mechanism a is a kernel on X \u00d7 X, (1) is akin to a kernel density estimator.\nWhen the attention mechanism is zero for the b furthest x_i from \u02c6x according to some distance\nmetric and an appropriate constant otherwise, (1) is equivalent to \u2018k \u2212 b\u2019-nearest neighbours\n(although this requires an extension to the attention mechanism that we describe in Section 2.1.2).\nThus (1) subsumes both KDE and kNN methods. Another view of (1) is where a acts as an attention\nmechanism and the y_i act as values bound to the corresponding keys x_i, much like a hash table. In\nthis case we can understand this as a particular kind of associative memory where, given an input,\nwe \u201cpoint\u201d to the corresponding example in the support set, retrieving its label. Hence the functional\nform defined by the classifier c_S(\u02c6x) is very flexible and can adapt easily to any new support set.\n\n2.1.1 The Attention Kernel\n\nEquation 1 relies on choosing a(., .), the attention mechanism, which fully specifies the classifier. 
The simplest form that this takes (and which has very tight relationships with common\nattention models and kernel functions) is to use the softmax over the cosine distance c, i.e.,\n\na(\u02c6x, x_i) = e^{c(f(\u02c6x), g(x_i))} / \u03a3_{j=1}^{k} e^{c(f(\u02c6x), g(x_j))}\n\nwith embedding functions f and g being appropriate\nneural networks (potentially with f = g) to embed \u02c6x and x_i. In our experiments we shall see\nexamples where f and g are parameterised variously as deep convolutional networks for image\ntasks (as in VGG [22] or Inception [24]) or a simple form of word embedding for language tasks (see\nSection 4).\nWe note that, though related to metric learning, the classifier defined by Equation 1 is discriminative.\nFor a given support set S and sample to classify \u02c6x, it is enough for \u02c6x to be sufficiently aligned with\npairs (x', y') \u2208 S such that y' = y and misaligned with the rest. This kind of loss is also related to\nmethods such as Neighborhood Component Analysis (NCA) [18], triplet loss [9] or large margin\nnearest neighbor [28].\nHowever, the objective that we are trying to optimize is precisely aligned with multi-way, one-shot\nclassification, and thus we expect it to perform better than its counterparts. Additionally, the loss is\nsimple and differentiable so that one can find the optimal parameters in an \u201cend-to-end\u201d fashion.\n\n2.1.2 Full Context Embeddings\n\nThe main novelty of our model lies in reinterpreting a well studied framework (neural networks with\nexternal memories) to do one-shot learning. Closely related to metric learning, the embedding functions\nf and g act as a lift to feature space X to achieve maximum accuracy through the classification\nfunction described in eq. 
1.\nDespite the fact that the classification strategy is fully conditioned on the whole support set through\nP(.|\u02c6x, S), the embeddings on which we apply the cosine similarity to \u201cattend\u201d, \u201cpoint\u201d or simply\ncompute the nearest neighbor are myopic in the sense that each element x_i gets embedded by g(x_i)\nindependently of other elements in the support set S. Furthermore, S should be able to modify how\nwe embed the test image \u02c6x through f.\nWe propose embedding the elements of the set through a function which takes as input the full set\nS in addition to x_i, i.e. g becomes g(x_i, S). Thus, as a function of the whole support set S, g can\nmodify how to embed x_i. This could be useful when some element x_j is very close to x_i, in which\ncase it may be beneficial to change the function with which we embed x_i \u2013 some evidence of this is\ndiscussed in Section 4. We use a bidirectional Long Short-Term Memory (LSTM) [8] to encode x_i in\nthe context of the support set S, considered as a sequence (see Appendix).\nThe second issue, making f depend on both \u02c6x and S, can be addressed via an LSTM with read-attention over\nthe whole set S, whose inputs are equal to f'(\u02c6x) (f' is an embedding function, e.g. a CNN). To do\nso, we define the following recurrence over \u201cprocessing\u201d steps k, following work from [26]:\n\n\u02c6h_k, c_k = LSTM(f'(\u02c6x), [h_{k-1}, r_{k-1}], c_{k-1})    (2)\n\nh_k = \u02c6h_k + f'(\u02c6x)    (3)\n\nr_{k-1} = \u03a3_{i=1}^{|S|} a(h_{k-1}, g(x_i)) g(x_i)    (4)\n\na(h_{k-1}, g(x_i)) = e^{h_{k-1}^T g(x_i)} / \u03a3_{j=1}^{|S|} e^{h_{k-1}^T g(x_j)}    (5)\n\nnoting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input,\nh the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as \u201ccontent\u201d\nbased attention. 
We do K steps of \u201creads\u201d, so f(\u02c6x, S) = h_K where h_k is as described in eq. 3.\n\n2.2 Training Strategy\n\nIn the previous subsection we described Matching Networks which map a support set to a classification\nfunction, S \u2192 c(\u02c6x). We achieve this via a modification of the set-to-set paradigm augmented with\nattention, with the resulting mapping being of the form P_\u03b8(.|\u02c6x, S), noting that \u03b8 are the parameters\nof the model (i.e. of the embedding functions f and g described previously).\nThe training procedure has to be chosen carefully so as to match inference at test time. Our model\nhas to perform well with support sets S' which contain classes never seen during training.\nMore specifically, let us define a task T as a distribution over possible label sets L. Typically we\nconsider T to uniformly weight all data sets of up to a few unique classes (e.g., 5), with a few\nexamples per class (e.g., up to 5). In this case, a label set L sampled from a task T, L \u223c T, will\ntypically have 5 to 25 examples.\nTo form an \u201cepisode\u201d to compute gradients and update our model, we first sample L from T (e.g.,\nL could be the label set {cats, dogs}). We then use L to sample the support set S and a batch B\n(i.e., both S and B are labelled examples of cats and dogs). The Matching Net is then trained to\nminimise the error predicting the labels in the batch B conditioned on the support set S. This is a\nform of meta-learning since the training procedure explicitly learns to learn from a given support set\nto minimise a loss over a batch. More precisely, the Matching Nets training objective is as follows:\n\n\u03b8 = arg max_\u03b8 E_{L\u223cT} [ E_{S\u223cL, B\u223cL} [ \u03a3_{(x,y)\u2208B} log P_\u03b8(y|x, S) ] ].    (6)\n\nTraining \u03b8 with eq. 
6 yields a model which works well when sampling S' \u223c T' from a different\ndistribution of novel labels. Crucially, our model does not need any fine tuning on the classes it has\nnever seen due to its non-parametric nature. Obviously, as T' diverges far from the T from which we\nsampled to learn \u03b8, the model will not work \u2013 we discuss this point further in Section 4.1.2.\n\n3 Related Work\n\n3.1 Memory Augmented Neural Networks\n\nA recent surge of models which go beyond \u201cstatic\u201d classification of fixed vectors onto their classes\nhas reshaped current research and industrial applications alike. This is most notable in the massive\nadoption of LSTMs [8] in a variety of tasks such as speech [7], translation [23, 2] or learning programs\n[4, 27]. A key component which allowed for more expressive models was the introduction of \u201ccontent\u201d\nbased attention in [2], and \u201ccomputer-like\u201d architectures such as the Neural Turing Machine [4] or\nMemory Networks [29]. Our work takes the meta-learning paradigm of [21], where an LSTM learnt\nto learn quickly from data presented sequentially, but we treat the data as a set. The one-shot learning\ntask we defined on the Penn Treebank [15] relates to evaluation techniques and models presented in\n[6], and we discuss this in Section 4.\n\n3.2 Metric Learning\n\nAs discussed in Section 2, there are many links between content based attention, kernel based nearest\nneighbor and metric learning [1]. The most relevant work is Neighborhood Component Analysis\n(NCA) [18], and the follow up non-linear version [20]. The loss is very similar to ours, except we\nuse the whole support set S instead of pair-wise comparisons which is more amenable to one-shot\nlearning. Follow-up work in the form of deep convolutional siamese [11] networks included much\nmore powerful non-linear mappings. 
Other losses which include the notion of a set (but use less\npowerful metrics) were proposed in [28].\nLastly, the work in one-shot learning in [14] was inspirational and also provided us with the invaluable\nOmniglot dataset \u2013 referred to as the \u201ctranspose\u201d of MNIST. Other works used zero-shot learning on\nImageNet, e.g. [17]. However, there is not much one-shot literature on ImageNet, which we hope to\namend via our benchmark and task definitions in the following section.\n\n4 Experiments\n\nIn this section we describe the results of many experiments, comparing our Matching Networks\nmodel against strong baselines. All of our experiments revolve around the same basic task: an N-way\nk-shot learning task. Each method is provided with a set of k labelled examples from each of N\nclasses that have not previously been trained upon. The task is then to classify a disjoint batch of\nunlabelled examples into one of these N classes. Thus random performance on this task stands at\n1/N. We compared a number of alternative models, as baselines, to Matching Networks.\nLet L' denote the held-out subset of labels which we only use for one-shot. Unless otherwise specified,\ntraining is always on \u2260L', and testing is in one-shot mode on L'.\nWe ran one-shot experiments on three data sets: two image classification sets (Omniglot [14] and\nImageNet [19, ILSVRC-2012]) and one language modeling set (Penn Treebank). The experiments on\nthe three data sets comprise a diverse set of qualities in terms of complexity, sizes, and modalities.\n\n4.1 Image Classification Results\n\nFor vision problems, we considered four kinds of baselines: matching on raw pixels, matching on\ndiscriminative features from a state-of-the-art classifier (Baseline Classifier), MANN [21], and our\nreimplementation of the Convolutional Siamese Net [11]. 
The baseline classifier was trained to\nclassify an image into one of the original classes present in the training data set, but excluding the\nN classes so as not to give it an unfair advantage (i.e., trained to classify classes in \u2260L'). We then\ntook this network and used the features from the last layer (before the softmax) for nearest neighbour\nmatching, a strategy commonly used in computer vision [3] which has achieved excellent results\nacross many tasks. Following [11], the convolutional siamese nets were trained on a same-or-different\ntask of the original training data set and then the last layer was used for nearest neighbour matching.\nWe also tried further fine tuning the features using only the support set S' sampled from L'. This\nyields massive overfitting, but given that our networks are highly regularized, can yield extra gains.\nNote that, even when fine tuning, the setup is still one-shot, as only a single example per class from\nL' is used.\n\nTable 1: Results on the Omniglot dataset.\n\nModel | Matching Fn | Fine Tune | 5-way 1-shot Acc | 5-way 5-shot Acc | 20-way 1-shot Acc | 20-way 5-shot Acc\nPIXELS | Cosine | N | 41.7% | 63.2% | 26.7% | 42.6%\nBASELINE CLASSIFIER | Cosine | N | 80.0% | 95.0% | 69.5% | 89.1%\nBASELINE CLASSIFIER | Cosine | Y | 82.3% | 98.4% | 70.6% | 92.0%\nBASELINE CLASSIFIER | Softmax | Y | 86.0% | 97.6% | 72.9% | 92.3%\nMANN (NO CONV) [21] | Cosine | N | 82.8% | 94.9% | \u2013 | \u2013\nCONVOLUTIONAL SIAMESE NET [11] | Cosine | N | 96.7% | 98.4% | 88.0% | 96.5%\nCONVOLUTIONAL SIAMESE NET [11] | Cosine | Y | 97.3% | 98.4% | 88.1% | 97.0%\nMATCHING NETS (OURS) | Cosine | N | 98.1% | 98.9% | 93.8% | 98.5%\nMATCHING NETS (OURS) | Cosine | Y | 97.9% | 98.7% | 93.5% | 98.7%\n\n4.1.1 Omniglot\n\nOmniglot [14] consists of 1623 characters from 50 different alphabets. Each of these was hand drawn\nby 20 different people. 
The large number of classes (characters) with relatively few data per class\n(20), makes this an ideal data set for testing small-scale one-shot classi\ufb01cation. The N-way Omniglot\ntask setup is as follows: pick N unseen character classes, independent of alphabet, as L. Provide\nthe model with one drawing of each of the N characters as S \u223c L and a batch B \u223c L. Following\n[21], we augmented the data set with random rotations by multiples of 90 degrees and used 1200\ncharacters for training, and the remaining character classes for evaluation.\nWe used a simple yet powerful CNN as the embedding function \u2013 consisting of a stack of modules,\neach of which is a 3 \u00d7 3 convolution with 64 \ufb01lters followed by batch normalization [10], a Relu\nnon-linearity and 2 \u00d7 2 max-pooling. We resized all the images to 28 \u00d7 28 so that, when we stack 4\nmodules, the resulting feature map is 1 \u00d7 1 \u00d7 64, resulting in our embedding function f (x). A fully\nconnected layer followed by a softmax non-linearity is used to de\ufb01ne the Baseline Classi\ufb01er.\nResults comparing the baselines to our model on Omniglot are shown in Table 1. For both 1-shot\nand 5-shot, 5-way and 20-way, our model outperforms the baselines. There are no major surprises in\nthese results: using more examples for k-shot classi\ufb01cation helps all models, and 5-way is easier than\n20-way. We note that the Baseline Classi\ufb01er improves a bit when \ufb01ne tuning on S(cid:48), and using cosine\ndistance versus training a small softmax from the small training set (thus requiring \ufb01ne tuning) also\nperforms well. Siamese nets fare well versus our Matching Nets when using 5 examples per class,\nbut their performance degrades rapidly in one-shot. 
Full Context Embeddings (FCE) did not\nseem to help much and were left out of the table due to space constraints.\nLike the authors in [11], we also test our method trained on Omniglot on a completely disjoint task \u2013\none-shot, 10 way MNIST classification. The Baseline Classifier achieves about 63% accuracy whereas\n(as reported in their paper) the Siamese Nets achieve 70%. Our model achieves 72%.\n\n4.1.2 ImageNet\n\nOur experiments followed the same setup as Omniglot for testing, but we considered a rand and a\ndogs (harder) setup. In the rand setup, we removed 118 labels at random from the training set, then\ntested only on these 118 classes (which we denote as Lrand). For the dogs setup, we removed all\nclasses in ImageNet descended from dogs (totalling 118) and trained on all non-dog classes, then\ntested on dog classes (Ldogs). ImageNet is a notoriously large data set, and running experiments on it\ncan be quite a feat of engineering and infrastructure, requiring many resources. Thus, as well as\nusing the full ImageNet data set, we devised a new data set \u2013 miniImageNet \u2013 consisting of 60,000\ncolour images of size 84 \u00d7 84 with 100 classes, each having 600 examples. This dataset is more\ncomplex than CIFAR10 [12], but fits in memory on modern machines, making it very convenient for\nrapid prototyping and experimentation. We used 80 classes for training and tested on the remaining\n20 classes. In total, thus, we have randImageNet, dogsImageNet, and miniImageNet.\nThe results of the miniImageNet experiments are shown in Table 2. As with Omniglot, Matching\nNetworks outperform the baselines. However, miniImageNet is a much harder task than Omniglot\nwhich allowed us to evaluate Full Context Embeddings (FCE) sensibly (on Omniglot it made no\ndifference). 
As we can see, FCE improves the performance of Matching Networks, with and without\nfine tuning, typically improving performance by around two percentage points.\nNext we turned to experiments based upon full size, full scale ImageNet. Our baseline classifier for\nthis data set was Inception [25] trained to classify on all classes except those in the test set of classes\n(for randImageNet) or those concerning dogs (for dogsImageNet). We also compared to features from\nan Inception Oracle classifier trained on all classes in ImageNet, as an upper bound. Our Baseline\nClassifier is one of the strongest published ImageNet models at 79% top-1 accuracy on the standard\nImageNet validation set. Instead of training Matching Networks from scratch on these large tasks, we\ninitialised their feature extractors f and g with the parameters from the Inception classifier (pretrained\non the appropriate subset of the data) and then further trained the resulting network on random 5-way\n1-shot tasks from the training data set, incorporating Full Context Embeddings and our Matching\nNetworks and training strategy.\n\nFigure 2: Example of two 5-way problem instances on ImageNet. The images in the set S' contain\nclasses never seen during training. 
Our model makes far fewer mistakes than the Inception baseline.\n\nTable 2: Results on miniImageNet.\n\nModel | Matching Fn | Fine Tune | 5-way 1-shot Acc | 5-way 5-shot Acc\nPIXELS | Cosine | N | 23.0% | 26.6%\nBASELINE CLASSIFIER | Cosine | N | 36.6% | 46.0%\nBASELINE CLASSIFIER | Cosine | Y | 36.2% | 52.2%\nBASELINE CLASSIFIER | Softmax | Y | 38.4% | 51.2%\nMATCHING NETS (OURS) | Cosine | N | 41.2% | 56.2%\nMATCHING NETS (OURS) | Cosine | Y | 42.4% | 58.0%\nMATCHING NETS (OURS) | Cosine (FCE) | N | 44.2% | 57.0%\nMATCHING NETS (OURS) | Cosine (FCE) | Y | 46.6% | 60.0%\n\nThe results of the randImageNet and dogsImageNet experiments are shown in Table 3. 
The Inception\nOracle (trained on all classes) performs almost perfectly when restricted to 5 classes only, which is\nnot too surprising given its impressive top-1 accuracy. When trained solely on \u2260Lrand, Matching\nNets improve upon Inception by almost 6 percentage points when tested on Lrand, roughly halving the errors. Figure 2 shows\ntwo instances of 5-way one-shot learning, where Inception fails. Looking at all the errors, Inception\nappears to sometimes prefer an image above all others (these images tend to be cluttered like the\nexample in the second column, or more constant in color). Matching Nets, on the other hand, manage\nto recover from these outliers that sometimes appear in the support set S'.\nMatching Nets manage to improve upon Inception on the complementary subset \u2260Ldogs (although\nthis setup is not one-shot, as the feature extraction has been trained on these labels). However, on the\nmuch more challenging Ldogs subset, our model degrades by 1 percentage point. We attribute this to the fact\nthat the sampled set during training, S, comes from a random distribution of labels (from \u2260Ldogs),\nwhereas the testing support set S' from Ldogs contains similar classes, more akin to fine grained\nclassification. Thus, we believe that if we adapted our training strategy to sample S from fine grained\nsets of labels instead of sampling uniformly from the leaves of the ImageNet class tree, improvements\ncould be attained. We leave this as future work.\n\nTable 3: Results on full ImageNet on rand and dogs one-shot tasks. 
Note that \u2260Lrand and \u2260Ldogs\nare sets of classes which are seen during training, but are provided for completeness.\n\nImageNet 5-way 1-shot Acc\nModel | Matching Fn | Fine Tune | Lrand | \u2260Lrand | Ldogs | \u2260Ldogs\nPIXELS | Cosine | N | 42.0% | 42.8% | 41.4% | 43.0%\nINCEPTION CLASSIFIER | Cosine | N | 87.6% | 92.6% | 59.8% | 90.0%\nMATCHING NETS (OURS) | Cosine (FCE) | N | 93.2% | 97.0% | 58.8% | 96.4%\nINCEPTION ORACLE | Softmax (Full) | Y (Full) | \u2248 99% | \u2248 99% | \u2248 99% | \u2248 99%\n\n4.1.3 One-Shot Language Modeling\n\nWe also introduce a new one-shot language task which is analogous to those examined for images.\nThe task is as follows: given a query sentence with a missing word in it, and a support set of sentences\nwhich each have a missing word and a corresponding 1-hot label, choose the label from the support\nset that best matches the query sentence. Here we show a single example (the missing word is marked _____), though note that the words\non the right are not provided and the labels for the set are given as 1-hot-of-5 vectors.\n1. an experimental vaccine can alter the immune response of people infected with the aids virus a _____ u.s. scientist said.    prominent\n2. the show one of five new nbc _____ is the second casualty of the three networks so far this fall.    series\n3. however since eastern first filed for chapter N protection march N it has consistently promised to pay creditors N cents on the _____.    dollar\n4. we had a lot of people who threw in the _____ today said ellis a partner in benjamin jacobson & sons a specialist in trading ual stock on the big board.    towel\n5. it\u2019s not easy to roll out something that _____ and make it pay mr. jacob says.    comprehensive\nQuery: in late new york trading yesterday the _____ was quoted at N marks down from N marks late friday and at N yen down from N yen late friday.    dollar\n\nSentences were taken from the Penn Treebank dataset [15]. 
On each trial, we make sure that the set\nand batch are populated with sentences that are non-overlapping. This means that we do not use\nwords with very low frequency counts; e.g. if there is only a single sentence for a given word we do\nnot use this data since the sentence would need to be in both the set and the batch. As with the image\ntasks, each trial consisted of a 5 way choice between the classes available in the set. We used a batch\nsize of 20 throughout the sentence matching task and varied the set size across k=1,2,3. We ensured\nthat the same number of sentences were available for each class in the set. We split the words into a\nrandomly sampled 9000 for training and 1000 for testing, and we used the standard test set to report\nresults. Thus, neither the words nor the sentences used during test time had been seen during training.\nWe compared our one-shot matching model to an oracle LSTM language model (LSTM-LM) [30]\ntrained on all the words. In this setup, the LSTM has an unfair advantage as it is not doing one-shot\nlearning but seeing all the data \u2013 thus, this should be taken as an upper bound. To do so, we examined\na similar setup wherein a sentence was presented to the model with a single word \ufb01lled in with 5\ndifferent possible words (including the correct answer). For each of these 5 sentences the model gave\na log-likelihood and the max of these was taken to be the choice of the model.\nAs with the other 5 way choice tasks, chance performance on this task was 20%. The LSTM language\nmodel oracle achieved an upper bound of 72.8% accuracy on the test set. Matching Networks\nwith a simple encoding model achieve 32.4%, 36.1%, 38.2% accuracy on the task with k = 1, 2, 3\nexamples in the set, respectively. 
Future work should explore combining parametric models such as\nan LSTM-LM with non-parametric components such as the Matching Networks explored here.\nTwo related tasks are the CNN QA test of entity prediction from news articles [5], and the Children\u2019s\nBook Test (CBT) [6]. In the CBT for example, a sequence of sentences from a book are provided\nas context. In the \ufb01nal sentence one of the words, which has appeared in a previous sentence, is\nmissing. The task is to choose the correct word to \ufb01ll in this blank from a small set of words given\nas possible answers, all of which occur in the preceding sentences. In our sentence matching task\nthe sentences provided in the set are randomly drawn from the PTB corpus and are related to the\nsentences in the query batch only by the fact that they share a word. In contrast to CBT and CNN\ndataset, they provide only a generic rather than speci\ufb01c sequential context.\n5 Conclusion\nIn this paper we introduced Matching Networks, a new neural architecture that, by way of its\ncorresponding training regime, is capable of state-of-the-art performance on a variety of one-shot\nclassi\ufb01cation tasks. There are a few key insights in this work. Firstly, one-shot learning is much\neasier if you train the network to do one-shot learning. Secondly, non-parametric structures in a\nneural network make it easier for networks to remember and adapt to new training sets in the same\ntasks. Combining these observations together yields Matching Networks. Further, we have de\ufb01ned\nnew one-shot tasks on ImageNet, a reduced version of ImageNet (for rapid experimentation), and a\nlanguage modeling task. An obvious drawback of our model is the fact that, as the support set S grows\nin size, the computation for each gradient update becomes more expensive. Although there are sparse\nand sampling-based methods to alleviate this, much of our future efforts will concentrate around this\nlimitation. 
Further, as exemplified in the ImageNet dogs subtask, when the label distribution has\nobvious biases (such as being fine-grained), our model suffers. We feel this is an area with exciting\nchallenges which we hope to keep improving in future work.\n\nAcknowledgements\n\nWe would like to thank Nal Kalchbrenner for brainstorming around the design of the function g, and\nSander Dieleman and Sergio Guadarrama for their help setting up ImageNet. We would also like to\nthank Simon Osindero for useful discussions around the tasks discussed in this paper, and Theophane\nWeber and Remi Munos for following some early developments. Karen Simonyan and David Silver\nhelped with the manuscript, as well as many at Google DeepMind. Thanks also to Geoff Hinton and\nAlex Toshev for discussions about our results, and to the anonymous reviewers for great suggestions.\n\nReferences\n[1] C Atkeson, A Moore, and S Schaal. Locally weighted learning. Artificial Intelligence Review, 1997.\n[2] D Bahdanau, K Cho, and Y Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2014.\n[3] J Donahue, Y Jia, O Vinyals, J Hoffman, N Zhang, E Tzeng, and T Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.\n[4] A Graves, G Wayne, and I Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.\n[5] K Hermann, T Kocisky, E Grefenstette, L Espeholt, W Kay, M Suleyman, and P Blunsom. Teaching machines to read and comprehend. In NIPS, 2015.\n[6] F Hill, A Bordes, S Chopra, and J Weston. The goldilocks principle: Reading children\u2019s books with explicit memory representations. arXiv preprint arXiv:1511.02301, 2015.\n[7] G Hinton et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 2012.\n[8] S Hochreiter and J Schmidhuber. Long short-term memory. 
Neural computation, 1997.\n[9] E Hoffer and N Ailon. Deep metric learning using triplet network. Similarity-Based Pattern Recognition, 2015.\n[10] S Ioffe and C Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n[11] G Koch, R Zemel, and R Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning workshop, 2015.\n[12] A Krizhevsky and G Hinton. Convolutional deep belief networks on cifar-10. Unpublished, 2010.\n[13] A Krizhevsky, I Sutskever, and G Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.\n[14] BM Lake, R Salakhutdinov, J Gross, and J Tenenbaum. One shot learning of simple visual concepts. In CogSci, 2011.\n[15] MP Marcus, MA Marcinkiewicz, and B Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 1993.\n[16] T Mikolov, M Karafi\u00e1t, L Burget, J \u010cernock\u00fd, and S Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010.\n[17] M Norouzi, T Mikolov, S Bengio, Y Singer, J Shlens, A Frome, G Corrado, and J Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.\n[18] S Roweis, G Hinton, and R Salakhutdinov. Neighbourhood component analysis. NIPS, 2004.\n[19] O Russakovsky, J Deng, H Su, J Krause, S Satheesh, S Ma, Z Huang, A Karpathy, A Khosla, M Bernstein, A Berg, and L Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.\n[20] R Salakhutdinov and G Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS, 2007.\n[21] A Santoro, S Bartunov, M Botvinick, D Wierstra, and T Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.\n[22] K Simonyan and A Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n[23] I Sutskever, O Vinyals, and QV Le. Sequence to sequence learning with neural networks. In NIPS, 2014.\n[24] C Szegedy, W Liu, Y Jia, P Sermanet, S Reed, D Anguelov, D Erhan, V Vanhoucke, and A Rabinovich. Going deeper with convolutions. In CVPR, 2015.\n[25] C Szegedy, V Vanhoucke, S Ioffe, J Shlens, and Z Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.\n[26] O Vinyals, S Bengio, and M Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.\n[27] O Vinyals, M Fortunato, and N Jaitly. Pointer networks. In NIPS, 2015.\n[28] K Weinberger and L Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 2009.\n[29] J Weston, S Chopra, and A Bordes. Memory networks. ICLR, 2014.\n[30] W Zaremba, I Sutskever, and O Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.\n", "award": [], "sourceid": 1804, "authors": [{"given_name": "Oriol", "family_name": "Vinyals", "institution": "Google"}, {"given_name": "Charles", "family_name": "Blundell", "institution": "DeepMind"}, {"given_name": "Timothy", "family_name": "Lillicrap", "institution": "Google DeepMind"}, {"given_name": "koray", "family_name": "kavukcuoglu", "institution": "Google DeepMind"}, {"given_name": "Daan", "family_name": "Wierstra", "institution": "Google DeepMind"}]}