{"title": "Learning feed-forward one-shot learners", "book": "Advances in Neural Information Processing Systems", "page_first": 523, "page_last": 531, "abstract": "One-shot learning is usually tackled by using generative models or discriminative embeddings. Discriminative methods based on deep learning, which are very effective in other learning scenarios, are ill-suited for one-shot learning as they need large amounts of training data. In this paper, we propose a method to learn the parameters of a deep model in one shot. We construct the learner as a second deep network, called a learnet, which predicts the parameters of a pupil network from a single exemplar. In this manner we obtain an efficient feed-forward one-shot learner, trained end-to-end by minimizing a one-shot classification objective in a learning to learn formulation. In order to make the construction feasible, we propose a number of factorizations of the parameters of the pupil network. We demonstrate encouraging results by learning characters from single exemplars in Omniglot, and by tracking visual objects from a single initial exemplar in the Visual Object Tracking benchmark.", "full_text": "Learning feed-forward one-shot learners\n\nLuca Bertinetto\u2217\nUniversity of Oxford\n\nluca@robots.ox.ac.uk\n\nJo\u00e3o F. Henriques\u2217\nUniversity of Oxford\n\njoao@robots.ox.ac.uk\n\nJack Valmadre\u2217\nUniversity of Oxford\n\njvlmdr@robots.ox.ac.uk\n\nPhilip H. S. Torr\nUniversity of Oxford\n\nphilip.torr@eng.ox.ac.uk\n\nAbstract\n\nAndrea Vedaldi\n\nUniversity of Oxford\n\nvedaldi@robots.ox.ac.uk\n\nOne-shot learning is usually tackled by using generative models or discriminative\nembeddings. Discriminative methods based on deep learning, which are very\neffective in other learning scenarios, are ill-suited for one-shot learning as they\nneed large amounts of training data. In this paper, we propose a method to learn the\nparameters of a deep model in one shot. 
We construct the learner as a second deep network, called a learnet, which predicts the parameters of a pupil network from a single exemplar. In this manner we obtain an efficient feed-forward one-shot learner, trained end-to-end by minimizing a one-shot classification objective in a learning to learn formulation. In order to make the construction feasible, we propose a number of factorizations of the parameters of the pupil network. We demonstrate encouraging results by learning characters from single exemplars in Omniglot, and by tracking visual objects from a single initial exemplar in the Visual Object Tracking benchmark.

1 Introduction

Deep learning methods have taken areas such as computer vision, natural language processing and speech recognition by storm. One of their key strengths is the ability to leverage large quantities of labelled data and extract meaningful and powerful representations from it. However, this capability is also one of their most significant limitations, since using large datasets to train deep neural networks is not just an option, but a necessity. It is well known, in fact, that these models are prone to overfitting. Thus, deep networks seem less useful when the goal is to learn a new concept on the fly, from a few or even a single example, as in one-shot learning. These problems are usually tackled by using generative models [18, 13] or, in a discriminative setting, using ad-hoc solutions such as exemplar support vector machines (SVMs) [14]. Perhaps the most common discriminative approach to one-shot learning is to learn off-line a deep embedding function and then to define on-line simple classification rules such as nearest neighbors in the embedding space [5, 16]. 
However, computing an embedding is a far cry from learning a model of the new object.

In this paper, we take a very different approach and ask whether we can induce, from a single supervised example, a full, deep discriminative model to recognize other instances of the same object class. Furthermore, we do not want our solution to require a lengthy optimization process, but to be computable on-the-fly, efficiently and in one go. We formulate this problem as that of learning a deep neural network, called a learnet, that, given a single exemplar of a new object class, predicts the parameters of a second network that can recognize other objects of the same type.

∗ The first three authors contributed equally, and are listed in alphabetical order.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Our model has several elements of interest. Firstly, if we consider learning to be any process that maps a set of images to the parameters of a model, then it can be seen as a "learning to learn" approach. Clearly, learning from a single exemplar is only possible given sufficient prior knowledge on the learning domain. This prior knowledge is incorporated in the learnet in an off-line phase by solving millions of small one-shot learning tasks and back-propagating errors end-to-end. Secondly, our learnet provides a feed-forward learning algorithm that extracts from the available exemplar the final model parameters in one go. This is different from iterative approaches such as exemplar SVMs or complex inference processes in generative modeling. It also demonstrates that deep neural networks can learn at the "meta-level" of predicting filter parameters for a second network, which we consider to be an interesting result in its own right. 
Thirdly, our method provides a competitive, efficient, and practical way of performing one-shot learning using discriminative methods.

1.1 Related work

Our work is related to several others in the literature. However, we believe we are the first to look at methods that can learn the parameters of complex discriminative models in one shot.

One-shot learning has been widely studied in the context of generative modeling, which unlike our work is often not focused on solving discriminative tasks. One very recent example is by Rezende et al. [18], which uses a recurrent spatial attention model to generate images, and learns by optimizing a measure of reconstruction error using variational inference [9]. They demonstrate results by sampling images of novel classes from this generative model, not by solving discriminative tasks. Another notable work is by Lake et al. [13], which instead uses a probabilistic program as a generative model. This model constructs written characters as compositions of pen strokes, so although more general programs can be envisioned, they demonstrate it only on Optical Character Recognition (OCR) applications.

A different approach to one-shot learning is to learn an embedding space, which is typically done with a siamese network [2]. Given an exemplar of a novel category, classification is performed in the embedding space by a simple rule such as nearest-neighbor. Training is usually performed by classifying pairs according to distance [5], or by enforcing a distance ranking with a triplet loss [16].

Our work departs from the paradigms of generative modeling and similarity learning, instead predicting the parameters of a neural network from a single exemplar image. 
It can be seen as a network that effectively "learns to learn", generalizing across tasks defined by different exemplars.

The idea of parameter prediction was, to our knowledge, first explored by Schmidhuber [20] in a recurrent architecture with one network that modifies the weights of another. Parameter prediction has also been used for zero-shot learning (as opposed to one-shot learning), which is the related problem of learning a new object class without a single example image, based solely on a description such as binary attributes or text. Whereas zero-shot learning is usually framed as a modality transfer problem and solved through transfer learning [21], Noh et al. [15] recently employed parameter prediction to induce the weights of an image classifier from text for the problem of visual question answering.

Denil et al. [4] investigated the redundancy of neural network parameters, showing that it is possible to linearly predict as many as 95% of the parameters in a layer given the remaining 5%. This is a vastly different proposition from ours, which is to predict all of the parameters of a layer given an external exemplar image, and to do so non-linearly.

2 One-shot learning as dynamic parameter prediction

Since we consider one-shot learning as a discriminative task, our starting point is standard discriminative learning. It generally consists of finding the parameters W that minimize the average loss L of a predictor function φ(x; W), computed over a dataset of n samples x_i and corresponding labels ℓ_i:

min_W (1/n) Σ_{i=1}^n L(φ(x_i; W), ℓ_i).    (1)

Unless the model space is very small, generalization also requires constraining the choice of model, usually via regularization. 
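For concreteness, the average-loss objective of eq. (1) can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's setup: the predictor is a plain linear map and the loss is squared error, both chosen only to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: n samples x_i with scalar labels l_i.
n, d = 50, 10
X = rng.standard_normal((n, d))
labels = rng.standard_normal(n)

def phi(x, W):
    # A linear predictor; in the paper, phi is a deep network.
    return W @ x

def objective(W):
    # The average loss of eq. (1), here with a squared loss L.
    return np.mean([(phi(X[i], W) - labels[i]) ** 2 for i in range(n)])

W = np.zeros(d)
print(objective(W))  # loss of the untrained predictor
```

Minimizing this objective over W (e.g. with SGD) is the standard discriminative learning that the rest of the section contrasts with the one-shot setting.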
However, in the extreme case in which the goal is to learn W from a single exemplar z of the class of interest, called one-shot learning, even regularization may be insufficient and additional prior information must be injected into the learning process. The main challenge in discriminative one-shot learning is to find a mechanism to incorporate domain-specific information in the learner, i.e. learning to learn. Another challenge, which is of practical importance in applications of one-shot learning, is to avoid a lengthy optimization process such as eq. (1).

Figure 1: Our proposed architectures predict the parameters of a network from a single example, replacing static convolutions (green) with dynamic convolutions (red). The siamese learnet predicts the parameters of an embedding function that is applied to both inputs, whereas the single-stream learnet predicts the parameters of a function that is applied to the other input. Linear layers are denoted by ∗ and nonlinear layers by σ. Dashed connections represent parameter sharing.

We propose to address both challenges by learning the parameters W of the predictor from a single exemplar z using a meta-prediction process, i.e. a non-iterative feed-forward function ω that maps (z; W′) to W. Since in practice this function will be implemented using a deep neural network, we call it a learnet. The learnet depends on the exemplar z, which is a single representative of the class of interest, and contains parameters W′ of its own. Learning to learn can now be posed as the problem of optimizing the learnet meta-parameters W′ using an objective function defined below. 
Furthermore, the feed-forward learnet evaluation is much faster than solving the optimization problem (1).

In order to train the learnet, we require it to produce good predictors given any possible exemplar z, which is empirically evaluated as an average over n training samples z_i:

min_{W′} (1/n) Σ_{i=1}^n L(φ(x_i; ω(z_i; W′)), ℓ_i).    (2)

In this expression, the performance of the predictor extracted by the learnet from the exemplar z_i is assessed on a single "validation" pair (x_i, ℓ_i), comprising another exemplar and its label ℓ_i. Hence, the training data consists of triplets (x_i, z_i, ℓ_i). Notice that the meaning of the label ℓ_i is subtly different from eq. (1), since the class of interest changes depending on the exemplar z_i: ℓ_i is positive when x_i and z_i belong to the same class and negative otherwise. Triplets are sampled uniformly with respect to these two cases. Importantly, the parameters of the original predictor φ of eq. (1) now change dynamically with each exemplar z_i.

Note that the training data is reminiscent of that of siamese networks [2], which also learn from labeled sample pairs. However, siamese networks apply the same model φ(x; W) with shared weights W to both x_i and z_i, and compute their inner product to produce a similarity score:

min_W (1/n) Σ_{i=1}^n L(⟨φ(x_i; W), φ(z_i; W)⟩, ℓ_i).    (3)

There are two key differences with our model. First, we treat x_i and z_i asymmetrically, which results in a different objective function. Second, and most importantly, the output of ω(z; W′) is used to parametrize linear layers that determine the intermediate representations in the network φ. This is significantly different from computing a single inner product in the last layer (eq. (3)).

Eq. 
(2) specifies the optimization objective of one-shot learning as dynamic parameter prediction. By application of the chain rule, backpropagating derivatives through the computational blocks of φ(x; W) and ω(z; W′) is no more difficult than through any other standard deep network. Nevertheless, when we dive into concrete implementations of such models we face a peculiar challenge, discussed next.

2.1 The challenge of naive parameter prediction

In order to analyse the practical difficulties of implementing a learnet, we will begin with one-shot prediction of a fully-connected layer, as it is simpler to analyse. This is given by

y = W x + b,    (4)

given an input x ∈ R^d, output y ∈ R^k, weights W ∈ R^{k×d} and biases b ∈ R^k.

Figure 2: Factorized convolutional layer (eq. (8)). The channels of the input x are projected to the factorized space by M (a 1 × 1 convolution), the resulting channels are convolved independently with a corresponding filter prediction from w(z), and finally projected back using M′.

We now replace the weights and biases with their functional counterparts, w(z) and b(z), representing two outputs of the learnet ω(z; W′) given the exemplar z ∈ R^m as input (to avoid clutter, we omit the implicit dependence on W′):

y = w(z)x + b(z).    (5)

While eq. (5) seems to be a drop-in replacement for linear layers, careful analysis reveals that it scales extremely poorly. The main cause is the unusually large output space of the learnet w : R^m → R^{k×d}. For a comparable number of input and output units in a linear layer (d ≈ k), the output space of the learnet grows quadratically with the number of units.

While this may seem to be a concern only for large networks, it is in fact a problem even for networks with few units. 
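To make the scaling concrete, the naive dynamic layer of eq. (5) can be sketched with a linear learnet w(z) = W′z (a hypothetical minimal example; the sizes match the 100-unit case discussed next):

```python
import numpy as np

# Naive dynamic linear layer (eq. (5)): y = w(z) x + b(z),
# with a linear learnet that predicts all k*d pupil weights from z.
m, d, k = 100, 100, 100          # exemplar features, input units, output units

# The learnet's own parameter matrix W' must map R^m -> R^(k*d):
learnet_params = m * k * d       # grows quadratically when d ~ k
print(learnet_params)            # 1,000,000 parameters for a 100-unit layer

rng = np.random.default_rng(0)
W_prime = rng.standard_normal((k * d, m)) * 0.01

def w(z):
    # Predict the full weight matrix of the pupil layer from the exemplar z.
    return (W_prime @ z).reshape(k, d)

z = rng.standard_normal(m)
x = rng.standard_normal(d)
y = w(z) @ x                     # dynamic layer output (bias term omitted)
print(y.shape)                   # (100,)
```

The factorizations introduced in the next sections exist precisely to avoid materializing a regressor with this k·d-dimensional output.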
Consider a simple linear learnet w(z) = W′z. Even for a very small fully-connected layer of only 100 units (d = k = 100), and an exemplar z with 100 features (m = 100), the learnet already contains 1M parameters that must be learned. Overfitting, as well as space and time costs, make learning such a regressor infeasible. Furthermore, reducing the number of features in the exemplar can only achieve a small constant-size reduction in the total number of parameters. The bottleneck is the quadratic size of the output space dk, not the size of the input space m.

2.2 Factorized linear layers

A simple way to reduce the size of the output space is to consider a factorized set of weights, replacing eq. (5) with:

y = M′ diag(w(z)) M x + b(z).    (6)

The product M′ diag(w(z)) M can be seen as a factorized representation of the weights, analogous to the Singular Value Decomposition. The matrix M ∈ R^{d×d} projects x into a space where the elements of w(z) represent disentangled factors of variation. The second projection M′ ∈ R^{k×d} maps the result back from this space.

Both M and M′ contain additional parameters to be learned, but they are modest in size compared to the case discussed in sect. 2.1. Importantly, the one-shot branch w(z) now only has to predict a set of diagonal elements (see eq. (6)), so its output space grows linearly with the number of units in the layer (i.e. w(z) : R^m → R^d).

2.3 Factorized convolutional layers

The factorization of eq. (6) can be generalized to convolutional layers as follows. 
Given an input tensor x ∈ R^{r×c×d}, weights W ∈ R^{f×f×d×k} (where f is the filter support size), and biases b ∈ R^k, the output y ∈ R^{r′×c′×k} of the convolutional layer is given by

y = W ∗ x + b,    (7)

where ∗ denotes convolution, and the biases b are applied to each of the k channels.

Projections analogous to M and M′ in eq. (6) can be incorporated in the filter bank in different ways and it is not obvious which one to pick. Here we take the view that M and M′ should disentangle the feature channels (i.e. the third dimension of x) so that the predicted filters w(z) can operate on each channel independently. As such, we consider the following factorization:

y = M′ ∗ w(z) ∗_d M ∗ x + b(z),    (8)

Figure 3: The predicted filters and the output of a dynamic convolutional layer in a single-stream learnet trained for the OCR task. Different exemplars z define different filters w(z). Applying the filters of each exemplar to the same input x yields different responses. Best viewed in colour.

Figure 4: The predicted filters and the output of a dynamic convolutional layer in a siamese learnet trained for the object tracking task. Best viewed in colour.

where M ∈ R^{1×1×d×d}, M′ ∈ R^{1×1×d×k}, and w(z) ∈ R^{f×f×d}. Convolution with subscript d denotes independent filtering of d channels, i.e. 
each channel of x ∗_d y is simply the convolution of the corresponding channel in x and y. In practice, this can be achieved with filter tensors that are diagonal in the third and fourth dimensions, or using d filter groups [12], each group containing a single filter. An illustration is given in fig. 2. The predicted filters w(z) can be interpreted as a filter basis, as described in the supplementary material (sec. A).

Notice that, under this factorization, the number of elements to be predicted by the one-shot branch w(z) is only f²d (the filter size f is typically very small, e.g. 3 or 5 [5, 23]). Without the factorization, it would be f²dk (the number of elements of W in eq. (7)). Similarly to the case of fully-connected layers (sect. 2.2), when d ≈ k this keeps the number of predicted elements from growing quadratically with the number of channels, allowing it to grow only linearly.

Examples of filters that are predicted by learnets are shown in figs. 3 and 4. The resulting activations confirm that the networks induced by different exemplars do indeed possess different internal representations of the same input.

3 Experiments

We evaluate learnets against baseline one-shot architectures (sect. 3.1) on two one-shot learning problems, in Optical Character Recognition (OCR; sect. 3.2) and visual object tracking (sect. 3.3). All experiments were performed using MatConvNet [22].

3.1 Architectures

As noted in sect. 2, the closest competitors to our method in discriminative one-shot learning are methods that learn embeddings with siamese architectures. Therefore, we structure the experiments to compare against this baseline. 
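As an illustration, the factorized layer of eq. (8) can be sketched in numpy. This is a simplified sketch with hypothetical sizes and a plain "valid" convolution; the actual implementation uses filter groups in MatConvNet, and w(z) would be produced by the learnet rather than sampled at random.

```python
import numpy as np

rng = np.random.default_rng(0)
r, c, d, k, f = 8, 8, 4, 6, 3    # input size, channels, output channels, filter size

def conv2d_valid(a, kern):
    # Plain single-channel 2D "valid" convolution (correlation, no flipping).
    out = np.empty((a.shape[0] - f + 1, a.shape[1] - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(a[i:i+f, j:j+f] * kern)
    return out

def factorized_conv(x, M, w_z, M_prime, b_z):
    # y = M' * (w(z) *_d (M * x)) + b(z)   -- eq. (8)
    x1 = np.einsum('rcd,ed->rce', x, M)                  # 1x1 conv: project channels
    x2 = np.stack([conv2d_valid(x1[:, :, ch], w_z[:, :, ch])
                   for ch in range(d)], axis=-1)         # depthwise conv, predicted filters
    return np.einsum('rcd,kd->rck', x2, M_prime) + b_z   # 1x1 conv: project back, add bias

x = rng.standard_normal((r, c, d))
M = rng.standard_normal((d, d))
w_z = rng.standard_normal((f, f, d))   # in the paper, predicted from the exemplar z
M_prime = rng.standard_normal((k, d))
b_z = rng.standard_normal(k)

y = factorized_conv(x, M, w_z, M_prime, b_z)
print(y.shape)            # (6, 6, 6): spatial (r-f+1, c-f+1) by k channels
print(f * f * d)          # 36 predicted elements, versus f*f*d*k = 216 unfactorized
```

Only the depthwise filters w_z depend on the exemplar; M and M′ are ordinary learned parameters, which is what keeps the learnet's output linear rather than quadratic in the number of channels.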
In particular, we choose to implement learnets using similar network topologies for a fairer comparison.

The baseline siamese architecture comprises two parallel streams φ(x; W) and φ(z; W) composed of a number of layers, such as convolution, max-pooling, and ReLU, sharing parameters W (fig. 1.a). The outputs of the two streams are compared by a layer Γ(φ(x; W), φ(z; W)) computing a measure of similarity or dissimilarity. We consider in particular: the dot product ⟨a, b⟩ between vectors a and b, the Euclidean distance ‖a − b‖, and the weighted ℓ1-norm ‖w ⊙ a − w ⊙ b‖_1, where w is a vector of learnable weights and ⊙ the Hadamard product.

The first modification to the siamese baseline is to use a learnet to predict some of the intermediate shared stream parameters (fig. 1.b). In this case W = ω(z; W′) and the siamese architecture becomes Γ(φ(x; ω(z; W′)), φ(z; ω(z; W′))). Note that the siamese parameters are still the same in the two streams, whereas the learnet is an entirely new subnetwork whose purpose is to map the exemplar image to the shared weights. We call this model the siamese learnet.

Table 1: Error rate for character recognition in foreign alphabets (chance is 95%).

Method                        Inner-product (%)   Euclidean dist. (%)   Weighted ℓ1 dist. (%)
Siamese (shared)                    48.5                37.3                  41.8
Siamese (unshared)                  47.0                41.0                  34.6
Siamese (unshared, fact.)           48.4                 –                    33.6
Siamese learnet (shared)            51.0                39.8                  31.4
Learnet                             43.7                36.7                  28.6
Modified Hausdorff distance                 43.2

The second modification is a single-stream learnet configuration, using only one stream φ of the siamese architecture and predicting its parameters using the learnet ω. 
In this case, the comparison block Γ is reinterpreted as the last layer of the stream φ (fig. 1.c). Note that: i) the single predicted stream and the learnet are asymmetric, with different parameters, and ii) the learnet predicts both the parameters of the final comparison layer Γ and the intermediate filter parameters.

The single-stream learnet architecture can be understood to predict a discriminant function from one example, and the siamese learnet architecture to predict an embedding function for the comparison of two images. These two variants demonstrate the versatility of the dynamic convolutional layer from eq. (6).

Finally, in order to ensure that any difference in performance is not simply due to the asymmetry of the learnet architecture or to the induced filter factorizations (sect. 2.2 and sect. 2.3), we also compare unshared siamese nets, which use distinct parameters for each stream, and factorized siamese nets, where convolutions are replaced by factorized convolutions as in the learnet.

3.2 Character recognition in foreign alphabets

This section describes our experiments in one-shot learning on OCR. For this, we use the Omniglot dataset [13], which contains images of handwritten characters from 50 different alphabets. These alphabets are divided into 30 background and 20 evaluation alphabets. The associated one-shot learning problem is to develop a method for determining whether, given any single exemplar of a character in an evaluation alphabet, any other image in that alphabet represents the same character or not. Importantly, all methods are trained using only background alphabets and tested on the evaluation alphabets.

Dataset and evaluation protocol. Character images are resized to 28 × 28 pixels in order to be able to explore efficiently several variants of the proposed architectures. There are exactly 20 sample images for each character, and an average of 32 characters per alphabet. 
The dataset contains a total of 19,280 images in the background alphabets and 13,180 in the evaluation alphabets.

Algorithms are evaluated on a series of recognition problems. Each recognition problem involves identifying the image, in a set of 20, that shows the same character as an exemplar image (there is always exactly one match). All of the characters in a single problem belong to the same alphabet. At test time, given a collection of characters (x_1, . . . , x_m), the function is evaluated on each pair (z, x_i) and the candidate with the highest score is declared the match. In the case of the learnet architectures, this can be interpreted as obtaining the parameters W = ω(z; W′) and then evaluating a static network φ(x_i; W) for each x_i.

Architecture. The baseline stream φ for the siamese, siamese learnet, and single-stream learnet architectures consists of 3 convolutional layers, with 2 × 2 max-pooling layers of stride 2 between them. The filter sizes are 5 × 5 × 1 × 16, 5 × 5 × 16 × 64 and 4 × 4 × 64 × 512. For both the siamese learnet and the single-stream learnet, ω consists of the same layers as φ, except that the number of outputs is 1600 – one for each element of the 64 predicted filters (of size 5 × 5). To keep the experiments simple, we only predict the parameters of one convolutional layer. We conducted cross-validation to choose the predicted layer and found that the second convolutional layer yields the best results for both of the proposed variants.

Siamese nets have previously been applied to this problem by Koch et al. [10] using much deeper networks applied to images of size 105 × 105. However, we have restricted this investigation to relatively shallow networks to enable a thorough exploration of the parameter space. 
A more powerful algorithm for one-shot learning, Hierarchical Bayesian Program Learning [13], is able to achieve human-level performance. However, this approach involves computationally expensive inference at test time, and leverages extra information at training time that describes the strokes drawn by the human author.

Learning. Learning involves minimizing the objective function specific to each method (e.g. eq. (2) for learnet and eq. (3) for siamese architectures) and uses stochastic gradient descent (SGD) in all cases. As noted in sect. 2, the objective is obtained by sampling triplets (z_i, x_i, ℓ_i) where exemplars z_i and x_i are congruous (ℓ_i = +1) or incongruous (ℓ_i = −1) with 50% probability. We consider 100,000 random pairs for training per epoch, and train for 60 epochs. We conducted a random search to find the best hyper-parameters for each algorithm (initial learning rate and geometric decay, standard deviation of Gaussian parameter initialization, and weight decay).

Results and discussion. Tab. 1 shows the classification error obtained using variants of each architecture. A dash indicates a failure to converge given a large range of hyper-parameters. The two learnet architectures combined with the weighted ℓ1 distance are able to achieve significantly better results than other methods. The best architecture reduced the error from 37.3% for a siamese network with shared parameters to 28.6% for a single-stream learnet.

While the Euclidean distance gave the best results for siamese networks with shared parameters, better results were achieved by learnets (and siamese networks with unshared parameters) using a weighted ℓ1 distance. In fact, none of the alternative architectures are able to achieve lower error under the Euclidean distance than the shared siamese net. 
The dot product was, in general, less effective than the other two metrics.

The introduction of the factorization in the convolutional layer might be expected to improve the quality of the estimated model by reducing the number of parameters, or to worsen it by diminishing the capacity of the hypothesis space. For this relatively simple task of character recognition, the factorization did not seem to have a large effect.

3.3 Object tracking

The task of single-target object tracking requires locating an object of interest in a sequence of video frames. A video frame can be seen as a collection F = {w_1, . . . , w_K} of image windows; then, in a one-shot setting, given an exemplar z ∈ F_1 of the object in the first frame F_1, the goal is to identify the same window in the other frames F_2, . . . , F_M.

Datasets. The method is trained using the ImageNet Large Scale Visual Recognition Challenge 2015 dataset [19], with 3,862 videos totalling more than one million annotated frames. Instances of objects of thirty different classes (mostly vehicles and animals) are annotated throughout each video with bounding boxes. For tracking, instance labels are retained but object class labels are ignored. We use 90% of the videos for training, while the other 10% are held out to monitor validation error during network training. Testing uses the VOT 2015 benchmark [11].

Architecture. We experiment with siamese and siamese learnet architectures (fig. 1) where the learnet ω predicts the parameters of the second (dynamic) convolutional layer of the siamese streams. Each siamese stream has five convolutional layers and we test three variants: variant (A) has the same configuration as AlexNet [12] but with stride 2 in the first layer, while variants (B) and (C) reduce the number of filters in the first two convolutional layers to 50%, and the number of filters in the last layer to 25% and 12.5%, respectively.

Training. 
In order to train the architecture efficiently from many windows, the data is prepared as follows. Given an object bounding box sampled at random, a crop z, double the size of the bounding box, is extracted from the corresponding frame, padding with the average image color when needed. The border is included in order to incorporate some visual context around the exemplar object. Next, ℓ ∈ {+1, −1} is sampled at random with 75% probability of being positive. If ℓ = −1, an image x is extracted by choosing at random a frame that does not contain the object. Otherwise, a second frame containing the same object and within 50 temporal samples of the first is selected at random. From that, a patch x centered on the object and four times bigger is extracted. In this way, x contains both subwindows that do and do not match z. Images z and x are resized to 127 × 127 and 255 × 255 pixels, respectively, and the triplet (z, x, ℓ) is formed. All 127 × 127 subwindows in x are considered to not match z, except for the central 2 × 2 ones when ℓ = +1.

Table 2: Tracking accuracy and number of tracking failures in the VOT 2015 Benchmark, as reported by the toolkit [11]. Architectures are grouped by size of the main network (see text). For each group, the best entry for each column is in bold. 
We also report the scores of 5 recent trackers.

Method                            Accuracy   Failures
Siamese (φ=B)                       0.465       105
Siamese (φ=B; unshared)             0.447       131
Siamese (φ=B; factorized)           0.444       138
Siamese learnet (φ=B; ω=A)          0.500        87
Siamese learnet (φ=B; ω=B)          0.497        93
DAT [17]                            0.442       113
SO-DLT [23]                         0.540       108

Method                            Accuracy   Failures
Siamese (φ=C)                       0.466       120
Siamese (φ=C; factorized)           0.435       132
Siamese learnet (φ=C; ω=A)          0.483       105
Siamese learnet (φ=C; ω=C)          0.491       106
DSST [3]                            0.483       163
MEEM [24]                           0.458       107
MUSTer [7]                          0.471       132

All networks are trained from scratch using SGD for 50 epochs of 50,000 sample triplets (z_i, x_i, ℓ_i). The multiple windows contained in x are compared to z efficiently by making the comparison layer Γ convolutional (fig. 1), accumulating a logistic loss across spatial locations. The same hyper-parameters (learning rate of 10^-2 geometrically decaying to 10^-5, weight decay of 0.005, and small mini-batches of size 8) are used for all experiments, which we found to work well for both the baseline and proposed architectures. The weights are initialized using the improved Xavier [6] method, and we use batch normalization [8] after all linear layers.

Testing. Adopting the initial crop as exemplar, the object is sought in a new frame within a radius of the previous position, proceeding sequentially. This is done by evaluating the pupil net convolutionally, as well as searching at five possible scales in order to track the object through scale space. The approach is described in more detail in Bertinetto et al. [1].

Results and discussion. Tab. 2 compares the methods in terms of the official metrics (accuracy and number of failures) for the VOT 2015 benchmark [11]. The ranking plot produced by the VOT toolkit is provided in the supplementary material (fig. B.1). From tab. 
2, it can be observed that factorizing the filters in the siamese architecture significantly diminishes its performance, but using a learnet to predict the filters in the factorization closes this gap and in fact achieves better performance than the original siamese net. The performance of the learnet architectures is not adversely affected by using the slimmer prediction networks B and C (with fewer channels).
An elementary tracker based on the learnet compares favourably against recent tracking systems that make use of different features and online model update strategies: DAT [17], DSST [3], MEEM [24], MUSTer [7] and SO-DLT [23]. SO-DLT in particular is a good example of directly adapting standard batch deep learning methodology to online learning, as it uses SGD during tracking to fine-tune an ensemble of deep convolutional networks. However, this online adaptation of the model comes at a high computational cost and limits the speed of the method, which runs at 5 frames per second (FPS) on a GPU. Due to the feed-forward nature of our one-shot learnets, they can track objects in real time at frame rates in excess of 60 FPS, while incurring fewer tracking failures. We consider, however, that our implementation serves mostly as a proof of concept, using tracking as an interesting demonstration of one-shot learning, and is orthogonal to many technical improvements found in the tracking literature [11].

4 Conclusions

In this work, we have shown that it is possible to obtain the parameters of a deep neural network using a single, feed-forward prediction from a second network. This approach is desirable when iterative methods are too slow, and when large sets of annotated training samples are not available. We have demonstrated the feasibility of feed-forward parameter prediction in two demanding one-shot learning tasks in OCR and visual tracking.
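To make this concrete, the following toy numpy sketch (ours, not the paper's architecture) shows the shape of the idea: a learnet maps a single exemplar to the parameters of a pupil model in one feed-forward pass. Fixing the learnet to the identity reduces the pupil to an inner-product comparator, i.e. the siamese special case; in the paper the learnet is itself a deep network, trained end-to-end, that predicts factorized convolutional filters:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                   # toy feature dimensionality

# Learnet parameters: a linear map from exemplar to pupil weights.
# (The identity makes the pupil an inner-product comparator, i.e.
# the siamese special case; a trained learnet would be a deep net.)
M = np.eye(D)


def learnet(z):
    """Predict the pupil's parameters from one exemplar, feed-forward."""
    return M @ z


def pupil(w, x):
    """Score a candidate x using the predicted parameters w."""
    return float(w @ x)


z = rng.standard_normal(D)               # the single exemplar
w = learnet(z)                           # one shot: no SGD at test time
same = pupil(w, z)                       # exemplar vs itself: positive
```

At test time there is no iterative optimization: a single forward pass through the learnet yields a ready-to-use pupil, which is what allows tracking at high frame rates.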
Our results hint at a promising avenue of research in "learning to learn" by solving millions of small discriminative problems in an offline phase. Possible extensions include domain adaptation and sharing a single learnet between different pupil networks.

Acknowledgements

This research was supported by Apical Ltd. and ERC grants ERC-2012-AdG 321162-HELIOS, HELIOS-DFR00200 and "Integrated and Detailed Image Understanding" (EP/L024683/1).

References

[1] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. Fully-convolutional siamese networks for object tracking. In ECCV Workshops, 2016.
[2] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 1993.
[3] M. Danelljan, G. Häger, F. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In BMVC, 2014.
[4] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In NIPS, 2013.
[5] H. Fan, Z. Cao, Y. Jiang, Q. Yin, and C. Doudou. Learning deep face representation. arXiv, 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[7] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao. Multi-store tracker (MUSTer): A cognitive psychology inspired approach to object tracking. In CVPR, 2015.
[8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015.
[9] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv, 2013.
[10] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition.
In ICML Deep Learning Workshop, 2015.
[11] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder. The VOT2015 challenge results. In ICCV Workshops, 2015.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[13] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[14] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, 2011.
[15] H. Noh, P. Hongsuck Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR, 2016.
[16] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
[17] H. Possegger, T. Mauthner, and H. Bischof. In defense of color-based model-free tracking. In CVPR, 2015.
[18] D. J. Rezende, S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra. One-shot generalization in deep generative models. arXiv, 2016.
[19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[20] J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[21] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
[22] A. Vedaldi and K. Lenc. MatConvNet – Convolutional Neural Networks for MATLAB. In Proceedings of the ACM Int. Conf. on Multimedia, 2015.
[23] N. Wang, S. Li, A. Gupta, and D.-Y. Yeung. Transferring rich feature hierarchies for robust visual tracking. arXiv, 2015.
[24] J. Zhang, S. Ma, and S.
Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In ECCV, 2014.