{"title": "Deep Fisher Networks for Large-Scale Image Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 163, "page_last": 171, "abstract": "As massively parallel computations have become broadly available with modern GPUs, deep architectures trained on very large datasets have risen in popularity. Discriminatively trained convolutional neural networks, in particular, were recently shown to yield state-of-the-art performance in challenging image classification benchmarks such as ImageNet. However, elements of these architectures are similar to standard hand-crafted representations used in computer vision. In this paper, we explore the extent of this analogy, proposing a version of the state-of-the-art Fisher vector image encoding that can be stacked in multiple layers. This architecture significantly improves on standard Fisher vectors, and obtains competitive results with deep convolutional networks at a significantly smaller computational cost. Our hybrid architecture allows us to measure the performance improvement brought by a deeper image classification pipeline, while staying in the realms of conventional SIFT features and FV encodings.", "full_text": "Deep Fisher Networks\n\nfor Large-Scale Image Classi\ufb01cation\n\nKaren Simonyan\n\nAndrea Vedaldi\n\nVisual Geometry Group, University of Oxford\n{karen,vedaldi,az}@robots.ox.ac.uk\n\nAndrew Zisserman\n\nAbstract\n\nAs massively parallel computations have become broadly available with modern\nGPUs, deep architectures trained on very large datasets have risen in popular-\nity. Discriminatively trained convolutional neural networks, in particular, were\nrecently shown to yield state-of-the-art performance in challenging image classi-\n\ufb01cation benchmarks such as ImageNet. However, elements of these architectures\nare similar to standard hand-crafted representations used in computer vision. 
In\nthis paper, we explore the extent of this analogy, proposing a version of the state-\nof-the-art Fisher vector image encoding that can be stacked in multiple layers.\nThis architecture signi\ufb01cantly improves on standard Fisher vectors, and obtains\ncompetitive results with deep convolutional networks at a smaller computational\nlearning cost. Our hybrid architecture allows us to assess how the performance of\na conventional hand-crafted image classi\ufb01cation pipeline changes with increased\ndepth. We also show that convolutional networks and Fisher vector encodings are\ncomplementary in the sense that their combination further improves the accuracy.\n\nIntroduction\n\n1\nDiscriminatively trained deep convolutional neural networks (CNN) [18] have recently achieved im-\npressive state of the art results over a number of areas, including, in particular, the visual recognition\nof categories in the ImageNet Large-Scale Visual Recognition Challenge [4]. This success is built\non many years of tuning and incorporating ideas into CNNs in order to improve their performance.\nMany of the key ideas in CNN have now been absorbed into features proposed in the computer vision\nliterature \u2013 some have been discovered independently and others have been overtly borrowed. For\nexample: the importance of whitening [11]; max pooling and sparse coding [26, 33]; non-linearity\nand normalization [20]. Indeed, several standard features and pipelines in computer vision, such as\nSIFT [19] and a spatial pyramid on Bag of visual Words (BoW) [16] can be seen as corresponding\nto layers of a standard CNN. However, image classi\ufb01cation pipelines used in the computer vision\nliterature are still generally quite shallow: either a global feature vector is computed over an im-\nage, and used directly for classi\ufb01cation; or, in a few cases, a two layer hierarchy is used, where the\noutputs of a number of classi\ufb01ers form the global feature vector for the image (e.g. 
attributes and classemes [15, 30]).

The question we address in this paper is whether it is possible to improve the performance of off-the-shelf computer vision features by organising them into a deeper architecture. To this end we make the following contributions: (i) we introduce a Fisher Vector Layer, which is a generalization of the standard FV to a layer architecture suitable for stacking; (ii) we demonstrate that by discriminatively training several such layers and stacking them into a Fisher Vector Network, an accuracy competitive with the deep CNN can be achieved, whilst staying in the realms of conventional SIFT and colour features and FV encodings; and (iii) we show that class posteriors, computed by the deep CNN and FV, are complementary and can be combined to significantly improve the accuracy.

The rest of the paper is organised as follows. After a discussion of the related work, we begin with a brief description of the conventional FV encoding [20] (Sect. 2). We then show how this representation can be modified to be used as a layer in a deeper architecture (Sect. 3) and how the latter can be discriminatively learnt to yield a deep Fisher network (Sect. 4). After discussing important details of the implementation (Sect. 5), we evaluate our architecture on the ImageNet image classification benchmark (Sect. 6).

Related work. There is a vast literature on large-scale image classification, which we briefly review here. One widely used approach is to extract local features such as SIFT [19] densely from each image, aggregate and encode them as high-dimensional vectors, and feed the latter to a classifier, e.g. an SVM. There exists a large variety of different encodings that can be used for this purpose, including the BoW [9, 29] encoding, sparse coding [33], and the FV encoding [20]. 
Since FV was shown\nto outperform other encodings [6] and achieve very good performance on various image recognition\nbenchmarks [21, 28], we use it as the basis of our framework. We note that other recently proposed\nencodings (e.g. [5]) can be readily employed in the place of FV. Most encodings are designed to dis-\nregard the spatial location of features in order to be invariant to image transformations; in practice,\nhowever, retaining weak spatial information yields an improved classi\ufb01cation performance. This\ncan be incorporated by dividing the image into regions, encoding each of them individually, and\nstacking the result in a composite higher-dimensional code, known as a spatial pyramid [16]. The\nalternative, which does not increase the encoding dimensionality, is to augment the local features\nwith their spatial coordinates [24].\nAnother vast family of image classi\ufb01cation techniques is based on Deep Neural Networks (DNN),\nwhich are inspired by the layered structure of the visual cortex in mammals [22]. DNNs can be\ntrained greedily, in a layer-by-layer manner, as in Restricted Boltzmann Machines [12] and (sparse)\nauto-encoders [3, 17], or by learning all layers simultaneously, which is relatively ef\ufb01cient if the\nlayers are convolutional [18]. In particular, the advent of massively-parallel GPUs has recently made\nit possible to train deep convolutional networks on a large scale with excellent performance [7, 14].\nIt was also shown that techniques such as training and test data augmentation, as well as averaging\nthe outputs of independently trained DNNs, can signi\ufb01cantly improve the accuracy.\nThere have been attempts to bridge these two families, exploring the trade-offs between network\ndepth and width, as well as the complexity of the layers. 
For instance, dense feature encoding using the bag of visual words was considered as a single layer of a deep network in [1, 8, 32].

2 Fisher vector encoding for image classification

The Fisher vector encoding φ of a set of features {xp} (e.g. densely computed SIFT features) is based on fitting a parametric generative model, e.g. the Gaussian Mixture Model (GMM), to the features, and then encoding the derivatives of the log-likelihood of the model with respect to its parameters [13]. In the particular case of GMMs with diagonal covariances, used here, this leads to a representation which captures the average first and second order differences between the features and each of the GMM centres [20]:

Φ(1)k = (1 / (N√πk)) Σ_{p=1..N} αk(xp) (xp − µk) / σk,
Φ(2)k = (1 / (N√(2πk))) Σ_{p=1..N} αk(xp) [ (xp − µk)² / σk² − 1 ].   (1)

Here, {πk, µk, σk}k are the mixture weights, means, and diagonal covariances of the GMM, which is computed on the training set and used for the description of all images; αk(xp) is the soft assignment weight of the p-th feature xp to the k-th Gaussian. An FV is obtained by stacking the differences: φ = [Φ(1)1, Φ(2)1, . . . , Φ(1)K, Φ(2)K]. The encoding describes how the distribution of features of a particular image differs from the distribution fitted to the features of all training images. To make the features amenable to the FV description based on the diagonal-covariance GMM, they are first decorrelated by PCA.

The FV dimensionality is 2Kd, where K is the codebook size (the number of Gaussians in the GMM), and d is the dimensionality of the encoded feature vector. For instance, the FV encoding of a SIFT feature (d = 128) using a small GMM codebook (K = 256) is 65.5K-dimensional. This means that high-dimensional feature encodings can be quickly computed using small codebooks. Using the same codebook size, BoW and sparse coding are only K-dimensional and less discriminative, as demonstrated in [6]. From another point of view, given the desired encoding dimensionality, these methods would require 2d-times larger codebooks than needed for FV, which would lead to impractical computation times.

Figure 1: Left: Fisher network (Sect. 4) with two Fisher layers. Right: conventional pipeline using a shallow Fisher vector encoding. As shown in Sect. 6, making the conventional pipeline slightly deeper by injecting a single Fisher layer substantially improves the classification accuracy.

As can be seen from (1), the (unnormalised) FV encoding is additive with respect to image features, i.e. the encoding of an image is an average of the individual encodings of its features. Following [20], FV performance is further improved by passing it through Signed Square-Rooting (SSR) and L2 normalisation. Finally, the high-dimensional FV is usually coupled with a one-vs-rest linear SVM classifier, and together they form a conventional image classification pipeline [21] (see Fig. 1), which serves as a baseline for our classification framework.

3 Fisher layer

The conventional FV representation of an image (Sect. 2) effectively encodes each local feature (e.g. SIFT) into a high-dimensional representation, and then aggregates these encodings into a single vector by global sum-pooling over the whole image (followed by normalisation). This means that the representation describes the image in terms of the local patch features, and cannot capture more complex image structures. 
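For concreteness, the encoding (1) together with the SSR and L2 normalisation steps can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used in the paper, and all variable names are hypothetical:

```python
import numpy as np

def fisher_vector(X, pi, mu, sigma):
    """FV of local features X (N x d) under a diagonal-covariance GMM with
    weights pi (K,), means mu (K x d), and std deviations sigma (K x d)."""
    N, d = X.shape
    # Soft-assignment weights alpha_k(x_p): GMM component responsibilities.
    log_prob = (-0.5 * (((X[:, None, :] - mu[None]) / sigma[None]) ** 2).sum(-1)
                - np.log(sigma).sum(-1)[None]
                - 0.5 * d * np.log(2 * np.pi) + np.log(pi)[None])   # N x K
    log_prob -= log_prob.max(axis=1, keepdims=True)   # numerical stability
    alpha = np.exp(log_prob)
    alpha /= alpha.sum(axis=1, keepdims=True)         # N x K
    # First- and second-order difference statistics of eq. (1).
    diff = (X[:, None, :] - mu[None]) / sigma[None]   # N x K x d
    Phi1 = (alpha[..., None] * diff).sum(0) / (N * np.sqrt(pi)[:, None])
    Phi2 = (alpha[..., None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * pi)[:, None])
    phi = np.concatenate([Phi1.ravel(), Phi2.ravel()])   # 2Kd-dimensional
    # Signed square-rooting (SSR) followed by L2 normalisation.
    phi = np.sign(phi) * np.sqrt(np.abs(phi))
    return phi / max(np.linalg.norm(phi), 1e-12)
```

For d = 128 and K = 256 this returns the 65.5K-dimensional encoding mentioned above.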
Deep neural networks model feature hierarchies by passing the output of one feature computation layer as the input to the next one. We adopt a similar approach here, and devise a feed-forward feature encoding layer (which we term a Fisher layer), which is based on off-the-shelf Fisher vector encoding. The layers can then be stacked into a deep network, which we call a Fisher network.

The architecture of the l-th Fisher layer is depicted in Fig. 2. As input, it receives dl-dimensional features (dl ∼ 10²), densely computed over multiple scales on a regular image grid. The features are assumed to be decorrelated using PCA. The layer then performs a feed-forward feature transformation in three sub-layers.

The first one computes semi-local FV encodings by pooling the input features not from the whole image, but from a dense set of semi-local regions. The resulting FVs form a new set of densely sampled features that are more discriminative than the input ones and less local, as they integrate information from larger image areas. The FV encoder (Sect. 2) uses a layer-specific GMM with Kl components, so the dimensionality of each FV is 2Kldl, which, considering that FVs are computed densely, might be too large for practical applications. Therefore, we decrease the FV dimensionality by projection onto an hl-dimensional subspace using a discriminatively trained linear projection Wl ∈ R^{hl×2Kldl}. In practice, this is carried out using an efficient variant of the FV encoder (Sect. 5). In the second sub-layer, the spatially adjacent features are stacked in a 2 × 2 window, which produces a 4hl-dimensional dense feature representation. Finally, the features are L2-normalised and PCA-projected to a dl+1-dimensional subspace using the linear projection Ul ∈ R^{dl+1×4hl}, and passed as the input to the (l + 1)-th layer. 
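At the level of array shapes, one pass through such a layer can be sketched as follows. This is an illustrative NumPy sketch under assumed shapes, not the paper's Matlab/MEX implementation; it starts from per-location encodings already projected by Wl (Sect. 5 describes how this projection is computed efficiently), and the learning of Wl and Ul is described in Sect. 4:

```python
import numpy as np

def fisher_layer_forward(E, U, q, delta):
    """One Fisher layer at shape level.

    E: H x W x h map of per-location FV encodings, already projected to
       h dimensions by the layer's discriminative projection W_l.
    U: (d_next x 4h) PCA projection of sub-layer 3.
    Returns an H' x W' x d_next map, the input to the next layer.
    """
    H, W, h = E.shape
    # Sub-layer 1: semi-local sum-pooling over q x q windows with stride
    # delta; by FV additivity, each sum is the projected FV of that window.
    ys = range(0, H - q + 1, delta)
    xs = range(0, W - q + 1, delta)
    P = np.array([[E[y:y + q, x:x + q].sum(axis=(0, 1)) for x in xs]
                  for y in ys])
    # Sub-layer 2: stack spatially adjacent descriptors in a 2 x 2 window.
    S = np.concatenate([P[:-1, :-1], P[:-1, 1:], P[1:, :-1], P[1:, 1:]],
                       axis=-1)
    # Sub-layer 3: L2-normalise each 4h-dim descriptor, then PCA-project.
    S /= np.maximum(np.linalg.norm(S, axis=-1, keepdims=True), 1e-12)
    return S @ U.T
```

For the configuration used in Sect. 6.1 (h1 = 200, d2 = 200), U would be a 200 × 800 matrix; in the actual network the pooling step is additionally repeated for several window sizes ql.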
Each sub-layer is explained in more detail below.

Figure 2: The architecture of a single Fisher layer. Left: the arrows illustrate the data flow through the layer; the dimensionality of densely computed features is shown next to the arrows. Right: spatial pooling (the blue squares) and stacking (the red square) in sub-layers 1 and 2 respectively.

Fisher vector pooling (sub-layer 1). The key idea behind the first sub-layer is to aggregate the FVs of individual features over a family of semi-local spatial neighbourhoods. These neighbourhoods are overlapping square regions of size ql × ql, sampled every δl pixels (see Fig. 2); compared to the regions used in global or spatial pyramid pooling [20], these are smaller and sampled much more densely. As a result, instead of a single FV, describing the whole image, the image is represented by a large number of densely computed semi-local FVs, each of which describes a spatially adjacent set of local features, computed by the previous layer. Thus, the new feature representation can capture more complex image statistics with larger spatial support. We note that due to additivity, computing the FV of a spatial neighbourhood corresponds to the sum-pooling over the neighbourhood, a stage widely used in DNNs. The high dimensionality of Fisher vectors, however, brings up the computational complexity issue, as storing and processing thousands of dense FVs per image (each of which is 2Kldl-dimensional) is prohibitive at large scale. 
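Since the (projected) encodings are additive, the sums over all ql × ql windows can be obtained from a single summed-area table instead of summing each window independently; a sketch of this standard trick (illustrative, with hypothetical names) is given below. Note that it reduces only the pooling cost, not the dimensionality of each pooled vector:

```python
import numpy as np

def dense_window_sums(E, q, delta):
    """Sum-pool an H x W x h encoding map over all q x q windows with
    stride delta, using an integral image (summed-area table)."""
    H, W, h = E.shape
    I = np.zeros((H + 1, W + 1, h))
    I[1:, 1:] = E.cumsum(axis=0).cumsum(axis=1)   # integral image
    ys = np.arange(0, H - q + 1, delta)
    xs = np.arange(0, W - q + 1, delta)
    # Each window sum takes four integral-image lookups.
    return (I[ys + q][:, xs + q] - I[ys][:, xs + q]
            - I[ys + q][:, xs] + I[ys][:, xs])
```

Each window sum costs four lookups regardless of ql, so pooling over several window sizes (see the multi-scale computation of Sect. 3) can reuse the same table.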
We tackle this problem by employing\ndiscriminative dimensionality reduction for high-dimensional FVs, which makes the layer learning\nprocedure supervised. The dimensionality reduction is carried out using a linear projection Wl onto\nan hl-dimensional subspace. As will be shown in Sect. 5, compressed FVs can be computed very\nef\ufb01ciently without the need to compute the full-dimensional FVs \ufb01rst, and then project them down.\nA similar approach (passing the output of a feature encoder to another encoder) has been previously\nemployed by [1, 8, 32], but in their case they used bag-of-words or sparse coding representations. As\nnoted in [8], such encodings require large codebooks to produce discriminative feature representa-\ntions. This, in turn, makes these approaches hardly applicable to the datasets of ImageNet scale [4].\nAs explained in Sect. 2, FV encoders do not require large codebooks, and by employing supervised\ndimensionality reduction, we can preserve the discriminative ability of FV even after the projection\nonto a low-dimensional space, similarly to [10].\nSpatial stacking (sub-layer 2). After the dimensionality-reduced FV pooling (Sect. 3), an image is\nrepresented as a spatially dense set of low-dimensional multi-scale discriminative features. It should\nbe noted that local sum-pooling, while making the representation invariant to small translations, is\nagnostic to the relative location of aggregated features. To capture the spatial structure within each\nfeature\u2019s neighbourhood, we incorporate the stacking sub-layer, which concatenates the spatially\nadjacent features in a 2\u00d7 2 window (Fig. 2). This step is similar to 4\u00d7 4 stacking employed in SIFT.\nNormalisation and PCA projection (sub-layer 3). After stacking, the features are L2-normalised,\nwhich improves their invariance properties. This procedure is closely related to Local Contrast\nNormalisation, widely used in DNNs. 
Finally, before passing the features to the FV encoder of the next layer, PCA dimensionality reduction is carried out, which serves two purposes: (i) the features are decorrelated so that they can be modelled using the diagonal-covariance GMMs of the next layer; (ii) the dimensionality is reduced from 4hl to dl+1 to keep the image representation compact and the computational complexity limited.

Multi-scale computation. In practice, the Fisher layer computation is repeated at multiple scales by changing the pooling window size ql (the PCA projection in sub-layer 3 is the same for all scales). This allows a single layer to capture multi-scale statistics, which is different from typical DNN architectures, which use a single pooling window size per layer. The resulting dense multi-scale features, computed by the layer, form the input of the next layer (similarly to the dense multi-scale SIFT features). In Sect. 6 we show that a multi-scale Fisher layer indeed brings an improvement, compared to a fixed pooling window size.

4 Fisher network

Our image classification pipeline, which we coin Fisher network (shown in Fig. 1), is constructed by stacking several (at least one) Fisher layers (Sect. 3) on top of dense features, such as SIFT or raw image patches. The penultimate layer, which computes a single-vector image representation, is a special case of the Fisher layer, where sum-pooling is only performed globally over the whole image. We call this layer the global Fisher layer, and it effectively computes a full-dimensional normalised Fisher vector encoding (the dimensionality reduction stage is omitted since the computed FV is directly used for classification). 
The \ufb01nal layer is an off-the-shelf ensemble of one-vs-rest binary\nlinear SVMs. As can be seen, a Fisher network generalises the standard FV pipeline of [20], as the\nlatter corresponds to the network with a single global Fisher layer.\nMulti-layer image descriptor. Each subsequent Fisher layer is designed to capture more complex,\nhigher-level image statistics, but the competitive performance of shallow FV-based frameworks [21]\nsuggests that low-level SIFT features are already discriminative enough to distinguish between a\nnumber of image classes. To fully exploit the hierarchy of Fisher layers, we branch out a globally\npooled, normalised FV from each of the Fisher layers, not just the last one. These image representa-\ntions are then concatenated to produce a rich, multi-layer image descriptor. A similar approach has\npreviously been applied to convolutional networks in [25].\n4.1 Learning\nThe Fisher network is trained in a supervised manner, since each Fisher layer (apart from the global\nlayer) depends on discriminative dimensionality reduction. The network is trained greedily, layer by\nlayer. Here we discuss how the (non-global) Fisher layer can be ef\ufb01ciently trained in the large-scale\nscenario, and introduce two options for the projection learning objective.\nProjection learning proxy. As explained in Sect. 3, we need to learn a discriminative projection\nW to signi\ufb01cantly reduce the dimensionality of the densely-computed semi-local FVs. At the same\ntime, the only annotation available for discriminative learning in our case is the class label of the\nwhole image. We exploit this information by requiring that projected semi-local FVs are good\npredictors of the image class. 
Taking into account that (i) it may be unreasonable to require all local feature occurrences to predict the object class (the support of some features may not even cover the object), and (ii) there are too many features to use all of them in learning (∼ 10⁴ semi-local FVs for each of the ∼ 10⁶ training images), we optimise the average class prediction of all the features in a layer, rather than the prediction of individual feature occurrences.

In particular, we construct a learning proxy by computing the average ψ of all unnormalised, unprojected semi-local FVs φs of an image, ψ = (1/S) Σ_{s=1..S} φs, and defining the learning constraints on ψ using the image label. Considering that Wψ = (1/S) Σ_{s=1..S} Wφs, the projection W, learnt for ψ, is also applicable to the individual semi-local FVs φs. The advantages of the proxy are that the image-level class annotation can now be utilised, and during projection learning we only need to store a single vector ψ per image. In the sequel, we define two options for the projection learning objective, which are then compared in Sect. 6.

Bi-convex max-margin projection learning. One approach to discriminative dimensionality reduction learning consists in finding the projection onto a subspace where the image classes are as linearly separable as possible [10, 31]. This corresponds to the bilinear class scoring function vcᵀ W ψ, where W is the linear projection which we seek to optimise and vc is the linear model (e.g. an SVM) of the class c in the projected space. The max-margin optimisation problem for W and the ensemble {vc} takes the following form:

min_{W, {vc}} Σ_i Σ_{c′ ≠ c(i)} max[ (vc′ − vc(i))ᵀ W ψi + 1, 0 ] + (λ/2) Σc ‖vc‖₂² + (µ/2) ‖W‖F²,   (2)

where c(i) is the ground-truth class of an image i, and λ and µ are the regularisation constants. The learning objective is bi-convex in W and vc, and a local optimum can be found by alternation between the convex problems for W and {vc}, both of which can be solved in the primal using a stochastic sub-gradient method [27]. We initialise the alternation by setting W to the PCA-whitening matrix W0. Once the optimisation has converged, the classifiers vc are discarded, and we keep the projection W.

Projection onto the space of classifier scores. Another dimensionality reduction technique, which we consider in this work, is to train one-vs-rest SVM classifiers {uc}, c = 1, . . . , C, on the full-dimensional FVs ψ, and then use the C-dimensional vector of SVM outputs as the compressed representation of ψ. This corresponds to setting the c-th row of the projection matrix W to the SVM model uc. This approach is closely related to attribute-based representations and classemes [15, 30], but in our case we do not use any additional data annotated with a different set of (attribute) classes to train the models; instead, the C = 1000 classifiers trained directly on the ILSVRC dataset are used. If a specific target dimensionality is required, PCA dimensionality reduction can be further applied to the classifier scores [10], but in our case we applied PCA after spatial stacking (Sect. 3).

The advantage of using SVM models for dimensionality reduction is, mostly, computational. As we will show in Sect.
6, both formulations exhibit a similar level of performance, but training C one-vs-rest classifiers is much faster than performing the alternation between SVM learning and projection learning in (2). The reason is that one-vs-rest SVM training can be easily parallelised, while projection learning is significantly slower even when using a parallel gradient descent implementation.

5 Implementation details

Efficient computation of hard-assignment Fisher vectors. In the original FV encoding formulation (1), each feature is soft-assigned to all K Gaussians of the GMM by computing the assignment weights αk(xp) as the responsibilities of GMM component k for feature p: αk(xp) = πk Nk(xp) / Σj πj Nj(xp), where Nk(xp) is the likelihood of the k-th Gaussian. To facilitate the efficient computation of a large number of dense FVs per image, we introduce and utilise a fast variant of FV (which we term hard-FV), which uses hard assignments of features to Gaussians, computed as

αk(xp) = 1 if k = arg maxj πj Nj(xp), and αk(xp) = 0 otherwise.   (3)

Hard-FVs are inherently sparse; this allows for the fast computation of projected FVs Wlφ. Indeed, it is easy to show that Wlφ = Σ_{k=1..Kl} Σ_{p∈Ωk} [ W(k,1)l Φ(1)k(p) + W(k,2)l Φ(2)k(p) ], where Ωk is the set of input vectors hard-assigned to the GMM component k, and W(k,1)l, W(k,2)l are the sub-matrices of Wl which correspond to the 1st and 2nd order differences Φ(1),(2)k(p) between the feature xp and the k-th GMM mean (1). This suggests the fast computation procedure: each dl-dimensional input feature xp is first hard-assigned to a Gaussian k based on (3). Then, the corresponding dl-dimensional differences Φ(1),(2)k(p) are computed and projected using the small hl×dl sub-matrices W(k,1)l, W(k,2)l, which is fast. The algorithm avoids computing high-dimensional FVs, followed by the projection using a large matrix Wl ∈ R^{hl×2Kldl}, which is prohibitive since the number of dense FVs is high.

Implementation. Our SIFT feature extraction follows that of [21]. Images are rescaled so that the number of pixels is 100K. Dense RootSIFT [2] is computed on 24 × 24 patches over 5 scales (scale factor ∛2) with a 3 pixel step. We also employ SIFT augmentation with the patch spatial coordinates [24]. During training, high-dimensional FVs, computed by the 2nd Fisher layer, are compressed using product quantisation [23]. The learning framework is implemented in Matlab, speeded up with C++ MEX. The computation is carried out on CPU without the use of GPU. Training the Fisher network on top of SIFT descriptors on 1.2M images of the ILSVRC-2010 [4] dataset takes about one day on a 200-core cluster. Image classification time is ∼ 2s on a single core.

6 Evaluation

In this section, we evaluate the proposed Fisher network on the dataset introduced for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2010 [4]. It contains images of 1000 categories, with 1.2M images available for training, 50K for validation, and 150K for testing. Following the standard evaluation protocol for the dataset, we report both top-1 and top-5 accuracy (%) computed on the test set. Sect. 6.1 evaluates variants of the Fisher network on a subset of ILSVRC to identify the best one. Then, Sect. 6.2 evaluates the complete framework.

6.1 Fisher network variants

We begin by comparing the performance of the Fisher network under different settings. The comparison is carried out on a subset of ILSVRC, which was obtained by random sampling of 200 classes out of 1000. To avoid over-fitting indirectly on the test set, comparisons in this section are carried out on the validation set. In our experiments, we used SIFT as the first layer of the network, followed by two Fisher layers (the second one is global, as explained in Sect. 4).

Dimensionality reduction, stacking, and normalisation. Here we quantitatively assess the three sub-layers of a Fisher layer (Sect. 3). We compare the two proposed dimensionality reduction learning schemes (bi-convex learning and classifier scores), and also demonstrate the importance of spatial stacking and L2 normalisation. The results are shown in Table 1.

Table 1: Evaluation of dimensionality reduction, stacking, and normalisation sub-layers on a 200 class subset of ILSVRC-2010. The following configuration of Fisher layers was used: d1 = 128, K1 = 256, q1 = 5, δ1 = 1, h1 = 200 (number of classes), d2 = 200, K2 = 256. The baseline performance of a shallow FV encoding is 57.03% and 78.9% (top-1 and top-5 accuracy).

dim-ty reduction    stacking   L2 norm-n   top-1   top-5
classifier scores              ✓           59.69   80.29
classifier scores   ✓                      59.42   80.44
classifier scores   ✓          ✓           60.22   80.93
bi-convex           ✓          ✓           59.49   81.11

Table 2: Evaluation of multi-scale pooling and multi-layer image description on the subset of ILSVRC-2010. The following configuration of Fisher layers was used: d1 = 128, K1 = 256, h1 = 200, d2 = 200, K2 = 256. Both Fisher layers used spatial coordinate augmentation. The baseline performance of a shallow FV encoding is 59.51% and 80.50% (top-1 and top-5 accuracy).

pooling window size q1   pooling stride δ1   multi-layer   top-1   top-5
5                        1                                 61.56   82.21
{5, 7, 9, 11}            2                                 62.16   82.43
{5, 7, 9, 11}            2                   ✓             63.79   83.73
As can be seen, both spatial\nstacking and L2 normalisation improve the performance, and dimensionality reduction via projec-\ntion onto the space of SVM classi\ufb01er scores performs on par with the projection learnt using the\nbi-convex formulation (2). In the following experiments we used the classi\ufb01er scores for dimension-\nality reduction, since their training can be parallelised and is signi\ufb01cantly faster.\nMulti-scale pooling and multi-layer image representation. In this experiment, we compare the\nperformance of semi-local FV pooling using single and multiple window sizes (Sect. 3), as well as\nsingle- and multi-layer image representations (Sect. 4). From Table 2 it is clear that using multi-\nple pooling window sizes is bene\ufb01cial compared to a single window size. When using multi-scale\npooling, the pooling stride was increased to keep the number of pooled semi-local FVs roughly the\nsame. Also, the multi-layer image descriptor obtained by stacking globally pooled and normalised\nFVs, computed by the two Fisher layers, outperforms each of these FVs taken separately. We also\nnote that in this experiment, unlike the previous one, both Fisher layers utilized spatial coordinate\naugmentation of the input features, which leads to a noticeable boost in the shallow baseline perfor-\nmance (from 78.9% to 80.50% top-5 accuracy). Apart from our Fisher network, multi-scale pooling\ncan be readily employed in convolutional networks.\n6.2 Evaluation on ILSVRC-2010\nNow that we have evaluated various Fisher layer con\ufb01gurations on a subset of ILSVRC, we assess\nthe performance of our framework on the full ILSVRC-2010 dataset. 
We use off-the-shelf SIFT and colour features [20] in the feature extraction layer, and demonstrate that significant improvements can be achieved by injecting a single Fisher layer into the conventional FV-based pipeline [23]. The following configuration of Fisher layers was used: d1 = 80, K1 = 512, q1 = {5, 7, 9, 11}, δ1 = 2, h1 = 1000, d2 = 256, K2 = 256. On both Fisher layers, we used spatial coordinate augmentation of the input features. The first Fisher layer uses a large number of GMM components K1, since it was found to be beneficial for shallow FV encodings [23], used here as a baseline. The one-vs-rest SVM scores were Platt-calibrated on the validation set (we did not use calibration for semi-local FV dimensionality reduction).

The results are shown in Table 3. First, we note that the globally pooled Fisher vector, branched out of the first Fisher layer (which effectively corresponds to the conventional FV encoding [23]), results in better accuracy than reported in [23], which validates our implementation. Using the 2nd Fisher layer on top of the 1st one leads to a significant performance improvement. Finally, stacking the FVs produced by the 1st and 2nd Fisher layers pushes the accuracy even further.

Table 3: Performance on ILSVRC-2010 using dense SIFT and colour features. We also specify the dimensionality of SIFT-based image representations.

setting                                   dimension   SIFT only (top-1 / top-5)   SIFT & colour (top-1 / top-5)
1st Fisher layer                          82K         46.52 / 68.45               55.35 / 76.35
2nd Fisher layer                          131K        48.54 / 71.35               56.20 / 77.68
multi-layer (1st and 2nd Fisher layers)   213K        52.57 / 73.68               59.47 / 79.20
Sánchez et al. [23]                       524K        N/A / 67.9                  54.3 / 74.3

The state of the art on the ILSVRC-2010 dataset was obtained using an 8-layer convolutional network [14], i.e. twice as deep as the Fisher network considered here. 
Using training and test set augmentation based on jittering (not employed here), they achieved a top-1 / top-5 accuracy of 62.5% / 83.0%. Without test set augmentation (i.e. using only the original images for class scoring), their result is 61% / 81.7%. In our case, we augmented neither the training nor the test set, and achieved 59.5% / 79.2%. For reference, our baseline shallow FV accuracy is 55.4% / 76.4%. We conclude that injecting a single intermediate layer leads to a significant performance boost (+4.1% top-1 accuracy), but deep CNNs are still somewhat better (+1.5% top-1 accuracy). These results are nevertheless encouraging, since they were obtained using off-the-shelf features and encodings, reconfigured to add a single intermediate layer. Notably, our model required neither an optimised GPU implementation, nor over-fitting control techniques such as dropout [14] and training set augmentation.
Finally, we demonstrate that the Fisher network and deep CNN representations are complementary by combining the class posteriors obtained from the CNN with those of the Fisher network. To this end, we re-implemented the deep CNN of [14] using their publicly available cuda-convnet toolbox. Our implementation performs slightly better, giving 62.91% / 83.19% (with test set augmentation). Multiplying the CNN and Fisher network posteriors leads to significantly improved accuracy: 66.75% / 85.64%. It should be noted that another way of improving CNN accuracy, used in [14] on the ImageNet-2012 dataset, consists of training several CNNs and averaging their posteriors.
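The posterior combination described above is simply an element-wise product of the two models' class posteriors, renormalised per image, followed by top-k scoring. A minimal sketch (function names and toy data are ours, not from the paper's code):

```python
import numpy as np

def combine_posteriors(p_cnn, p_fisher):
    """Element-wise product of two models' class posteriors,
    renormalised so that each image's scores sum to one."""
    p = p_cnn * p_fisher
    return p / p.sum(axis=1, keepdims=True)

def top_k_accuracy(posteriors, labels, k=5):
    """Fraction of images whose true label is among the k top-scoring classes."""
    top_k = np.argsort(-posteriors, axis=1)[:, :k]
    return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))

# Toy example: two images, three classes.
p_cnn = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
p_fv = np.array([[0.5, 0.4, 0.1],
                 [0.1, 0.6, 0.3]])
combined = combine_posteriors(p_cnn, p_fv)
```

The product favours classes on which both models agree, which is one simple way to exploit complementary error patterns; averaging the posteriors, as done for CNN ensembles in [14], is an alternative.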
Further study of the complementarity of various deep and shallow representations is beyond the scope of this paper, and will be addressed in future research.
7 Conclusion
We have shown that Fisher vectors, a standard image encoding method, can be stacked in multiple layers, in analogy to the state-of-the-art deep neural network architectures. Adding a single layer is in fact sufficient to significantly boost the performance of these shallow image encodings, bringing their performance closer to the state of the art in the large-scale classification scenario [14]. The fact that off-the-shelf image representations can be simply and successfully stacked indicates that deep schemes may extend well beyond neural networks.
Acknowledgements
This work was supported by ERC grant VisRec no. 228180.
References
[1] A. Agarwal and B. Triggs. Hyperfeatures: multilevel local coding for visual recognition. In Proc. ECCV, pages 30–43, 2006.
[2] R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In Proc. CVPR, 2012.
[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, pages 153–160, 2006.
[4] A. Berg, J. Deng, and L. Fei-Fei. Large scale visual recognition challenge (ILSVRC), 2010. URL http://www.image-net.org/challenges/LSVRC/2010/.
[5] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In Proc. ECCV, pages 430–443, 2012.
[6] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In Proc. BMVC, 2011.
[7] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proc. CVPR, pages 3642–3649, 2012.
[8] A. Coates, A. Y. Ng, and H. Lee. 
An analysis of single-layer networks in unsupervised feature learning. In Proc. AISTATS, 2011.
[9] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1–22, 2004.
[10] A. Gordo, J. A. Rodríguez-Serrano, F. Perronnin, and E. Valveny. Leveraging category-level labels for instance-level image retrieval. In Proc. CVPR, pages 3045–3052, 2012.
[11] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In Proc. ECCV, 2012.
[12] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[13] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In NIPS, pages 487–493, 1998.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[15] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Proc. CVPR, pages 951–958, 2009.
[16] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR, 2006.
[17] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In Proc. ICML, 2012.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[19] D. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, pages 1150–1157, 1999.
[20] F. Perronnin, J. Sánchez, and T. Mensink. 
Improving the Fisher kernel for large-scale image classification. In Proc. ECCV, 2010.
[21] F. Perronnin, Z. Akata, Z. Harchaoui, and C. Schmid. Towards good practice in large-scale learning for image classification. In Proc. CVPR, pages 3482–3489, 2012.
[22] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.
[23] J. Sánchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In Proc. CVPR, 2011.
[24] J. Sánchez, F. Perronnin, and T. Emídio de Campos. Modeling the spatial layout of images beyond spatial pyramids. Pattern Recognition Letters, 33(16):2216–2223, 2012.
[25] P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In International Joint Conference on Neural Networks, pages 2809–2813, 2011.
[26] T. Serre, L. Wolf, and T. Poggio. A new biologically motivated framework for robust object recognition. In Proc. CVPR, 2005.
[27] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. ICML, volume 227, 2007.
[28] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher vector faces in the wild. In Proc. BMVC, 2013.
[29] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. ICCV, volume 2, pages 1470–1477, 2003.
[30] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In Proc. ECCV, pages 776–789, 2010.
[31] J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In Proc. IJCAI, pages 2764–2770, 2011.
[32] S. Yan, X. Xu, D. Xu, S. Lin, and X. Li. Beyond spatial pyramids: A new feature extraction framework with dense spatial sampling for image classification. 
In Proc. ECCV, pages 473–487, 2012.
[33] J. Yang, K. Yu, Y. Gong, and T. S. Huang. Linear spatial pyramid matching using sparse coding for image classification. In Proc. CVPR, pages 1794–1801, 2009.