{"title": "Object based Scene Representations using Fisher Scores of Local Subspace Projections", "book": "Advances in Neural Information Processing Systems", "page_first": 2811, "page_last": 2819, "abstract": "Several works have shown that deep CNN classifiers can be easily transferred across datasets, e.g. the transfer of a CNN trained to recognize objects on ImageNET to an object detector on Pascal VOC. Less clear, however, is the ability of CNNs to transfer knowledge across tasks. A common example of such transfer is the problem of scene classification that should leverage localized object detections to recognize holistic visual concepts. While this problem is currently addressed with Fisher vector representations, these are now shown ineffective for the high-dimensional and highly non-linear features extracted by modern CNNs. It is argued that this is mostly due to the reliance on a model, the Gaussian mixture of diagonal covariances, which has a very limited ability to capture the second order statistics of CNN features. This problem is addressed by the adoption of a better model, the mixture of factor analyzers (MFA), which approximates the non-linear data manifold by a collection of local subspaces. The Fisher score with respect to the MFA (MFA-FS) is derived and proposed as an image representation for holistic image classifiers. Extensive experiments show that the MFA-FS has state of the art performance for object-to-scene transfer and this transfer actually outperforms the training of a scene CNN from a large scene dataset. The two representations are also shown to be complementary, in the sense that their combination outperforms each of the representations by itself. 
When combined, they produce a state of the art scene classifier.", "full_text": "Object based Scene Representations using Fisher\n\nScores of Local Subspace Projections\n\nMandar Dixit and Nuno Vasconcelos\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of California, San Diego\n{mdixit, nvasconcelos}@ucsd.edu\n\nAbstract\n\nSeveral works have shown that deep CNNs can be easily transferred across datasets,\ne.g. the transfer from object recognition on ImageNet to object detection on Pascal\nVOC. Less clear, however, is the ability of CNNs to transfer knowledge across tasks.\nA common example of such transfer is the problem of scene classi\ufb01cation, that\nshould leverage localized object detections to recognize holistic visual concepts.\nWhile this problems is currently addressed with Fisher vector representations, these\nare now shown ineffective for the high-dimensional and highly non-linear features\nextracted by modern CNNs. It is argued that this is mostly due to the reliance on\na model, the Gaussian mixture of diagonal covariances, which has a very limited\nability to capture the second order statistics of CNN features. This problem is\naddressed by the adoption of a better model, the mixture of factor analyzers (MFA),\nwhich approximates the non-linear data manifold by a collection of local sub-spaces.\nThe Fisher score with respect to the MFA (MFA-FS) is derived and proposed as an\nimage representation for holistic image classi\ufb01ers. Extensive experiments show\nthat the MFA-FS has state of the art performance for object-to-scene transfer and\nthis transfer actually outperforms the training of a scene CNN from a large scene\ndataset. The two representations are also shown to be complementary, in the sense\nthat their combination outperforms each of the representations by itself. 
When\ncombined, they produce a state-of-the-art scene classi\ufb01er.\n\n1\n\nIntroduction\n\nIn recent years, convolutional neural networks (CNNs) trained on large scale datasets have achieved\nremarkable performance on traditional vision problems such as image classi\ufb01cation [8, 18, 26], object\ndetection and localization [5, 16] and others. The success of CNNs can be attributed to their ability\nto learn highly discriminative, non-linear, visual transformations with the help of supervised back-\npropagation [9]. Beyond the impressive, sometimes even superhuman, results on certain datasets,\na remarkable property of these classi\ufb01ers is the solution of the dataset bias problem [20] that has\nplagued computer vision for decades. It has now been shown many times that a network trained to\nsolve a task on a certain dataset (e.g. object recognition on ImageNet) can be very easily \ufb01ne-tuned to\nsolve a related problem on another dataset (e.g. object detection on the Pascal VOC or MS-COCO).\nLess clear, however, is the robustness of current CNNs to the problem of task bias, i.e. their ability to\ngeneralize accross tasks. Given the large number of possible vision tasks, it is impossible to train a\nCNN from scratch for each. In fact, it is likely not even feasible to collect the large number of images\nneeded to train effective deep CNNs for every task. Hence, there is a need to investigate the problem\nof task transfer.\nIn this work, we consider a very common class of such problems, where a classi\ufb01er trained on a class\nof instances is to be transferred to a second class of instances, which are loose combinations of the\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\foriginal ones. In particular, we consider the problem where the original instances are objects and\nthe target instances are scene-level concepts that somehow depend on those objects. 
Examples of\nthis problem include the transfer of object classi\ufb01ers to tasks such as scene classi\ufb01cation [6, 11, 2] or\nimage captioning [23]. In all these cases, the goal is to predict holistic scene tags from the scores (or\nfeatures) from an object CNN classi\ufb01er. The dependence of the holistic descriptions on these objects\ncould range from very explicit to very subtle. For example, on the explicit end of the spectrum, an\nimage captioning system could produce sentence such as \u201ca person is sitting on a stool and feeding a\nzebra.\u201d On the other hand, on the subtle end of the spectrum, a scene classi\ufb01cation system would\nleverage the recognition of certain rocks, tree stumps, bushes and a particular lizard species to label\nan image with the tag \u201cJoshua Tree National Park\u201d. While it is obviously possible 1) to collect a large\ndataset of images, and 2) use them to train a CNN to directly solve each of these tasks, this approach\nhas two main limitations. First, it is extremely time consuming. Second, the \u201cdirectly learned\u201d CNN\nwill typically not accommodate explicit relations between the holistic descriptions and the objects in\nthe scene. This has, for example, been documented in the scene classi\ufb01cation literature, where the\nperformance of the best \u201cdirectly learned\u201d CNNs [26], can be substantially improved by fusion with\nobject recognition CNNs [6, 11, 2].\nSo far, the transfer from object CNNs to holistic scene description has been most extensively studied\nin the area of scene classi\ufb01cation, where state of the art results have been obtained with the bag of\nsemantics representation of [2]. This consists of feeding image patches through an object recognition\nCNN, collecting a bag vectors of object recognition scores, and embedding this bag into a \ufb01xed\ndimensional vector space with recourse to a Fisher vector [7]. 
While there are variations of detail,\nall other competitive methods are based on a similar architecture [6, 11]. This observation is, in\nprinciple, applicable to other tasks. For example, the state of the art in image captioning is to use a\nCNN as an image encoder that extracts a feature vector from the image. This feature vector is the fed\nto a natural language decoder (typically an LSTM) that produces sentences. While there has not yet\nbeen an extensive investigation of the best image encoder, it is likely that the best representations for\nscene classi\ufb01cation should also be effective encodings for language generation. For these reasons,\nwe restrict our attention to the scene classifcation problem in the remainder of this work, focusing\non the question of how to address possible limitations of the Fisher vector embedding. We note,\nin particular, that while Fisher vectors have been classically de\ufb01ned using gradients of image log-\nlikelihood with respect to the means and variances of a Gaussian mixture model (GMM) [13], this\nde\ufb01nition has not been applied universally in the CNN transfer context, where variance statistics are\noften disregarded [6, 2].\nIn this work we make several contributions to the use of Fisher vector type of representations for\nobject to scene transfer. The \ufb01rst is to show that, for object recognition scores produced by a CNN [2],\nvariance statistics are much less informative of scene class distributions than the mean gradients, and\ncan even degrade scene classi\ufb01cation performance. We then argue that this is due to the inability of the\nstandard GMM of diagonal covariances to provide a good approximation to the non-linear manifold\nof CNN responses. This leads to the adoption of a richer generative model, the mixture of factor\nanalyzers (MFA) [4, 22], which locally approximates the scene class manifold by low-dimensional\nlinear spaces. 
Our second contribution is to show that, by locally projecting the feature data into\nthese spaces, the MFA can ef\ufb01ciently model its local covariance structure. For this, we derive the\nFisher score of the MFA model, denoted the MFA Fisher score (MFA-FS), a representation similar to\nthe GMM Fisher vector of [13, 17]. We show that, for high dimensional CNN features, the MFA-FS\ncaptures highly discriminative covariance statistics, which were previously unavailable in [6, 2],\nproducing signi\ufb01cantly improved scene classi\ufb01cation over the conventional GMM Fisher vector. The\nthird contribution is a detailed experimental investigation of the MFA-FS. Since this can be seen\nas a second order pooling mechanism, we compare it to a number of recent methods for second\norder pooling of CNN features [21, 3]. Although these methods describe global covariance structure,\nthey lack the ability of the MFA-FS to capture that information along locally linear approximations\nof the highly non-linear CNN feature manifold. This is shown to be important, as the MFA-FS is\nshown to outperform all these representations by non-trivial margins. Finally, we show that the\nMFA-FS enables effective task transfer, by showing that MFA-FS vectors extracted from deep CNNs\ntrained for ImageNet object recognition [8, 18], achieve state-of-the-art results on challenging scene\nrecognition benchmarks, such as SUN [25] and MIT Indoor Scenes [14].\n\n2\n\n\f2 Fisher scores\nIn computer vision, an image I is frequently interpreted as a set of descriptors D = {x1, . . . , xn}\nsampled from some generative model p(x; \u03b8). Since most classi\ufb01ers require \ufb01xed-length inputs, it\nis common to map the set D into a \ufb01xed-length vector. A popular mapping consists of computing\n\u2202\u03b8 log p(D; \u03b8) for a model \u03b8b. This\nthe gradient (with respect to \u03b8) of the log-likelihood \u2207\u03b8L(\u03b8) = \u2202\nis known as the Fisher score of \u03b8. 
This gradient vector is often normalized by the square root\nof the Fisher information matrix F, according to F\u2212 1\n2\u2207\u03b8L(\u03b8). This is referred to as the Fisher\nvector (FV) [7] representation of I. While the Fisher vector is frequently used with a Gaussian\nmixture model (GMM) [13, 17], any generative model p(x; \u03b8) can be used. However, the information\nmatrix is not always easy to compute. When this is case, it is common to rely on the simpler\nrepresentation of I by the score \u2207\u03b8L(\u03b8). This is, for example, the case with the sparse coded gradient\nvectors in [11]. We next show that, for models with hidden variables, the Fisher score can be obtained\ntrivially from the steps of the expectation maximization (EM) algorithm commonly used to learn\nsuch models.\n\n2.1 Fisher Scores from EM\n\nConsider the log-likelihood of D under a latent-variable model log p(D; \u03b8) = log(cid:82) p(D, z; \u03b8)dz of\n\nhidden variable z. Since the left-hand side does not depend on the hidden variable, this can be written\nin an alternate form, which is widely used in the EM literature,\n\nlog p(D; \u03b8) =\n\nq(z) log p(D, z; \u03b8)dz \u2212\n\nq(z) log q(z)dz +\n\nq(z) log\n\nq(z)\n\np(z|D; \u03b8)\n\ndz\n\n= Q(q; \u03b8) + H(q) + KL(q||p; \u03b8)\n\n(1)\nwhere Q(q; \u03b8) is the \u201cQ\u201d function, q(z) a general probability distribution, H(q) its differential entropy\nand KL(q||p; \u03b8) the Kullback Liebler divergence between the posterior p(z|D; \u03b8) and q(z). 
Hence,\n\n(cid:90)\n\n(cid:90)\n\nlog p(D; \u03b8) =\n\n\u2202\n\u2202\u03b8\n\nwhere\n\n\u2202\n\u2202\u03b8\n\nQ(q; \u03b8) +\n\nKL(q||p; \u03b8)\n\n\u2202\n\u2202\u03b8\n\n(2)\n\nKL(q||p; \u03b8) = \u2212\n\n(3)\nIn each iteration of the EM algorithm the q distribution is chosen as q(z) = p(z|D; \u03b8b), where \u03b8b is a\nreference parameter vector (the parameter estimates from the previous EM iteration) and\n\np(z|D; \u03b8)\n\nq(z)\n\np(z|D; \u03b8)dz.\n\n\u2202\n\u2202\u03b8\n\n\u2202\n\u2202\u03b8\n\nQ(q; \u03b8) =\n\np(z|D; \u03b8b) log p(D, z; \u03b8)dz = Ez|D;\u03b8b [log p(D, z; \u03b8)].\n\n(cid:12)(cid:12)(cid:12)(cid:12)\u03b8=\u03b8b\n\nIt follows that\nKL(q||p; \u03b8)\n\n\u2202\n\u2202\u03b8\n\nand\n\n(cid:90)\n\np(z|D; \u03b8b)\n\n(cid:90) p(z|D; \u03b8b)\n(cid:12)(cid:12)(cid:12)(cid:12)\u03b8=\u03b8b\n\n(cid:90)\n\n(cid:90)\n\n3\n\n= \u2212\n\np(z|D; \u03b8)\n\n\u2202\n\u2202\u03b8\n\ndz = \u2212 \u2202\n\u2202\u03b8\n\np(z|D; \u03b8)\n\n(4)\n\ndz = 0\n\n(cid:12)(cid:12)(cid:12)(cid:12)\u03b8=\u03b8b\n\n(cid:12)(cid:12)(cid:12)(cid:12)\u03b8=\u03b8b\n\n(cid:90)\n(cid:12)(cid:12)(cid:12)(cid:12)\u03b8=\u03b8b\n\nlog p(D; \u03b8)\n\n\u2202\n\u2202\u03b8\n\nQ(p(z|D; \u03b8b); \u03b8)\n\n\u2202\n\u2202\u03b8\n\n.\n\n=\n\n(5)\nIn summary, the Fisher score \u2207\u03b8L(\u03b8)|{\u03b8=\u03b8b} of background model \u03b8b is the gradient of the Q-function\nof EM evaluated at reference model \u03b8b. The computation of the score thus simpli\ufb01es into the two\nsteps of EM. First, the E step computes the Q function Q(p(z|x; \u03b8b); \u03b8) at the reference \u03b8b. Second,\nthe M-step evaluates the gradient of the Q function with respect to \u03b8 at \u03b8 = \u03b8b. This interpretation\nof the Fisher score is particularly helpful when ef\ufb01cient implementations of the EM algorithm are\navailable, e.g. the recursive Baum-Welch computations commonly used to learn hidden Markov\nmodels [15]. 
For more tractable distributions, such as the GMM, it enables the simple reuse of the\nEM equations, which are always required to learn the reference model \u03b8b, to compute the Fisher\nscore.\n\n\fQ(p(z|D; \u03b8b); \u03b8) =\n\n(cid:104)(cid:88)\n\n(cid:88)\n(cid:88)\n\ni\n\nEzi|xi;\u03b8b\nhik log p(xi|zi = k; \u03b8)wk\n\nk\n\nI(zi, k) log p(xi, k; \u03b8)\n\n(cid:105)\n\n2.2 Bag of features\n\nFisher scores are usually combined with the bag-of-features representation, where an image is\ndescribed as an orderless collection of localized descriptors D = {x1, x2, . . . xn}. These were tradi-\ntionally SIFT descriptors, but have more recently been replaced with responses of object recognition\nCNNs [6, 1, 2]. In this work we use the semantic features proposed in [2], which are obtained by\ntransforming softmax probability vectors pi, obtained for image patches, into their natural parameter\nform. These features were shown to perform better than activations of other CNN layers [2].\n\n2.3 Gaussian Mixture Fisher Vectors\n\nA GMM is a model with a discrete hidden variable that determines the mixture component which\nexplains the observed data. The generative process is as follows. A mixture component zi is \ufb01rst\nsampled from a multinomial distribution p(z = k) = wk. An observation xi is then sampled from\nthe Gaussian component p(x|z = k) \u223c G(x, \u00b5k, \u03c3k) of mean \u00b5k and variance \u03c3k. Both the hidden\nand observed variables are sampled independently, and the Q function simpli\ufb01es to\n\ni,k\n\n=\n\n(6)\nwhere I(.) is the indicator function and hik is the posterior probability p(k|xi; \u03b8b). The probability\nvectors hi are the only quantities computed in the E-step.\nIn the Fisher vector literature [13, 17], the GMM is assumed to have diagonal covariances. This is\ndenoted as the variance-GMM. 
Substituting the expressions of p(xi|zi = k; \u03b8) and differentiating the\nQ function with respect to parameters \u03b8 = {\u00b5k, \u03c3k} leads to the two components of the Fisher score\n\nL(\u03b8) =\n\ni\n\n(7)\n\nL(\u03b8) =\n\nG\u03c3d\n\nk\n\nG\u00b5d\n\nk\n\n\u2202\n\u2202\u03c3d\nk\n\n(I) =\n\n(I) =\n\n\u2202\n\u2202\u00b5d\nk\n\n(cid:88)\n\np(k|xi)\n\np(k|xi)\n\ni \u2212 \u00b5d\n(\u03c3d\nk)2\ni \u2212 \u00b5d\nk)2\n\u2212 1\n(8)\n(\u03c3d\n\u03c3d\nk)3\nk\nThese quantities are evaluated using a reference model \u03b8b = {\u00b5b\nk} learned (with EM) from all\nk, \u03c3b\ntraining data. To compute the Fisher vectors, scores in (7) and (8) are often scaled by an approximate\nFisher information matrix, as detailed in [17]. When used with SIFT descriptors, these mean\nand variance scores usually capture complimentary discriminative information, useful for image\nclassi\ufb01cation [13]. Yet, FVs computed from CNN features only use the mean gradients similar\nto (7), ignoring second-order statistics [6, 2]. In the experimental section, we show that the variance\nstatistics of CNN features perform poorly compared to the mean gradients. This is perhaps due to the\ninability of the variance-GMM to accurately model data in high dimensions. We test this hypothesis\nby considering a model better suited for this task.\n\n(cid:21)\n\n.\n\n(cid:19)\n\nk\n\n(cid:88)\n\ni\n\n(cid:18) xd\n(cid:20) (xd\n\n2.4 Fisher Scores for the Mixture of Factor Analyzers\n\nA factor analyzer (FA) is a type of a Gaussian distribution that models high dimensional observations\nx \u2208 RD in terms of latent variables or \u201cfactors\u201d z \u2208 RR de\ufb01ned on a low-dimensional subspace\nR << D [4]. The process can be written as x = \u039bz + \u0001, where \u039b is known as the factor loading\nmatrix and \u0001 models the additive noise in dimensions of x. 
Factors z are assumed distributed as\nG(z, 0, I) and the noise is assumed to be G(\u0001, 0, \u03c8), where \u03c8 is a diagonal matrix. It can be shown\nthat x has full covariance S = \u039b\u039bT + \u03c8, making the FA better suited for high dimensional modeling\nthan a Gaussian of diagonal covariance.\nA mixture of factor analyzers (MFA) is an extension of the FA that allows a piece-wise linear\napproximation of a non-linear data manifold. Unlike the GMM, it has two hidden variables: a discrete\nvariable s, p(s = k) = wk, which determines the mixture assignments and a continuous latent\nvariable z \u2208 RR, p(z|s = k) = G(z, 0, I), which is a low dimensional projection of the observation\nvariable x \u2208 RD, p(x|z, s = k) = G(x, \u039bkz + \u00b5k, \u03c8). Hence, the kth MFA component is a FA of\nmean \u00b5k and subspace de\ufb01ned by \u039bk. Overall, the MFA components approximate the distribution of\n\n4\n\n\fthe observations x by a set of sub-spaces in observation space. The Q function is\n\n(cid:88)\n\nQ(p(s, z|D; \u03b8b); \u03b8) =\n\n(cid:88)\n\n(cid:104)(cid:88)\n\nk\n\nEzi,si|xi;\u03b8b\n\ni\n\nI(si, k) log p(xi, zi, si = k; \u03b8)\n\n(cid:105)\n\n(9)\n\n(10)\n\n(11)\n(12)\n\n=\n\nhikEzi|xi;\u03b8b [log G(xi, \u039bkzi + \u00b5k, \u03c8) + log G(zi, 0, I) + log wk] .\nwhere hik = p(si = k|xi; \u03b8b). After some simpli\ufb01cations, the E step reduces to computing\n\ni,k\n\nhik = p(k|xi; \u03b8b) \u221d wb\n\nkN (xi, \u00b5b\n\nk, Sb\nk)\n\nEzi|xi;\u03b8b [zi] = \u03b2b\n\nEzi|xi;\u03b8b [zizT\n\ni ] =\n\n(cid:16)\nk(xi \u2212 \u00b5b\nk)\nI \u2212 \u03b2b\nk + \u03b2b\nk\u039bb\n\n(cid:17)\n\n(cid:0)Sb\n\nk\n\nk(xi \u2212 \u00b5b\n\n(cid:1)\u22121. The M-step then evaluates the Fisher score of\n\nk)T \u03b2bT\n\n(13)\n\nk\n\nwith Sb\n\u03b8 = {\u00b5b\n\nk\n\nk = \u039bbT\n\nk + \u03c8b and \u03b2b\n\nk\u039bbT\nk}. 
With some algebraic manipulations, this can be shown to have components\n(cid:105)\n\np(k|xi; \u03b8b)\u03c8b\u22121(cid:0)I \u2212 \u039bb\n\n(cid:1)(cid:0)xi \u2212 \u00b5b\n(cid:104)\n\n(cid:88)\n(cid:88)\n\nk = \u039bb\nk, \u039bb\nG\u00b5k (I) =\nG\u039bk (I) =\n\np(k|xi; \u03b8b)\u03c8b\u22121\n\nk)T \u03b2bT\n\nk \u2212 \u039bb\n\nk\u03b2b\nk\nk \u2212 I)\n\n(xi \u2212 \u00b5b\n\n(\u039bb\n\nk\u03b2b\n\nk\n\ni\n\ni\n\n(14)\n\n(15)\n\n.\n\nk)(xi \u2212 \u00b5b\n(cid:1)\nk)(xi \u2212 \u00b5b\n\nk\n\nFor a detailed discussion of the Q function, the reader is referred to the EM derivation in [4]. Note\nthat the scores with respect to the means are functionally similar to the \ufb01rst order residuals in (7).\nHowever, the scores with respect to the factor loading matrices \u039bk account for covariance statistics of\nthe observations xi, not just variances. We refer to the representations (14) and (15) as MFA Fisher\nscores (MFA-FS). Note that these are not FVs due to the absence of normalization by the Fisher\ninformation, which is more complex to compute than for the variance-GMM.\n\n3 Related work\n\nN (\u039bz, \u03c32I) conditioned on Laplace factors p(z) \u221d(cid:81)\np(x) is intractable, it can be approximated by an evidence lower bound p(x) \u2265 (cid:82) q(z) p(x,z)\n\u039b. [21] proposed an alternative bilinear pooling mechanism(cid:80)\n\nThe most popular approach to transfer object scores (usually from an ImageNet CNN) into a feature\nvector for scene classi\ufb01cation is to rely on FV-style pooling. Although most classi\ufb01ers default to the\nGMM-FV embedding [6, 1, 2, 24], some recent works have explored different encoding [11] and\npooling schemes [21, 3] with promising results. Liu et al. [11] derived an FV like representation from\nsparse coding. Their model can be described as a factor analyzer with Gaussian observations p(x|z) \u223c\nr exp(\u2212|zr|). 
While the sparse FA marginal\nq(z) dz\nderived from a suitable variational posterior q(z). In [11], q is a point posterior \u03b4(z \u2212 z\u2217) and the\nMAP inference simpli\ufb01es into sparse coding. The image representation is obtained using gradients\nof the sparse coding objective evaluated at the MAP factors z\u2217, with respect to the factor loadings\ni . Similar to the MFA-FS, this\ncaptures second order statistics of CNN feature space, albeit globally. Due to its simplicity, this\nmechanism supports \ufb01ne-tuning of the object CNN to scene classes. Gao et al. [3] have recently\nshown that this representation can be compressed with minimal performance loss.\n\ni xixT\n\n4 Experiments\n\nThe MFA-FS was evaluated on the scene classi\ufb01cation problem, using the 67 class MIT Indoor scenes\ndataset [14] and the 397 class MIT SUN dataset [25]. For Indoor scenes, a single training set of 80\nimages per class is provided by the authors. The test set consists of 20 images per class. Results are\nreported as average per class classi\ufb01cation accuracy. The authors of SUN provide multiple train/test\nsplits, each test set containing 50 images per class. Results are reported as mean average per class\nclassi\ufb01cation accuracy over splits. Three object recognition CNNs, pre-trained on ImageNet, were\nused to extract features: the 8 layer network of [8] (denoted as AlexNet) and the deeper 16 and 19\nlayer networks of [18] (denoted VGG-16 and VGG-19, respectively). These CNNs assign 1000\ndimensional object recognition probabilities to P \u00d7 P patches (sampled on a grid of \ufb01xed spacing)\nof the scene images, with P \u2208 {128, 160, 96}. Patch probability vectors were converted into their\nnatural parameter form and PCA-reduced to 500 dimensions as in [2]. 
Each image was mapped into a\n\n5\n\n\fTable 1: Classi\ufb01cation accu-\nracy (K = 50, R = 10).\n\nMIT Indoor\n\nGMM FV (\u00b5)\nGMM FV (\u03c3)\nMFA FS (\u00b5)\nMFA FS (\u039b)\n\n66.08\n53.86\n67.68\n71.11\n\nSUN\nGMM FV (\u00b5)\nGMM FV (\u03c3)\nMFA FS (\u00b5)\nMFA FS (\u039b)\n\n50.01\n37.71\n51.43\n53.38\n\nFigure 1: Classi\ufb01cation accuracy vs. descriptor size for MFA-FS(\u039b) of K = 50\ncomponents and R factor dimensions and GMM-FV(\u03c3) of K components. Left:\nMIT Indoor. Right: SUN.\n\nGMM-FV [13] using a background GMM, and an MFA-FS, using (14), (15) and a background MFA.\nAs usual in the FV literature, these vectors were power normalized, L2 normalized, and classi\ufb01ed\nwith a cross-validated linear SVM. These classi\ufb01ers were compared to scene CNNs, trained on the\nlarge scale Places dataset. In this case, the features from the penultimate CNN layer were used as a\nholistic scene representation and classi\ufb01ed with a linear SVM, as in [26]. We used the places CNNs\nwith the AlexNet and VGG-16 architectures provided by the authors.\n\n4.1\n\nImpact of Covariance Modeling\n\nWe begin with an experiment to compare the modeling power of MFAs to variance-GMMs. This was\nbased on AlexNet features, mixtures of K = 50 components, and an MFA latent space dimension of\nR = 10. Table 1 presents the classi\ufb01cation accuracy of a GMM-FV that only considers the mean\n- GMM-FV(\u00b5) - or variance - GMM-FV(\u03c3) - parameters and a MFA-FS that only considers the\nmean - MFA-FS(\u00b5) - or covariance - MFA-FS(\u039b) - parameters. The most interesting observation\nis the complete failure of the GMM-FV (\u03c3), which under-performs the GMM-FV(\u00b5) by more than\n10%. The difference between the two components of the GMM-FV is not as startling for lower\ndimensional SIFT features [13]. However, for CNN features, the discriminative power of variance\nstatistics is exceptionally low. 
This explains why previous FV representations for CNNs [6, 2] only\nconsider gradients with respect to the means. A second observation of importance is that the improved\nmodeling of covariances by the MFA eliminates this problem. In fact, MFA-FS(\u039b) is signi\ufb01cantly\nbetter than both GMM-FVs. It could be argued that a fair comparison requires an increase in the\nGMM modeling capacity. Fig. 1 tests this hypothesis by comparing GMM-FVs(\u03c3) and MFA-FS\n(\u039b) for various numbers of GMM components (K \u2208 {50, . . . , 500}) and MFA hidden sub-spaces\ndimensions (R \u2208 {1, . . . , 10}). For comparable vector dimensions, the covariance based scores\nalways signi\ufb01cantly outperforms the variance statistics on both datasets. A \ufb01nal observation is that,\ndue to covariance modeling in MFAs, the MFA-FS(\u00b5) performs better the GMM-FV(\u00b5). The \ufb01rst\norder residuals pooled to obtain the MFA-FS(\u00b5) (14) are scaled by covariance matrices instead of\nvariances. This local de-correlation provides a non-trivial improvement for the MFA-FS(\u00b5) over the\nGMM-FV(\u00b5)(\u223c 1.5% points). Covariance modeling was previously used in [19] to obtain FVs w.r.t.\nGaussian means and local subspace variances (eigen-values of covariance). Their subspace variance\nFV, derived with our MFAs, performs much better than the variance GMM-FV (\u03c3), due to a better\nunderlying model (60.7% v 53.86% on Indoor). 
It is, however, still inferior to the MFA-FS(\u039b) which\ncaptures full covariance within local subspaces.\nWhile a combination of the MFA-FS(\u00b5) and MFA-FS(\u039b) produces a small improvement (\u223c 1%), we\nrestrict to using the latter in the remainder of this work.\n\n4.2 Multi-scale learning and Deep CNNs\n\nRecent works have demonstrated value in combining deep CNN features extracted at multiple-scales.\nTable 2 presents the classi\ufb01cation accuracies of the MFA-FS (\u039b) based on AlexNet, and 16 and 19\nlayer VGG features extracted from 96x96, 128x128 and 160x160 pixel image patches, as well as\ntheir concatenation (3 scales), as suggested by [2]. These results con\ufb01rm the bene\ufb01ts of multi-scale\nfeature combination, which achieves the best performance for all CNNs and datasets.\n\n6\n\n00.511.522.53x 1050.40.450.50.550.60.650.70.750.8Descriptor DimensionsAccuracy MFA Grad (Factor Loading)GMM FV (Variance)K = 200K = 300K = 400K = 500K = 50R = 2K = 50R = 5K = 50R = 10K = 50R = 1K = 50K = 10000.511.522.53x 1050.20.250.30.350.40.450.50.55Descriptor DimensionsAccuracy MFA Grad (Factor Loading)GMM FV (Variance)K = 50R = 2K = 50R = 1K = 50R = 5K = 50R = 10K = 500K = 400K = 300K = 200K = 100K = 50\fTable 2: MFA-FS classi\ufb01cation ac-\ncuracy as a function of patch scale.\n\nTable 3: Performance of scene classi\ufb01cation methods.\ncombination of patch scales (128, 96, 160).\n\n*-\n\nMIT\nIndoor\nAlexNet\n69.83\n71.11\n70.51\n73.58\nVGG-16\n77.26\n77.28\n79.57\n80.1\nVGG-19\n77.21\n79.39\n79.9\n81.43\n\n160x160\n128x128\n96x96\n3 scales\n\n160x160\n128x128\n96x96\n3 scales\n\n160x160\n128x128\n96x96\n3 scales\n\nSUN\n\n52.36\n53.38\n53.54\n55.95\n\n59.77\n60.99\n61.71\n63.31\n\n-\n-\n-\n-\n\nMethod\n\nMFA-FS + Places (VGG)\n\nMFA-FS + Places (AlexNet)\n\nMFA-FS (VGG)\n\nMFA-FS (AlexNet)\nFull BN (VGG) [3]\n\nCompact BN (VGG) [3]\nH-Sparse (VGG) [12]\n\nSparse Coding (VGG) [12]\n\nSparse Coding (AlexNet) [11]\n\nMetaClass (AlexNet) + Places 
[24]\nFV (AlexNet)(4 scales) + Places [2]\nFV (AlexNet)(3 scales) + Places [2]\n\nFV (AlexNet) (4 scales) [2]\nFV (Alexnet)(3 scales) [2]\n\nVLAD (AlexNet) [6]\nFV+FC (VGG) [1]\n\nMid Level [10]\n\nMIT Indoor\n\n87.23\n79.86\n81.43\n73.58\n77.55\n76.17\n79.5\n77.6\n68.2\n78.9\n79.0\n78.5\u2217\n72.86\n71.24\n68.88\n81.0\n70.46\n\nSUN\n71.06\n63.16\n63.31\n55.95\n\n-\n-\n-\n-\n\n58.11\n61.72\n\n54.4\n53.0\n51.98\n\n-\n\n-\n-\n\n4.3 Comparison with ImageNet based Classi\ufb01ers\n\nWe next compared the MFA-FS to state of the art scene classi\ufb01ers also based on transfer from\nImageNet CNN features [11, 1\u20133]. Since all these methods only report results for MIT Indoor, we\nlimited the comparison to this dataset, with the results of Table 4. The GMM-FV of [2] operates\non AlexNet CNN semantics extracted from image patches of multiple sizes (96, 128, 160, 80). The\nFV in [1] is computed using convolutional features from AlexNet or VGG-16 extracted in a large\nmulti-scale setting. Liu et al. proposed a gradient representation based on sparse codes. Their initial\nresults were reported on a single patch scale of 128x128 using AlexNet features [11]. More recently,\nthey have proposed an improved H-Sparse representation, combined multiple patch scales and used\nVGG features in [12]. The recently proposed bilinear (BN) descriptor pooling of [21] is similar to\nthe MFA-FS in the sense that it captures global second order descriptor statistics. The simplicity of\nthese descriptors enables the \ufb01ne-tuning of the CNN layers to the scene classi\ufb01cation task. However,\ntheir results, reproduced in [3] for VGG-16 features, are clearly inferior to those of the MFA-FS\nwithout \ufb01ne-tuning. Gao et al. [3] propose a way to compress these bilinear statistics with trainable\ntransformations. 
For a compact image representation of size 8K, their accuracy is inferior to a\nrepresentation of 5K dimensions obtained by combining the MFA-FS with a simple PCA.\nThese experiments show that the MFA-FS is a state of the art procedure for task transfer from object\nrecognition (on ImageNet) to scene classi\ufb01cation (e.g. on MIT Indoor or SUN). Its closest competitor\nis the classi\ufb01er of [1], which combines CNN features in a massive multiscale setting ( 10 image sizes).\nWhile MFA-FS outperforms [1] with only 3 image scales, its performance improves even further with\naddition of more scales (82% with VGG, 4 patch sizes).\n\n4.4 Task transfer performance\n\nThe next question is how object-to-scene transfer compares to the much more intensive process,\npursued by [26], of collecting a large scale labeled dataset and training a deep CNN from it. Their\nscene dataset, known as Places, consists of 2.4M images, from which both AlexNet and VGG Net\nCNNs were trained for scene classi\ufb01cation. The fully connected features from the networks are used\nas scene representations and classi\ufb01ed with linear SVMs on Indoor scenes and SUN. The Places\nCNN features are a direct alternatives to the MFA-FS. While the use of the former is an example of\ndataset transfer (features trained on scenes to classify scenes) the use of the latter is an example of\ntask transfer (features trained on objects to classify scenes).\nA comparison between the two transfer approaches is shown in table 5. Somewhat surprisingly,\ntask transfer with the MFA-FS outperformed dataset transfer with the Places CNN, on both MIT\nIndoors and SUN and for both the AlexNet and VGG architectures. 
This supports the hypothesis that the variability of configurations of most scenes makes scene classification much harder than object recognition, to the point where CNN architectures that have close-to or above human performance for object recognition are much less effective for scenes.\n\nTable 4: Comparison to task transfer methods (ImageNet CNNs) on MIT Indoor.\n\nMethod | 1 scale | mscale\nAlexNet\nMFA-FS | 71.11 | 77.55\nGMM FV [2] | 68.5 | -\nFV+FC [1] | - | -\nSparse Coding [11] | 68.2 | -\nVGG\nMFA-FS | 73.58 | 81.43\nSparse Coding [12] | 72.86 | 77.6\nH-Sparse [12] | 71.6 | 79.5\nBN [3] | - | -\nFV+FC [1] | 79.9 | 81.0\nVGG + dim. reduction\nMFA-FS + PCA (5k) | 79.3 | -\nBN (8k) [3] | 76.17 | -\n\nTable 5: Comparison with the Places trained scene CNNs.\n\nMethod | SUN | Indoor\nAlexNet\nMFA-FS | 55.95 | 73.58\nPlaces | 54.3 | 68.24\nCombined | 63.16 | 79.86\nVGG\nMFA-FS | 63.31 | 81.43\nPlaces | 61.32 | 79.47\nCombined | 71.06 | 87.23\nAlexNet + VGG\nPlaces (VGG + Alex) | 65.91 | 81.29\nMFA-FS (Alex) + Places (VGG) | 68.8 | 85.6\nMFA-FS (VGG) + Places (Alex) | 67.34 | 82.82\n\nIt is, instead, preferable to pool object detections across the scene image, using a pooling mechanism such as the MFA-FS. This observation is in line with an interesting result of [2], showing that the object-based and scene-based representations are complementary, by concatenating ImageNet- and Places-based feature vectors. By replacing the GMM-FV of [2] with the MFA-FS now proposed, we improve upon these results. For both the AlexNet and VGG CNNs, the combination of the ImageNet-based MFA-FS and the Places CNN feature vector outperformed both the MFA-FS and the Places CNN features by themselves, in both SUN and MIT Indoor.
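The combination used in these experiments, concatenating the object-based and scene-based feature vectors and training a linear SVM on the result, can be sketched as follows. This is a hedged, self-contained illustration: the synthetic features, the per-block L2 normalization, and the tiny subgradient SVM trainer are assumptions of the sketch, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for precomputed MFA-FS and Places descriptors
# (real vectors are far higher-dimensional; sizes are illustrative).
n, d_mfa, d_places, n_cls = 150, 64, 32, 3
y = rng.integers(0, n_cls, size=n)
mu_mfa = rng.normal(size=(n_cls, d_mfa))
mu_places = rng.normal(size=(n_cls, d_places))
mfa = 2.0 * mu_mfa[y] + rng.normal(size=(n, d_mfa))
places = 2.0 * mu_places[y] + rng.normal(size=(n, d_places))

def l2n(v):
    return v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)

# Combined representation: L2-normalize each block, then concatenate.
# (The per-block normalization is an assumption of this sketch.)
x = np.hstack([l2n(mfa), l2n(places)])

def ovr_linear_svm(x, y, n_cls, lam=1e-3, lr=0.5, epochs=300):
    """One-vs-rest linear SVM trained by subgradient descent on hinge loss."""
    w = np.zeros((n_cls, x.shape[1]))
    for _ in range(epochs):
        for c in range(n_cls):
            t = np.where(y == c, 1.0, -1.0)
            active = t * (x @ w[c]) < 1.0   # margin-violating samples
            grad = lam * w[c] - (t[active, None] * x[active]).sum(0) / len(x)
            w[c] -= lr * grad
    return w

w = ovr_linear_svm(x, y, n_cls)
acc = ((x @ w.T).argmax(1) == y).mean()
print(f"training accuracy of combined classifier: {acc:.2f}")
```

The same trainer applied to either block alone reproduces the single-representation baselines, so the complementarity claim amounts to comparing those accuracies against the concatenated one.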
To the best of our knowledge, no method using these or deeper CNNs has reported better results than the combined MFA-FS and Places VGG features of Table 5.\n\nIt could be argued that this improvement is just an effect of the often observed benefits of fusing different classifiers. Many works even resort to \u201cbagging\u201d of multiple CNNs to achieve performance improvements [18]. To test this hypothesis, we also implemented a classifier that combines two Places CNNs with the AlexNet and VGG architectures. This is shown as Places (VGG + Alex) in the last section of Table 5. While improving on the performance of both the MFA-FS and Places, its performance is not as good as that of the combination of the object-based and scene-based representations (MFA-FS + Places). As shown in the remainder of the last section of the table, any combination of an object CNN with MFA-FS based transfer and a scene CNN outperforms this classifier.\n\nFinally, Table 3 compares results to the best recent scene classification methods in the literature. This comparison shows that the MFA-FS + Places combination is a state-of-the-art classifier, with substantial gains over all other proposals. The results of 71.06% on SUN and 87.23% on MIT Indoor substantially outperform the previous best results of 61.7% and 81.7%, respectively.\n\n5 Conclusion\n\nIt is now well established that deep CNNs can be transferred across datasets that address similar tasks. It is less clear, however, whether they are robust to transfer across tasks. In this work, we have considered a class of problems that involve this type of transfer, namely problems that benefit from transferring object detections into holistic scene-level inference, e.g. scene classification. While such problems have been addressed with FV-like representations in the past, we have shown that these are not very effective for the high-dimensional CNN features.
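The limitation of FV-like representations discussed here comes down to the covariance structure each generative model can express. For reference, in the standard mixture-of-factor-analyzers notation of Ghahramani and Hinton [4] (this block is a reader aid, not an equation reproduced from the paper):

```latex
% variance-GMM: diagonal covariances only
p(x) = \sum_k \pi_k \, \mathcal{N}\!\left(x;\ \mu_k,\ \mathrm{diag}(\sigma_k^2)\right)

% MFA: each component is a local subspace, with low-rank-plus-noise covariance
p(x) = \sum_k \pi_k \, \mathcal{N}\!\left(x;\ \mu_k,\ \Lambda_k \Lambda_k^\top + \Psi\right),
\qquad \Lambda_k \in \mathbb{R}^{D \times R},\ R \ll D
```

The factor loading matrices \(\Lambda_k\) span the local subspaces that approximate the non-linear CNN feature manifold, while the diagonal \(\Psi\) models residual noise; a diagonal-covariance GMM can represent neither the subspace orientations nor the resulting second order statistics.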
The reason is their reliance on a model, the variance-GMM, which has very limited flexibility. We have addressed this problem by adopting a better model, the MFA, which approximates the non-linear data manifold by a set of local sub-spaces. We then introduced the Fisher score with respect to this model, denoted the MFA-FS. Through extensive experiments, we have shown that the MFA-FS has state of the art performance for object-to-scene transfer, and that this transfer actually outperforms a scene CNN trained on a large scene dataset. These results are significant given that 1) training an MFA takes only a few hours, a fraction of the cost of training a CNN, and 2) the transfer requires a much smaller scene dataset.\n\nReferences\n\n[1] M. Cimpoi, S. Maji, I. Kokkinos, and A. Vedaldi. Deep filter banks for texture recognition, description, and segmentation. International Journal of Computer Vision, 2015.\n\n[2] Mandar Dixit, Si Chen, Dashan Gao, Nikhil Rasiwasia, and Nuno Vasconcelos. Scene classification with semantic Fisher vectors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.\n\n[3] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. CoRR, abs/1511.06062, 2015.\n\n[4] Zoubin Ghahramani and Geoffrey E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical report, 1997.\n\n[5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.\n\n[6] Yunchao Gong, Liwei Wang, Ruiqi Guo, and Svetlana Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Computer Vision - ECCV 2014, volume 8695, pages 392-407, 2014.\n\n[7] Tommi S. Jaakkola and David Haussler. Exploiting generative models in discriminative classifiers.
In Advances in Neural Information Processing Systems 11, pages 487-493, 1999.\n\n[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105, 2012.\n\n[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.\n\n[10] Yao Li, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. Mid-level deep pattern mining. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.\n\n[11] Lingqiao Liu, Chunhua Shen, Lei Wang, Anton van den Hengel, and Chao Wang. Encoding high dimensional local features by sparse coding based Fisher vectors. In Advances in Neural Information Processing Systems 27, pages 1143-1151, 2014.\n\n[12] Lingqiao Liu, Peng Wang, Chunhua Shen, Lei Wang, Anton van den Hengel, Chao Wang, and Heng Tao Shen. Compositional model based Fisher vector coding for image classification. CoRR, abs/1601.04143, 2016.\n\n[13] Florent Perronnin, Jorge S\u00e1nchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV'10, pages 143-156, 2010.\n\n[14] A. Quattoni and A. Torralba. Recognizing indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 413-420, 2009.\n\n[15] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, February 1989.\n\n[16] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015.\n\n[17] Jorge S\u00e1nchez, Florent Perronnin, Thomas Mensink, and Jakob J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222-245, 2013.\n\n[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.\n\n[19] Masayuki Tanaka, Akihiko Torii, and Masatoshi Okutomi. Fisher vector based on full-covariance Gaussian mixture model. IPSJ Transactions on Computer Vision and Applications, 5:50-54, 2013.\n\n[20] Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. In CVPR'11, June 2011.\n\n[21] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. In International Conference on Computer Vision (ICCV), 2015.\n\n[22] Jakob Verbeek. Learning nonlinear image manifolds by global alignment of local linear models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1236-1250, August 2006.\n\n[23] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.\n\n[24] Ruobing Wu, Baoyuan Wang, Wenping Wang, and Yizhou Yu. Harvesting discriminative meta objects with deep CNN features for scene classification. In The IEEE International Conference on Computer Vision (ICCV), December 2015.\n\n[25] Jianxiong Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485-3492, 2010.\n\n[26] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition using Places Database.
NIPS, 2014.", "award": [], "sourceid": 1423, "authors": [{"given_name": "Mandar", "family_name": "Dixit", "institution": "UC San Diego"}, {"given_name": "Nuno", "family_name": "Vasconcelos", "institution": "UC San Diego"}]}