{"title": "PiCoDes: Learning a Compact Code for Novel-Category Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 2088, "page_last": 2096, "abstract": "We introduce PiCoDes: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. PiCoDes of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.", "full_text": "PICODES: Learning a Compact Code for\n\nNovel-Category Recognition\n\nAlessandro Bergamo, Lorenzo Torresani\n\nDartmouth College\nHanover, NH, U.S.A.\n\n{aleb, lorenzo}@cs.dartmouth.edu\n\nAndrew Fitzgibbon\nMicrosoft Research\n\nCambridge, United Kingdom\n\nawf@microsoft.com\n\nAbstract\n\nWe introduce PICODES: a very compact image descriptor which nevertheless al-\nlows high performance on object category recognition. 
In particular, we address\nnovel-category recognition: the task of de\ufb01ning indexing structures and image\nrepresentations which enable a large collection of images to be searched for an\nobject category that was not known when the index was built. Instead, the train-\ning images de\ufb01ning the category are supplied at query time. We explicitly learn\ndescriptors of a given length (from as small as 16 bytes per image) which have\ngood object-recognition performance. In contrast to previous work in the domain\nof object recognition, we do not choose an arbitrary intermediate representation,\nbut explicitly learn short codes. In contrast to previous approaches to learn com-\npact codes, we optimize explicitly for (an upper bound on) classi\ufb01cation perfor-\nmance. Optimization directly for binary features is dif\ufb01cult and nonconvex, but\nwe present an alternation scheme and convex upper bound which demonstrate ex-\ncellent performance in practice. PICODES of 256 bytes match the accuracy of the\ncurrent best known classi\ufb01er for the Caltech256 benchmark, but they decrease the\ndatabase storage size by a factor of 100 and speed-up the training and testing of\nnovel classes by orders of magnitude.\n\n1 Introduction\n\nIn this work we consider the problem of ef\ufb01cient object-class recognition in large image collections.\nWe are speci\ufb01cally interested in scenarios where the classes to be recognized are not known in\nadvance. The motivating application is \u201cobject-class search by example\u201d where a user provides\nat query time a small set of training images de\ufb01ning an arbitrary novel category and the system\nmust retrieve from a large database images belonging to this class. 
This application scenario poses challenging requirements on the system design: the object classifier must be learned efficiently at query time from few examples; recognition must have low computational cost with respect to the database size; finally, compact image descriptors must be used to allow storage of large collections in memory rather than on disk for additional efficiency.\n\nTraditional object categorization methods do not meet these requirements as they typically use nonlinear kernels on high-dimensional descriptors, which renders them computationally expensive to train and test, and causes them to occupy large amounts of storage. For example, the LP-\u03b2 multiple kernel combiner [11] achieves state-of-the-art accuracy on several categorization benchmarks but it requires over 23 Kbytes to represent each image and it uses 39 feature-specific nonlinear kernels. This recognition model is impractical for our application because it would require costly query-time kernel evaluations for each image in the database, since the training set varies with every new query and thus pre-calculation of kernel distances is not possible.\n\nWe propose to address these storage and efficiency requirements by learning a compact binary image representation, called PICODES1, optimized to yield good categorization accuracy with linear (i.e., efficient) classifiers.\n\n1 Which we think of as \u201cPicture Codes\u201d or \u201cPico-Descriptors\u201d, or (with Greek pronunciation) \u03c0-codes\n\nFigure 1: Visualization of PICODES. The 128-bit PICODE (whose accuracy on Caltech256 is displayed in figure 3) is applied to the test data of ILSVRC2010. Six of the 128 bits are illustrated as follows: for bit c, all images are sorted by non-binarized classifier outputs a_c^T x and the 10 smallest and largest are presented on each row. Note that a_c is defined only up to sign, so the patterns to which the bits are specialized may appear in either the \u201cpositive\u201d or \u201cnegative\u201d columns.\n\nThe binary entries in our image descriptor are thresholded nonlinear projections of low-level visual features extracted from the image, such as descriptors encoding texture or the appearance of local image patches. Each nonlinear projection can be viewed as implementing a nonlinear classifier using multiple kernels. The intuition is that we can then use these pre-learned multiple kernel combiners as a classification basis to define recognition models for arbitrary novel categories: the final classifier for a novel class is obtained by linearly combining the binary outputs of the basis classifiers, which we can pre-compute for every image in the database, thus enabling efficient novel object-class recognition even in large datasets.\n\nThe search for compact codes for images has been the subject of much recent work, which we loosely divide into \u201cdesigned\u201d and \u201clearned\u201d codes. In the former category we include min-hash [6], VLAD [14], and attributes [10, 18, 17], which are fully-supervised classifiers trained to recognize certain visual properties in the image. A related idea is the representation of images in terms of distances to basis classes. This has been previously investigated as a way to define image similarities [30], to perform video search [12], or to enable natural scene recognition and retrieval [29]. Torresani et al. [27] define a compact image code as a bitvector, the entries of which are the outputs of a large set of weakly-trained basis classifiers (\u201cclassemes\u201d) evaluated on the image. Simple linear classifiers trained on classeme vectors produce near state-of-the-art categorization accuracy. Li et al. 
[19] use the localized outputs of object detectors as an image representation. The advan-\ntage of this representation is that it encodes spatial information; furthermore, object detectors are\nmore robust to clutter and uninformative background than classi\ufb01ers evaluated on the entire image.\nThese prior methods work under the assumption that an \u201covercomplete\u201d representation for classi\ufb01-\ncation can be obtained by pre-learning classi\ufb01ers for a large number of basis classes, some of which\nwill be related to those encountered at test-time. Such high-dimensional representations are then\ncompressed down using quantization, dimensionality reduction or feature selection methods.\n\nThe second strand of related work is the learning of compact codes for images [31, 26, 24, 15,\n22, 8] where binary image codes are learned such that the Hamming distance between codewords\napproximates a kernelized distance between image descriptors, most typically GIST. Autoencoder\nlearning [23], on the other hand, produces a compact code which has good image reconstruction\nproperties, but again is not specialized for category recognition.\n\nAll the above descriptors can produce very compact codes, but few (excepting [27, 19]) have been\nshown to be effective at category-level recognition beyond simpli\ufb01ed problems such as Caltech-\n20 [2] or Caltech-101 [14, 16]. In contrast, we consider Caltech-256 a baseline competence, and\nalso test compact codes on a large-scale class retrieval task using ImageNet [7].\n\nThe goal of this paper then is to learn a compact binary code (as short as 128 bits) which has\ngood object-category recognition accuracy. 
In contrast to previous learning approaches, our training objective is a direct approximation to this goal; while in contrast to previous \u201cdesigned\u201d descriptors, we learn abstract categories (see figure 1) aimed at optimizing classification rather than an arbitrary predefined set of attributes or classemes, and thus achieve increased accuracy for a given code length.\n\n2 Technical approach\n\nWe start by introducing the basic classifier architecture used by state-of-the-art category recognizers, which we want to leverage as effectively as possible to define our image descriptor. Given an image I, a bank of feature descriptors is computed (e.g. SIFT, PHOG, GIST), to yield a feature vector f_I \u2208 R^F (the feature vector used in our implementation has dimensionality F = 17360 and is described in the experimental section). State-of-the-art recognizers use kernel matching between these descriptors to define powerful classifiers, nonlinear in f_I. For example, the LP-\u03b2 classifier of Gehler and Nowozin [11], which has achieved the best results on several benchmarks to date, operates by combining the outputs of nonlinear SVMs trained on individual features. In our work we approximate these nonlinear classifiers by employing the lifting method of Vedaldi and Zisserman [28]: for the family of homogeneous additive kernels K, there exists a finite-dimensional feature map \u03c8 : R^F \u2192 R^{F(2r+1)} such that the nonlinear kernel distance K(f_I, f_I') \u2248 \u27e8\u03c8(f_I), \u03c8(f_I')\u27e9, where r is a small positive integer (in our implementation set to 1).
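As a concrete illustration of such a lifting, the following is a minimal sketch of an approximate explicit feature map for the histogram intersection kernel, in the style of Vedaldi and Zisserman, for approximation order r = 1 (so each input dimension maps to 2r+1 = 3 output dimensions). The sampling period `L` and all names here are our own assumptions, not the authors' implementation.

```python
import numpy as np

def intersection_kernel_map(x, L=0.5):
    """Sketch of an order-1 homogeneous kernel map for the intersection kernel.

    x : (F,) nonnegative histogram-like feature vector.
    L : sampling period of the kernel spectrum (an assumed hyperparameter).
    Returns a (3F,) vector psi such that <psi(x), psi(y)> roughly
    approximates sum_j min(x_j, y_j)."""
    x = np.maximum(np.asarray(x, dtype=float), 1e-12)  # the map needs x > 0
    # Spectrum of the intersection kernel: kappa(lambda) = 2 / (pi * (1 + 4 lambda^2))
    kappa = lambda lam: 2.0 / (np.pi * (1.0 + 4.0 * lam ** 2))
    logx = np.log(x)
    f0 = np.sqrt(x * L * kappa(0.0))                      # constant-frequency term
    f1 = np.sqrt(2.0 * x * L * kappa(L)) * np.cos(L * logx)
    f2 = np.sqrt(2.0 * x * L * kappa(L)) * np.sin(L * logx)
    return np.concatenate([f0, f1, f2])                   # dimensionality 3F
```

Each low-level descriptor is lifted this way, after which inner products in the lifted space stand in for the nonlinear kernel evaluations.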
These explicit feature maps allow us to approximate a nonlinear classifier, such as the LP-\u03b2 kernel combiner, via an efficient linear projection. As described below, we use these nonlinear classifiers, approximated via linear projections, as the basis for learning our features. However, in our case F(2r+1) = 17360 \u00d7 3 = 52080. This dimensionality is too large in practice for our learning. Thus, we apply a linear dimensionality reduction x_I = P\u03c8(f_I), where the projection matrix P is obtained through PCA, so x_I \u2208 R^n for n \u226a F(2r+1). As this procedure is performed identically for every image we consider, we will drop the dependence on I and simply refer to the \u201cimage\u201d x.\n\nFigure 2: The accuracy versus compactness trade-off. The benchmark is Caltech256, using 10 examples per class. The pink cross shows the multiclass categorization accuracy achieved by an LP-\u03b2 classifier using kernel distance; the red triangle is the accuracy of an LP-\u03b2 classifier that uses \u201clifted-up\u201d features to approximate kernel distances; the blue line shows the accuracy of a linear SVM trained on PCA projections of the lifted-up features, as a function of PCA dimension.\n\nA natural question to address is: how much accuracy do we lose due to the kernel approximation and the PCA projection? We answer this question in figure 2, where we compare the multi-class classification accuracies obtained on the Caltech256 data set by the following methods using our low-level descriptors f \u2208 R^17360: an LP-\u03b2 combiner based on exact nonlinear kernel calculations; an LP-\u03b2 combiner using explicit feature maps; and a linear SVM trained on the PCA projections x, as a function of the PCA subspace dimensionality.
We see from this figure that the explicit maps degrade the accuracy only slightly, which is consistent with the results reported in [28]. However, the linear SVM produces slightly inferior accuracy even when applied to the full 52,080-dimensional feature vectors. The key difference between the linear SVM and the LP-\u03b2 classifier is that the former defines a classifier in the joint space of all 13 features, while the latter first trains a separate classifier for each feature, and then learns a linear combination of them. The results in our figure suggest that the two-step procedure of LP-\u03b2 provides a form of beneficial regularization, a fact first noted in [11]. For our feature learning algorithm, we chose to use a PCA subspace of dimensionality n = 6415 since, as suggested by the plot, this setting gives a good tradeoff between compact dimensionality and good recognition accuracy.\n\nTorresani et al. [27] have shown that an effective image descriptor for categorization can be built by collecting in a vector the thresholded outputs of a large set of nonlinear classifiers evaluated on the image. This \u201cclasseme\u201d descriptor can produce recognition accuracies within 10% of the state of the art for novel classes even with simple linear classification models. Using our formulation based on explicit feature maps, we can approximately express each classeme entry (which in [27] is implemented as an LP-\u03b2 classifier) as the output of a linear classifier\n\nh(x; a_c) = 1[a_c^T x > 0]    (1)\n\nwhere 1[.] is the 0-1 indicator function of its boolean argument and x is the PCA-projection of \u03c8(f), with a 1 appended to it to avoid dealing with an explicit bias coefficient, i.e., x = [P\u03c8(f); 1].\n\nIf we were to follow the approach of Torresani et al. 
[27], we would collect C training categories, and learn the parameters a_c for each class from offline training data using some standard training objective such as hinge loss. We gather the parameters into an n \u00d7 C matrix\n\nA = [a_1 | ... | a_C].\n\nThen, for image x, the \u201cclasseme\u201d descriptor h(x) is computed as the concatenation of the outputs of the classifiers learned for the training categories:\n\nh(x; A) = [h(x; a_1); ...; h(x; a_C)] \u2208 {0, 1}^C    (2)\n\nThe PICODE descriptor is also of this form. However, the key difference with respect to [27] lies in our training procedure, and in the fact that the dimensionality C is no longer restricted to be the same as the number of training classes.\n\nTo emphasize once more the contributions of this paper, let us review the shortcomings of existing attribute- and classifier-based descriptors, which we overcome in this work:\n\n\u2022 Prior work used attributes learned disjointly from one another, which \u201cjust so happen\u201d to work well as features for classification, without theoretical justification for their use in subsequent classification. Given that we want to use attributes as features for linear classification, we propose to formalize, as our learning objective, the requirement that linear combinations of such attributes yield good accuracy.\n\n\u2022 Unlike the attribute or classeme approach, our method decouples the number of training classes from the target dimensionality of the binary descriptor. 
We can optimize our features for any arbitrary desired length, thus avoiding a suboptimal feature selection stage.\n\n\u2022 Finally, we directly optimize the learning parameters with respect to binary features, while prior attribute systems binarized the features in a quantization stage after the learning.\n\nWe now introduce a framework to learn the A parameters directly on a classification objective.\n\n2.1 Learning the basis classifiers\n\nWe assume that we are given a set of N training images, with each image coming from one of K training classes. We will continue to let C stand for the dimensionality (i.e., number of bits) of our code. Let D = {(x_i, y_i)}_{i=1}^{N} be the training set for learning the basis classifiers, where x_i is the i-th image example (represented by its n-dimensional PCA projection augmented with a constant entry set to 1) and y_i \u2208 {\u22121, +1}^K is a vector encoding the category label out of K possible classes: y_ik = +1 iff the i-th example belongs to class k.\n\nWe then define our c-th basis classifier to be a boolean function of the form (1), a thresholded nonlinear projection of the original low-level features f, parameterized by a_c \u2208 R^n. We then optimize these parameters so that linear combinations of these basis classifiers yield good categorization accuracy on D. 
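Given a learned parameter matrix A, extracting the descriptor of eq. 2 amounts to C inner products and a threshold. The sketch below (our own naming, not the authors' code) shows the computation and how the C bits can be packed into C/8 bytes for storage.

```python
import numpy as np

def picodes_descriptor(x, A):
    """Binary descriptor h(x; A) = [1[a_c^T x > 0]]_{c=1..C} (eq. 2).

    x : (n,) lifted-and-PCA-projected features, with a trailing 1 acting as bias.
    A : (n, C) matrix of basis-classifier parameters [a_1 | ... | a_C]."""
    return (A.T @ x > 0).astype(np.uint8)  # C bits in {0, 1}

def pack(bits):
    """Pack the C bits into C/8 bytes for compact in-memory storage."""
    return np.packbits(bits)
```

With C = 2048 this yields the 256-byte codes used in the experiments; a novel-class classifier then only needs a dot product between its weight vector and these bits.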
The learning objective introduces auxiliary variables (w_k, b_k) for each training class, which parameterize the linear classifier for that training class, operating on the PICODE representation of the training examples, and the objective for A simply minimizes over these auxiliaries:\n\nE(A) = min_{w_{1..K}, b_{1..K}} E(A, w_{1..K}, b_{1..K})    (3)\n\nSolving for A then amounts to simultaneous optimization over all variables of the following learning objective, which trades off a small classification error against a large margin when using the output bits of the basis classifiers as features in a one-versus-all linear SVM:\n\nE(A, w_{1..K}, b_{1..K}) = \sum_{k=1}^{K} { (1/2) ||w_k||^2 + (\u03bb/N) \sum_{i=1}^{N} \u2113[ y_{ik} (b_k + w_k^T h(x_i; A)) ] }    (4)\n\nwhere \u2113[\u00b7] is the traditional hinge loss function. Expanding, we get\n\nE(A, w_{1..K}, b_{1..K}) = \sum_{k=1}^{K} { (1/2) ||w_k||^2 + (\u03bb/N) \sum_{i=1}^{N} \u2113[ y_{ik} (b_k + \sum_{c=1}^{C} w_{kc} 1[a_c^T x_i > 0]) ] }.\n\nNote that the linear SVM and the basis classifiers are learned jointly using the method described below.\n\n2.2 Optimization\n\nWe propose to minimize this error function by block coordinate descent. We alternate between the two following steps:\n\n1. Learn classifiers. We fix A and optimize the objective with respect to w and b jointly. This optimization is convex and equivalent to traditional linear SVM learning.\n\n2. Learn projectors. Given the current values of w and b, we minimize the objective with respect to A by updating one basis classifier at a time. Let us consider the update of a_c with fixed parameters w_{1..K}, b, a_1, ..., a_{c-1}, a_{c+1}, ..., a_C. It can be seen (Appendix A) that in this case the objective becomes:\n\nE(a_c) = \sum_{i=1}^{N} v_i 1[z_i a_c^T x_i > 0] + const    (5)\n\nwhere z_i \u2208 {\u22121, +1} and v_i \u2208 R+ are known values computed from the fixed parameters. Optimizing the objective in Eq. 5 is equivalent to learning a linear classifier minimizing the sum of weighted misclassifications, where v_i represents the cost of misclassifying example i. Unfortunately, this objective is not convex and it is difficult to optimize. Thus, we replace it with the following convex upper bound defined in terms of the hinge function \u2113:\n\n\u00ca(a_c) = \sum_{i=1}^{N} v_i \u2113(z_i a_c^T x_i)    (6)\n\nThis objective can be globally optimized using an LP solver or software for SVM training. We had success with LIBLINEAR [9], which deals nicely with the large problem sizes we considered.\n\nWe have also experimented with several other optimization methods, including stochastic gradient descent applied to a modified version of our objective where we replaced the binarization function h(x; a_c) = 1[a_c^T x > 0] with the sigmoid function \u03c3(x; a_c) = 1/(1 + exp(\u2212(2/T) a_c^T x)) to relax the problem. After learning, at test time we replaced \u03c3(x; a_c) back with h(x; a_c) to obtain binary descriptors. However, we found that these binary codes performed much worse than those directly learned via the coordinate descent procedure described above.\n\n3 Experiments\n\nWe now describe experimental evaluations carried out over several data sets. In order to allow a fair comparison, we reimplemented the \u201cclasseme descriptor\u201d based on the same set of low-level features and settings described in [27], but using the explicit feature map framework to replace the expensive nonlinear kernel distance computations. The low-level features are: color GIST [21], spatial pyramid of histograms of oriented gradients (PHOG) [4], spatial pyramid of self-similarity descriptors [25], and a histogram of SIFT features [20] quantized using a dictionary of 5000 visual words. 
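The per-bit update of the convex surrogate in eq. 6 can be sketched as follows. This is a minimal stand-in (plain subgradient descent on the weighted hinge objective) for the LIBLINEAR solver the paper actually uses; the function name, learning rate, and iteration count are our own assumptions.

```python
import numpy as np

def update_projector(X, z, v, a0, lr=0.01, iters=200):
    """Minimize the convex surrogate of eq. 6:
        E_hat(a) = sum_i v_i * max(0, 1 + z_i * a^T x_i),
    where the hinge upper-bounds the indicator 1[z_i a^T x_i > 0].

    X  : (N, n) training examples (rows are the PCA-projected features x_i).
    z  : (N,) signs in {-1, +1} derived from the fixed parameters.
    v  : (N,) nonnegative per-example misclassification costs.
    a0 : (n,) initial value of the projector a_c."""
    a = a0.copy()
    for _ in range(iters):
        margins = z * (X @ a)        # z_i * a^T x_i
        active = margins > -1.0      # hinge max(0, 1 + u) has nonzero slope here
        grad = (v * active * z) @ X  # subgradient w.r.t. a
        a -= lr * grad
    return a
```

In the full alternation, this update is applied to one column of A at a time (with z and v recomputed from the current w, b, and the other columns), interleaved with retraining the one-versus-all linear SVMs on the resulting bits.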
Each spatial pyramid level of each descriptor was treated as a separate feature, thus producing a total of 13 low-level features. Each of these features was lifted up to a higher-dimensional space using the explicit feature maps of Vedaldi and Zisserman [28]. We chose the mapping approximating the histogram intersection kernel for r = 1, which effectively mapped each low-level feature descriptor to a space 3 times larger than its original one. The resulting vectors \u03c8(f) have dimensionality 3 \u00d7 F = 52,080. To learn our basis classifiers, we used 6415-dimensional PCA projections of these high-dimensional vectors.\n\nWe compared PICODES with binary classeme vectors. For both descriptors we used a training set of K = 2659 classes randomly sampled from the ImageNet dataset [7], with 30 images for each category for a total of N = 2659 \u00d7 30 = 79,770 images. Each class in ImageNet is associated with a \u201csynset\u201d, which is a set of words describing the category. Since we wanted to evaluate the learned descriptors on the Caltech256 and ILSVRC2010 [3] benchmarks, we selected 2659 ImageNet training classes such that the synsets of these classes do not contain any of the Caltech256 or ILSVRC2010 class labels, so as to avoid \u201cpre-learning\u201d the test classes during the feature-training stage, which could yield a biased evaluation.\n\nFigure 3: Multiclass categorization accuracy on Caltech256 using different binary codes, as a function of the number of bits. PICODES outperform all the other compact codes. PICODES of 2048 bits match the accuracy of the state-of-the-art LP-\u03b2 classifier.\n\nFigure 4: Caltech256 classification accuracy for PICODES and classemes as a function of the number of training classes used to learn the descriptors.\n\nWe also present comparisons with binary codes trained to directly approximate the Euclidean distances between the vectors x, using the following previously proposed algorithms: locality-sensitive hashing (LSH) [13], spectral hashing (SH) [31], and binary reconstructive embeddings (BRE) [15]. Since these descriptors have in the past been used predominantly with the k-NN classifier, we have also tested this classification model, but obtained inferior results compared to using a linear SVM. For this reason, here we report only results using the linear SVM model.\n\nMulticlass recognition using PICODES. We first report in figure 3 the results showing multiclass classification accuracy achieved with binary codes on the Caltech256 data set. Since PICODES are optimized for categorization using linear models, we adopt simple linear \u201cone-versus-all\u201d SVMs as classifiers. 
For each Caltech256 category, the classifier was trained using 10 positive examples and a total of 2550 negative examples obtained by sampling 10 images from each of the other classes. We computed accuracies using 25 test examples per class, using 5-fold cross validation for the model selection. As usual, accuracy is computed as the average over the mean recognition rates per class. Figure 3 shows the results obtained with binary descriptors of varying dimensionality. While our approach can easily accommodate the case where the number of feature dimensions (C) is different from the number of feature-training categories (K), the classeme learning method can only produce descriptors of size K. Thus, the descriptor size is typically reduced through a subsequent feature selection stage [27, 19]. In this figure we show the accuracy obtained with classeme features selected by multi-class recursive feature elimination (RFE) with SVM [5], which at each iteration retrains the SVMs for all classes on the active features and then removes the m least-used active features until reaching the desired compactness. We also report the accuracy obtained with the original classeme vectors of [27], which were learned with exact kernels on a different training set, consisting of weakly-supervised images retrieved with text-based image search engines. From this figure we see that PICODES greatly outperform all the other compact codes considered here (classemes, LSH, SH, BRE) for all descriptor sizes. In addition, perhaps surprisingly, PICODES of 2048 bits yield even higher accuracy than the state-of-the-art multiple kernel combiner LP-\u03b2 [11] trained on our low-level features f (30.5% versus 29.7%). 
At the same time, our codes are 100 times smaller and reduce the training and testing time by two orders of magnitude compared to LP-\u03b2.\n\nWe have also investigated the influence of the parameter K, i.e., the number of training classes used to learn the descriptor. We learned different PICODES and classeme descriptors by varying K while keeping the number of training examples per class fixed to 30. Figure 4 shows the multiclass categorization accuracy on Caltech256 as a function of K. From this plot we see that PICODES profit more than classemes from a larger number of training classes, producing a further improvement in generalization on novel classes.\n\nFigure 5: Precision of object-class search using codes of 256 bytes on Caltech256: for a varying number of training examples per class, we report the percentage of true positives in the top 25 retrieved from a dataset containing 6375 distractors and 25 relevant results.\n\nFigure 6: Finding pictures of an object class in the ILSVRC2010 dataset, which includes 150K images for 1000 different classes, using 256-byte codes. PICODES enable accurate class retrieval from this large collection in less than a second.\n\nAn advantage of the classeme learning setup presented in [27] is the intrinsic parallelization that can be achieved during the learning of the C classeme classifiers (which are disjointly trained), enabling the use of more training data. 
We have considered this scenario, and tried learning the classeme descriptors from ImageNet using 5 times more images than for PICODES, i.e., 150 images for each training category for a total of N = 2659 \u00d7 150 = 398,850 examples. Despite the disparate training set sizes, we found that PICODES still outperformed classemes (22.41% versus 20.4% for 512 bits).\n\nRetrieval of object classes on Caltech256. In figure 5 we present results corresponding to our motivating application of object-class search, using codes of 256 bytes. For each Caltech256 class, we trained a one-versus-all linear SVM using p positive examples and p \u00d7 255 negative examples, for varying values of p. We then used the learned classifier to find images of that category in a database containing 6400 Caltech256 test images, with 25 images per class. The retrieval accuracy is measured as precision at 25, which is the proportion of true positives (i.e., images of the query class) ranked in the top 25. Again, we see that our features yield consistently better ranking precision compared to classeme vectors learned on the same ImageNet training set, and produce an average improvement of about 28% over the original classeme features.\n\nObject class search in a large image collection. Finally, we present experiments on the 150K-image data set of the Large Scale Visual Recognition Challenge 2010 (ILSVRC2010) [3], which includes images of 1000 different categories, different from those used to train PICODES. Again, we evaluate our binary codes on the task of object-class retrieval. For each of the 1000 classes, we train a linear SVM using all examples of that class available in the ILSVRC2010 training set (this number varies from a minimum of 619 to a maximum of 3047, depending on the class) and 4995 negative examples obtained by sampling five images from each of the other classes. 
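The precision-at-k metric used throughout these retrieval experiments can be computed as follows; this is a small sketch with assumed array conventions, not the authors' evaluation code.

```python
import numpy as np

def precision_at_k(scores, is_relevant, k=25):
    """Precision@k for object-class search: the fraction of the k
    highest-scoring database images that belong to the query class.

    scores      : (N,) classifier outputs (e.g. w^T h) for every database image.
    is_relevant : (N,) boolean ground-truth membership in the query class."""
    top_k = np.argsort(-scores)[:k]       # indices of the k best-scoring images
    return float(np.mean(is_relevant[top_k]))
```

With 25 relevant images per class in the Caltech256 database, precision at 25 equals recall at 25, which is why a single curve suffices in the plots.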
We test the classifiers on the ILSVRC2010 test set, which includes 150 images for each of the 1000 classes. Figure 6 shows a comparison between PICODES and classemes in terms of precision at k for varying k. Despite the very large number of distractors (149,850 for each query), search with our codes yields precisions exceeding 38%. Furthermore, the tiny size of our descriptor allows the entire data set to be easily kept in memory for efficient retrieval (the whole database size using our representation is only 36MB): the average search time for a query class, including the learning time, is about 1 second on an Intel Xeon X5680 @ 3.33GHz.\n\n4 Conclusion\n\nWe have described a new type of compact code, which is learned by directly minimizing a multiclass classification objective on a large set of offline training classes. This allows recognition of novel categories to be performed using extremely compact codes with state-of-the-art accuracy. Although there is much existing work on learning compact codes, we know of no other compact code which offers this performance on a category recognition task.\n\nOur experiments have focused on whole-image \u201cCaltech-like\u201d category recognition, while it is clear that subimage recognition is also an important application. However, we argue that for many image search tasks, whole-image performance is relevant, and for a very compact code, one could possibly encode several windows (dozens, say) in each image, while retaining a relatively compact representation.\n\nAdditional material, including software to extract PICODES from images, may be obtained from [1].\n\nAcknowledgments\n\nWe are grateful to Chen Fang for programming help. This research was funded in part by Microsoft and NSF CAREER award IIS-0952943.\n\nA Derivation of eq. 5\n\nWe present below the derivation of eq. 5. First, we rewrite our objective function, i.e., eq. 
4, in expanded form:

E(A, w_{1..K}, b_{1..K}) = \sum_{k=1}^{K} \left\{ \frac{1}{2} \|w_k\|^2 + \frac{\lambda}{N} \sum_{i=1}^{N} \ell\!\left[ y_{ik}\left( b_k + \sum_{c=1}^{C} w_{kc}\, \mathbf{1}[a_c^T x_i > 0] \right) \right] \right\}.

Fixing the parameters w_{1..K}, b_{1..K}, a_1, \ldots, a_{c-1}, a_{c+1}, \ldots, a_C and minimizing the function above with respect to a_c is equivalent to minimizing the following objective:

E'(a_c) = \sum_{k=1}^{K} \sum_{i=1}^{N} \ell\!\left[ y_{ik} w_{kc}\, \mathbf{1}[a_c^T x_i > 0] + y_{ik} b_k + \sum_{c' \neq c} y_{ik} w_{kc'}\, \mathbf{1}[a_{c'}^T x_i > 0] \right].

Let us define \alpha_{ikc} \equiv y_{ik} w_{kc} and \beta_{ikc} \equiv y_{ik} b_k + \sum_{c' \neq c} y_{ik} w_{kc'}\, \mathbf{1}[a_{c'}^T x_i > 0]. Then, we can rewrite the objective as follows:

E'(a_c) = \sum_{i=1}^{N} \sum_{k=1}^{K} \ell\!\left[ \alpha_{ikc}\, \mathbf{1}[a_c^T x_i > 0] + \beta_{ikc} \right]
        = \sum_{i=1}^{N} \left( \mathbf{1}[a_c^T x_i > 0] \sum_{k=1}^{K} \ell(\alpha_{ikc} + \beta_{ikc}) + \left(1 - \mathbf{1}[a_c^T x_i > 0]\right) \sum_{k=1}^{K} \ell(\beta_{ikc}) \right)
        = \sum_{i=1}^{N} \mathbf{1}[a_c^T x_i > 0] \sum_{k=1}^{K} \left( \ell(\alpha_{ikc} + \beta_{ikc}) - \ell(\beta_{ikc}) \right) + \text{const}.

Finally, it can be seen that optimizing this objective is equivalent to minimizing

\bar{E}(a_c) = \sum_{i=1}^{N} v_i\, \mathbf{1}[z_i a_c^T x_i > 0],

where v_i = \left| \sum_{k=1}^{K} \ell(\alpha_{ikc} + \beta_{ikc}) - \ell(\beta_{ikc}) \right| and z_i = \mathrm{sign}\!\left( \sum_{k=1}^{K} \ell(\alpha_{ikc} + \beta_{ikc}) - \ell(\beta_{ikc}) \right).

This yields eq. 5.

References

[1] http://vlg.cs.dartmouth.edu/picodes.
[2] B. Babenko, S. Branson, and S. Belongie. Similarity metrics for categorization: From monolithic to category specific. In Intl. Conf. Computer Vision, pages 293–300, 2009.
[3] A. Berg, J. Deng, and L. Fei-Fei. Large scale visual recognition challenge, 2010. http://www.image-net.org/challenges/LSVRC/2010/.
[4] A. Bosch, A. Zisserman, and X. Muñoz.
Representing shape with a spatial pyramid kernel. In Conf. Image and Video Retrieval (CIVR), pages 401–408, 2007.
[5] O. Chapelle and S. S. Keerthi. Multi-class feature selection with support vector machines. Proc. of the Am. Stat. Assoc., 2008.
[6] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In British Machine Vision Conf., 2008.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[8] M. Douze, A. Ramisa, and C. Schmid. Combining attributes and Fisher vectors for efficient image retrieval. In Proc. Comp. Vision Pattern Recogn. (CVPR), 2011.
[9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[10] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[11] P. Gehler and S. Nowozin. On feature combination for multiclass object classification. In ICCV, 2009.
[12] A. G. Hauptmann, R. Yan, W.-H. Lin, M. G. Christel, and H. D. Wactlar. Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Transactions on Multimedia, 9(5):958–966, 2007.
[13] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC '98: Proceedings of the thirtieth annual ACM symposium on Theory of computing, New York, NY, USA, 1998. ACM Press.
[14] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Proc. Comp. Vision Pattern Recogn. (CVPR), 2010.
[15] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Advances in Neural Information Processing Systems (NIPS), 2009.
[16] B. Kulis and K.
Grauman. Kernelized locality-sensitive hashing for scalable image search. In Intl. Conf. Computer Vision, 2010.
[17] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and Simile Classifiers for Face Verification. In Intl. Conf. Computer Vision, 2009.
[18] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[19] L. Li, H. Su, E. Xing, and L. Fei-Fei. Object Bank: A high-level image representation for scene classification & semantic feature sparsification. In NIPS, 2010.
[20] D. Lowe. Distinctive image features from scale-invariant keypoints. Intl. Jrnl. of Computer Vision, 60(2):91–110, 2004.
[21] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Visual Perception, Progress in Brain Research, 155, 2006.
[22] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In Advances in Neural Information Processing Systems (NIPS), 2010.
[23] M. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems (NIPS), 2007.
[24] R. Salakhutdinov and G. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50:969–978, 2009.
[25] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In Proc. Comp. Vision Pattern Recogn. (CVPR), 2007.
[26] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In Proc. Comp. Vision Pattern Recogn. (CVPR), 2008.
[27] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In ECCV, 2010.
[28] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. In CVPR, 2010.
[29] J. Vogel and B. Schiele.
Semantic modeling of natural scenes for content-based image retrieval. Intl. Jrnl. of Computer Vision, 72(2):133–157, 2007.
[30] G. Wang, D. Hoiem, and D. Forsyth. Learning image similarity from Flickr using stochastic intersection kernel machines. In Intl. Conf. Computer Vision, 2009.
[31] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2009.