{"title": "LSDA: Large Scale Detection through Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 3536, "page_last": 3544, "abstract": "A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) have emerged as clear winners on object classification benchmarks, in part due to training with 1.2M+ labeled classification images. Unfortunately, only a small fraction of those labels are available for the detection task. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect detection data and label it with precise bounding boxes. In this paper, we propose Large Scale Detection through Adaptation (LSDA), an algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors. Our method has the potential to enable detection for the tens of thousands of categories that lack bounding box annotations, yet have plenty of classification data. Evaluation on the ImageNet LSVRC-2013 detection challenge demonstrates the efficacy of our approach. This algorithm enables us to produce a >7.6K detector by using available classification data from leaf nodes in the ImageNet tree. We additionally demonstrate how to modify our architecture to produce a fast detector (running at 2fps for the 7.6K detector). Models and software are available at", "full_text": "LSDA: Large Scale Detection through Adaptation\n\nJudy Hoffman(cid:5), Sergio Guadarrama(cid:5), Eric Tzeng(cid:5), Ronghang Hu\u2207, Jeff Donahue(cid:5),\n\n(cid:5)EECS, UC Berkeley, \u2207EE, Tsinghua University\n\n{jhoffman, sguada, tzeng, jdonahue}@eecs.berkeley.edu\n\nhrh11@mails.tsinghua.edu.cn\n\nRoss Girshick(cid:5), Trevor Darrell(cid:5), Kate Saenko(cid:52)\n\n(cid:5)EECS, UC Berkeley, (cid:52)CS, UMass Lowell\n\n{rbg, trevor}@eecs.berkeley.edu, saenko@cs.uml.edu\n\nAbstract\n\nA major challenge in scaling object detection is the dif\ufb01culty of obtaining labeled\nimages for large numbers of categories. Recently, deep convolutional neural net-\nworks (CNNs) have emerged as clear winners on object classi\ufb01cation benchmarks,\nin part due to training with 1.2M+ labeled classi\ufb01cation images. Unfortunately,\nonly a small fraction of those labels are available for the detection task. It is much\ncheaper and easier to collect large quantities of image-level labels from search en-\ngines than it is to collect detection data and label it with precise bounding boxes.\nIn this paper, we propose Large Scale Detection through Adaptation (LSDA), an\nalgorithm which learns the difference between the two tasks and transfers this\nknowledge to classi\ufb01ers for categories without bounding box annotated data, turn-\ning them into detectors. Our method has the potential to enable detection for the\ntens of thousands of categories that lack bounding box annotations, yet have plenty\nof classi\ufb01cation data. Evaluation on the ImageNet LSVRC-2013 detection chal-\nlenge demonstrates the ef\ufb01cacy of our approach. This algorithm enables us to\nproduce a >7.6K detector by using available classi\ufb01cation data from leaf nodes in\nthe ImageNet tree. We additionally demonstrate how to modify our architecture\nto produce a fast detector (running at 2fps for the 7.6K detector). 
Models and\nsoftware are available at lsda.berkeleyvision.org.\n\n1\n\nIntroduction\n\nBoth classi\ufb01cation and detection are key visual recognition challenges, though historically very\ndifferent architectures have been deployed for each. Recently, the R-CNN model [1] showed how\nto adapt an ImageNet classi\ufb01er into a detector, but required bounding box data for all categories.\nWe ask, is there something generic in the transformation from classi\ufb01cation to detection that can be\nlearned on a subset of categories and then transferred to other classi\ufb01ers?\nOne of the fundamental challenges in training object detection systems is the need to collect a\nlarge of amount of images with bounding box annotations. The introduction of detection challenge\ndatasets, such as PASCAL VOC [2], have propelled progress by providing the research community\na dataset with enough fully annotated images to train competitive models although only for 20\nclasses. Even though the more recent ImageNet detection challenge dataset [3] has extended the set\nof annotated images, it only contains data for 200 categories. As we look forward towards the goal\nof scaling our systems to human-level category detection, it becomes impractical to collect a large\nquantity of bounding box labels for tens or hundreds of thousands of categories.\n\n\u2217This work was supported in part by DARPA\u2019s MSEE and SMISC programs, by NSF awards IIS-1427425,\n\nand IIS-1212798, IIS-1116411, and by support from Toyota.\n\n1\n\n\fFigure 1: The core idea is that we can learn detectors (weights) from labeled classi\ufb01cation data (left),\nfor a wide range of classes. For some of these classes (top) we also have detection labels (right), and\ncan learn detectors. But what can we do about the classes with classi\ufb01cation data but no detection\ndata (bottom)? Can we learn something from the paired relationships for the classes for which we\nhave both classi\ufb01ers and detectors, and transfer that to the classi\ufb01er at the bottom to make it into a\ndetector?\n\nIn contrast, image-level annotation is comparatively easy to acquire. The prevalence of image tags\nallows search engines to quickly produce a set of images that have some correspondence to any\nparticular category. ImageNet [3], for example, has made use of these search results in combination\nwith manual outlier detection to produce a large classi\ufb01cation dataset comprised of over 20,000\ncategories. While this data can be effectively used to train object classi\ufb01er models, it lacks the\nsupervised annotations needed to train state-of-the-art detectors.\nIn this work, we propose Large Scale Detection through Adaptation (LSDA), an algorithm that\nlearns to transform an image classi\ufb01er into an object detector. To accomplish this goal, we use\nsupervised convolutional neural networks (CNNs), which have recently been shown to perform well\nboth for image classi\ufb01cation [4] and object detection [1, 5]. We cast the task as a domain adaptation\nproblem, considering the data used to train classi\ufb01ers (images with category labels) as our source\ndomain, and the data used to train detectors (images with bounding boxes and category labels) as our\ntarget domain. We then seek to \ufb01nd a general transformation from the source domain to the target\ndomain, that can be applied to any image classi\ufb01er to adapt it into a object detector (see Figure 1).\nGirshick et al. 
(R-CNN) [1] demonstrated that adaptation, in the form of \ufb01ne-tuning, is very impor-\ntant for transferring deep features from classi\ufb01cation to detection and partially inspired our approach.\nHowever, the R-CNN algorithm uses classi\ufb01cation data only to pre-train a deep network and then\nrequires a large number of bounding boxes to train each detection category.\nOur LSDA algorithm uses image classi\ufb01cation data to train strong classi\ufb01ers and requires detection\nbounding box labeled data for only a small subset of the \ufb01nal detection categories and much less\ntime. It uses the classes labeled with both classi\ufb01cation and detection labels to learn a transformation\nof the classi\ufb01cation network into a detection network. It then applies this transformation to adapt\nclassi\ufb01ers for categories without any bounding box annotated data into detectors.\nOur experiments on the ImageNet detection task show signi\ufb01cant improvement (+50% relative\nmAP) over a baseline of just using raw classi\ufb01er weights on object proposal regions. One can\nadapt any ImageNet-trained classi\ufb01er into a detector using our approach, whether or not there are\ncorresponding detection labels for that class.\n\n2 Related Work\n\nRecently, Multiple Instance Learning (MIL) has been used for training detectors using weak labels,\ni.e. images with category labels but not bounding box labels. The MIL paradigm estimates latent\nlabels of examples in positive training bags, where each positive bag is known to contain at least one\npositive example. Ali et al. [6] constructs positive bags from all object proposal regions in a weakly\nlabeled image that is known to contain the object, and uses a version of MIL to learn an object\ndetector. A similar method [7] learns detectors from PASCAL VOC images without bounding box\n\n2\n\nI CLASSIFY dog apple I DET dog apple I CLASSIFY cat W CLASSIFY dog W CLASSIFY apple Classifiers W DET dog W DET apple Detectors W CLASSIFY cat W DET cat I DET ? \fFigure 2: Detection with the LSDA network. Given an image, extract region proposals, reshape the\nregions to \ufb01t into the network size and \ufb01nally produce detection scores per category for the region.\nLayers with red dots/\ufb01ll indicate they have been modi\ufb01ed/learned during \ufb01ne-tuning with available\nbounding box annotated data.\n\nlabels. MIL-based methods are a promising approach that is complimentary to ours. They have not\nyet been evaluated on the large-scale ImageNet detection challenge to allow for direct comparison.\nDeep convolutional neural networks (CNNs) have emerged as state of the art on popular object\nclassi\ufb01cation benchmarks (ILSVRC, MNIST) [4]. In fact, \u201cdeep features\u201d extracted from CNNs\ntrained on the object classi\ufb01cation task are also state of the art on other tasks, e.g., subcategory\nclassi\ufb01cation, scene classi\ufb01cation, domain adaptation [8] and even image matching [9]. Unlike the\npreviously dominant features (SIFT [10], HOG [11]), deep CNN features can be learned for each\nspeci\ufb01c task, but only if suf\ufb01cient labeled training data are available. R-CNN [1] showed that \ufb01ne-\ntuning deep features on a large amount of bounding box labeled data signi\ufb01cantly improves detection\nperformance.\nDomain adaptation methods aim to reduce dataset bias caused by a difference in the statistical dis-\ntributions between training and test domains. 
In this paper, we treat the transformation of classi\ufb01ers\ninto detectors as a domain adaptation task. Many approaches have been proposed for classi\ufb01er\nadaptation; e.g., feature space transformations [12], model adaptation approaches [13, 14] and joint\nfeature and model adaptation [15, 16]. However, even the joint learning models are not able to mod-\nify the feature extraction process and so are limited to shallow adaptation techniques. Additionally,\nthese methods only adapt between visual domains, keeping the task \ufb01xed, while we adapt both from\na large visual domain to a smaller visual domain and from a classi\ufb01cation task to a detection task.\nSeveral supervised domain adaptation models have been proposed for object detection. Given a\ndetector trained on a source domain, they adjust its parameters on labeled target domain data. These\ninclude variants for linear support vector machines [17, 18, 19], as well as adaptive latent SVMs [20]\nand adaptive exemplar SVM [21]. A related recent method [22] proposes a fast adaptation technique\nbased on Linear Discriminant Analysis. These methods require labeled detection data for all object\ncategories, both in the source and target domains, which is absent in our scenario. To our knowledge,\nours is the \ufb01rst method to adapt to held-out categories that have no detection data.\n\n3 Large Scale Detection through Adaptation (LSDA)\n\nWe propose Large Scale Detection through Adaptation (LSDA), an algorithm for adapting classi\ufb01ers\nto detectors. With our algorithm, we are able to produce a detection network for all categories of\ninterest, whether or not bounding boxes are available at training time (see Figure 2).\nSuppose we have K categories we want to detect, but we only have bounding box annotations for m\ncategories. We will refer to the set of categories with bounding box annotations as B = {1, ...m},\nand the set of categories without bounding box annotations as set A = {m, ..., K}. In practice,\nwe will likely have m (cid:28) K, as is the case in the ImageNet dataset. We assume availability of\nclassi\ufb01cation data (image-level labels) for all K categories and will use that data to initialize our\nnetwork.\n\n3\n\nbackground:\"0.25\"det\"layers\"175\"det\"fc6\"det\"fc7\"Input\"image\"Region\"Proposals\"Warped\"\"region\"LSDA\"Net\"cat:\"0.90\"\"fcA\"cat?\"yes\"dog:\"0.45\"fcB\"dog?\"no\"Produce\"\"Predic=ons\"background\"\u03b4B\"adapt\"\fLSDA transforms image classi\ufb01ers into object detectors using three key insights:\n\n1. Recognizing background is an important step in adapting a classi\ufb01er into a detector\n2. Category invariant information can be transferred between the classi\ufb01er and detector fea-\n\nture representations\n\n3. There may be category speci\ufb01c differences between a classi\ufb01er and a detector\n\nWe will next demonstrate how our method accomplishes each of these insights as we describe the\ntraining of LSDA.\n\n3.1 Training LSDA: Category Invariant Adaptation\n\nFor our convolutional neural network, we adopt the architecture of Krizhevsky et al. [4], which\nachieved state-of-the-art performance on the ImageNet ILSVRC2012 classi\ufb01cation challenge. Since\nthis network requires a large amount of data and time to train its approximately 60 million param-\neters, we start by pre-training the CNN trained on the ILSVRC2012 classi\ufb01cation dataset, which\ncontains 1.2 million classi\ufb01cation-labeled images of 1000 categories. 
Pre-training on this dataset\nhas been shown to be a very effective technique [8, 5, 1], both in terms of performance and in terms\nof limiting the amount of in-domain labeled data needed to successfully tune the network. Next, we\nreplace the last weight layer (1000 linear classi\ufb01ers) with K linear classi\ufb01ers, one for each category\nin our task. This weight layer is randomly initialized and then we \ufb01ne-tune the whole network on\nour classi\ufb01cation data. At this point, we have a network that can take an image or a region proposal\nas input, and produce a set of scores for each of the K categories. We \ufb01nd that even using the net\ntrained on classi\ufb01cation data in this way produces a strong baseline (see Section 4).\nWe next transform our classi\ufb01cation network into a detection network. We do this by \ufb01ne-tuning\nlayers 1-7 using the available labeled detection data for categories in set B. Following the Regions-\nbased CNN (R-CNN) [1] algorithm, we collect positive bounding boxes for each category in set B\nas well as a set of background boxes using a region proposal algorithm, such as selective search [23].\nWe use each labeled region as a \ufb01ne-tuning input to the CNN after padding and warping it to the\nCNN\u2019s input size. Note that the R-CNN \ufb01ne-tuning algorithm requires bounding box annotated data\nfor all categories and so can not directly be applied to train all K detectors. Fine-tuning transforms\nall network weights (except for the linear classi\ufb01ers for set A) and produces a softmax detector for\ncategories in set B, which includes a weight vector for the new background class.\nLayers 1-7 are shared between all categories in set B and we \ufb01nd empirically that \ufb01ne-tuning induces\na generic, category invariant transformation of the classi\ufb01cation network into a detection network.\nThat is, even though \ufb01ne-tuning sees no detection data for categories in set A, the network trans-\nforms in a way that automatically makes the original set A image classi\ufb01ers much more effective\nat detection (see Figure 3). Fine-tuning for detection also learns a background weight vector that\nencodes a generic \u201cbackground\u201d category. This background model is important for modeling the\ntask shift from image classi\ufb01cation, which does not include background distractors, to detection,\nwhich is dominated by background patches.\n\n3.2 Training LSDA: Category Speci\ufb01c Adaptation\n\nFinally, we learn a category speci\ufb01c transformation that will change the classi\ufb01er model parameters\ninto the detector model parameters that operate on the detection feature representation. The category\nspeci\ufb01c output layer (f c8) is comprised of f cA, f cB, \u03b4B, and f c \u2212 BG. For categories in set B,\nthis transformation can be learned through directly \ufb01ne-tuning the category speci\ufb01c parameters f cB\n(Figure 2). This is equivalent to \ufb01xing f cB and learning a new layer, zero initialized, \u03b4B, with\nequivalent loss to f cB, and adding together the outputs of \u03b4B and f cB.\nLet us de\ufb01ne the weights of the output layer of the original classi\ufb01cation network as W c, and the\nweights of the output layer of the adapted detection network as W d. We know that for a category\ni \u2208 B, the \ufb01nal detection weights should be computed as W d\ni + \u03b4Bi. 
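That is, for each category i in set B the detector's output weights are the original classifier weights plus the residual learned during detection fine-tuning; written out with the same symbols as in the text:

```latex
W^{d}_{i} = W^{c}_{i} + \delta_{B_i}, \qquad \forall\, i \in B .
```
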
However, since\nthere is no detection data for categories in A, we can not directly learn a corresponding \u03b4A layer\nduring \ufb01ne-tuning. Instead, we can approximate the \ufb01ne-tuning that would have occurred to f cA\nhad detection data been available. We do this by \ufb01nding the nearest neighbors categories in set B\nfor each category in set A and applying the average change. Here we de\ufb01ne nearest neighbors as\n\ni = W c\n\n4\n\n\fthose categories with the nearest (minimal Euclidean distance) (cid:96)2-normalized f c8 parameters in the\nclassi\ufb01cation network. This corresponds to the classi\ufb01cation model being most similar and hence,\nwe assume, the detection model should be most similar. We denote the kth nearest neighbor in set\nB of category j \u2208 A as NB(j, k), then we compute the \ufb01nal output detection weights for categories\nin set A as:\n\nk(cid:88)\n\ni=1\n\n\u2200j \u2208 A : W d\n\nj = W c\n\nj +\n\n1\nk\n\n\u03b4BNB (j,i)\n\n(1)\n\nThus, we adapt the category speci\ufb01c parameters even without bounding boxes for categories in set\nA. In the next section we experiment with various values of k, including taking the full average:\nk = |B|.\n\n3.3 Detection with LSDA\n\nAt test time we use our network to extract K + 1 scores per region proposal in an image (similar to\nthe R-CNN [1] pipeline). One for each category and an additional score for the background category.\nFinally, for a given region, the score for category i is computed by combining the per category score\nwith the background score: scorei \u2212 scorebackground.\nIn contrast to the R-CNN [1] model which trains SVMs on the extracted features from layer 7\nand bounding box regression on the extracted features from layer 5, we directly use the \ufb01nal score\nvector to produce the prediction scores without either of the retraining steps. This choice results in a\nsmall performance loss, but offers the \ufb02exibility of being able to directly combine the classi\ufb01cation\nportion of the network that has no detection labeled data, and reduces the training time from 3 days\nto roughly 5.5 hours.\n4 Experiments\n\nTo demonstrate the effectiveness of our approach we present quantitative results on the ILSVRC2013\ndetection dataset. The dataset offers a 200-category detection challenge. The training set has \u223c400K\nannotated images and on average 1.534 object classes per image. The validation set has 20K anno-\ntated images with \u223c50K annotated objects. We simulate having access to classi\ufb01cation labels for\nall 200 categories and having detection annotations for only the \ufb01rst 100 categories (alphabetically\nsorted).\n\n4.1 Experiment Setup & Implementation Details\n\nWe start by separating our data into classi\ufb01cation and detection sets for training and a validation\nset for testing. Since the ILSVRC2013 training set has on average fewer objects per image than\nthe validation set, we use this data as our classi\ufb01cation data. To balance the categories we use\n\u22481000 images per class (200,000 total images). Note: for classi\ufb01cation data we only have access\nto a single image-level annotation that gives a category label. In effect, since the training set may\ncontain multiple objects, this single full-image label is a weak annotation, even compared to other\nclassi\ufb01cation training data sets. Next, we split the ILSVRC2013 validation set in half as [1] did,\nproducing two sets: val1 and val2. 
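Throughout the experiments below, "output layer adaptation" refers to the category specific update of Section 3.2. For reference, that update is compact enough to sketch in a few lines of NumPy; the illustration below is not the released implementation, the array names and shapes are hypothetical, and only the weight arithmetic is shown: the fine-tuned residual is added for set B, and the average residual of the k nearest set-B neighbors (in l2-normalized classifier-weight space) is added for set A, as in Eq. (1).

```python
import numpy as np

# Illustrative sketch of the LSDA output-layer adaptation (Section 3.2, Eq. (1)).
# All names and shapes are hypothetical: K categories in total, the first m of which
# (set B) have bounding-box data; D is the dimensionality of the features feeding fc8.
K, m, D, k = 200, 100, 4096, 10
W_c = np.random.randn(K, D).astype(np.float32)       # classification output weights, one row per category
delta_B = np.random.randn(m, D).astype(np.float32)   # residuals learned for set B during detection fine-tuning

W_d = W_c.copy()
W_d[:m] += delta_B                                    # set B: W^d_i = W^c_i + delta_{B_i}

# Set A: average the residuals of the k nearest set-B categories, where nearness is
# Euclidean distance between l2-normalized classifier weight vectors.
W_norm = W_c / np.linalg.norm(W_c, axis=1, keepdims=True)
for j in range(m, K):                                 # categories without bounding-box data
    dists = np.linalg.norm(W_norm[:m] - W_norm[j], axis=1)
    nn = np.argsort(dists)[:k]                        # indices of the k nearest neighbors in B
    W_d[j] = W_c[j] + delta_B[nn].mean(axis=0)        # Eq. (1)
```

In the released system this update is applied to the fc8 parameters of the fine-tuned network; plain arrays are used here only to make the arithmetic explicit.
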
To construct our detection training set, we take the images\nwith bounding box labels from val1 for only the \ufb01rst 100 categories (\u2248 5000 images). Since the\nvalidation set is relatively small, we augment our detection set with 1000 bounding box annotated\nimages per category from the ILSVRC2013 training set (following the protocol of [1]). Finally we\nuse the second half of the ILSVRC2013 validation set (val2) for our evaluation.\nWe implemented our CNN architectures and execute all \ufb01ne-tuning using the open source software\npackage Caffe [24] and have made our model de\ufb01nitions weights publicly available.\n\n4.2 Quantitative Analysis on Held-out Categories\n\nWe evaluate the importance of each component of our algorithm through an ablation study. As\na baseline we consider training the network with only the classi\ufb01cation data (no adaptation) and\napplying the network to the region proposals. The summary of the importance of our three adaptation\ncomponents is shown in Figure 3. Our full LSDA model achieves a 50% relative mAP boost over\n\n5\n\n\fDetection\n\nAdaptation Layers\n\nOutput Layer\nAdaptation\nNo Adapt (Classi\ufb01cation Network)\n\nfcbgrnd\nfcbgrnd,fc6\nfcbgrnd,fc7\nfcbgrnd,fcB\nfcbgrnd,fc6,fc7\nfcbgrnd,fc6,fc7,fcB\nfcbgrnd,layers1-7,fcB\nAvg NN (k=5)\nfcbgrnd,layers1-7,fcB\nfcbgrnd,layers1-7,fcB\nAvg NN (k=10)\nfcbgrnd,layers1-7,fcB Avg NN (k=100)\n\n-\n-\n-\n-\n-\n-\n-\n\nOracle: Full Detection Network\n\nmAP Trained\n100 Categories\n\nmAP Held-out\n100 Categories\n\nmAP All\n\n200 Categories\n\n12.63\n14.93\n24.72\n23.41\n18.04\n25.78\n26.33\n27.81\n28.12\n27.95\n27.91\n\n29.72\n\n10.31\n12.22\n13.72\n14.57\n11.74\n14.20\n14.42\n15.85\n15.97\n16.15\n15.96\n\n26.25\n\n11.90\n13.60\n19.20\n19.00\n14.90\n20.00\n20.40\n21.83\n22.05\n22.05\n21.94\n\n28.00\n\nTable 1: Ablation study for the components of LSDA. We consider removing different pieces of our\nalgorithm to determine which pieces are essential. We consider training with the \ufb01rst 100 (alpha-\nbetically) categories of the ILSVRC2013 detection validation set (on val1) and report mean average\nprecision (mAP) over the 100 trained on and 100 held out categories (on val2). We \ufb01nd the best\nimprovement is from \ufb01ne-tuning all layers and using category speci\ufb01c adaptation.\n\nthe classi\ufb01cation only network. The most important step of our algorithm proved to be adapting\nthe feature representation, while the least important was adapting the category speci\ufb01c parameter.\nThis \ufb01ts with our intuition that the main bene\ufb01t of our approach is to transfer category invariant\ninformation from categories with known bounding box annotation to those without the bounding\nbox annotations.\nIn Table 1, we present a more detailed analysis of the\ndifferent adaptation techniques we could use to train the\nnetwork. We \ufb01nd that the best category invariant adap-\ntation approach is to learn the background category layer\nand adapt all convolutional and fully connected layers,\nbringing mAP on the held-out categories from 10.31% up\nto 15.85%. Additionally, using output layer adaptation\n(k = 10) further improves performance, bringing mAP\nto 16.15% on the held-out categories (statistically signif-\nicant at p = 0.017 using a paired sample t-test [25]). 
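A paired test of this kind compares the two variants' per-category average precisions over the same set of held-out categories. A minimal sketch of the mechanics, with placeholder AP arrays rather than the values behind Table 1, is:

```python
import numpy as np
from scipy import stats

# Placeholder per-category average precisions for the 100 held-out categories;
# the real values come from the evaluation summarized in Table 1.
rng = np.random.default_rng(0)
ap_without_output_adapt = rng.uniform(0.0, 0.4, size=100)
ap_with_output_adapt = ap_without_output_adapt + rng.normal(0.003, 0.01, size=100)

# Paired-sample t-test on the per-category differences (the test cited in the text).
result = stats.ttest_rel(ap_with_output_adapt, ap_without_output_adapt)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```
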
The\nlast row shows the performance achievable by our detec-\ntion network if it had access to detection data for all 200\ncategories, and serves as a performance upper bound.1\nWe \ufb01nd that one of the biggest reasons our algorithm im-\nproves is from reducing localization error. For example,\nin Figure 4, we show that while the classi\ufb01cation only\ntrained net tends to focus on the most discriminative part\nof an object (ex: face of an animal) after our adaptation, we learn to localize the whole object (ex:\nentire body of the animal).\n\nFigure 3: Comparison (mAP%) of our\nfull system (LSDA) on categories with\nno bounding boxes at training time.\n\n4.3 Error Analysis on Held Out Categories\n\nWe next present an analysis of the types of errors that our system (LSDA) makes on the held out\nobject categories. First, in Figure 5, we consider three types of false positive errors: Loc (local-\nization errors), BG (confusion with background), and Oth (other error types, which is essentially\n\n1To achieve R-CNN performance requires additionally learning SVMs on the activations of layer 7 and\nbounding box regression on the activations of layer 5. Each of these steps adds between 1-2mAP at high\ncomputation cost and using the SVMs removes the adaptation capacity of the system.\n\n6\n\n05101520Classification NetLSDA (bg only)LSDA (bg+ft)LSDA10.3112.215.8516.15\fFigure 4: We show example detections on held out categories, for which we have no detection\ntraining data, where our adapted network (LSDA) (shown with green box) correctly localizes and\nlabels the object of interest, while the classi\ufb01cation network baseline (shown in red) incorrectly\nlocalizes the object. This demonstrates that our algorithm learns to adapt the classi\ufb01er into a detector\nwhich is sensitive to localization and background rejection.\n\ncorrectly localizing an object, but misclassifying it). After separating all false positives into one of\nthese three error types we visually show the percentage of errors found in each type as you look at\nthe top scoring 25-3200 false positives.2 We consider the baseline of starting with the classi\ufb01cation\nonly network and show the false positive breakdown in Figure 5(b). Note that the majority of false\npositive errors are confusion with background and localization errors. In contrast, after adapting\nthe network using LSDA we \ufb01nd that the errors found in the top false positives are far less due to\nlocalization and background confusion (see Figure 5(c)). Arguably one of the biggest differences be-\ntween classi\ufb01cation and detection is the ability to accurately localize objects and reject background.\nTherefore, we show that our method successfully adapts the classi\ufb01cation parameters to be more\nsuitable for detection.\nIn Figure 5(a) we show examples of the top scoring Oth error types for LSDA on the held-out\ncategories. This means the detector localizes an incorrect object type. For example, the motorcycle\ndetector localized and mislabeled bicycle and the lemon detector localized and mislabeled an orange.\nIn general, we noticed that many of the top false positives from the Oth error type were confusion\nwith very similar categories.\n\n4.4 Large Scale Detection\n\nTo showcase the capabilities of our technique we produced a 7604 category detector. The \ufb01rst\ncategories correspond to the 200 categories from the ILSVRC2013 challenge dataset which have\nbounding box labeled data available. 
The other 7404 categories correspond to leaf nodes in the\nImageNet database and are trained using the available full image labeled classi\ufb01cation data. We\ntrained a full detection network using the 200 fully annotated categories and trained the other 7404\nlast layer nodes using only the classi\ufb01cation data. Since we lack bounding box annotated data for\nthe majority of the categories we show example top detections in Figure 6. The results are \ufb01ltered\nusing non-max suppression across categories to only show the highest scoring categories.\nThe main contribution of our algorithm is the adaptation technique for modifying a convolutional\nneural network for detection. However, the choice of network and how the net is used at test time\nboth effect the detection time computation. We have therefore also implemented and released a\nversion of our algorithm running with fast region proposals [27] on a spatial pyramid pooling net-\nwork [28], reducing our detection time down to half a second per image (from 4s per image) with\nnearly the same performance. We hope that this will allow the use of our 7.6K model on large data\nsources such as videos. We have released the 7.6K model and code to run detection (both the way\npresented in this paper and our faster version) at lsda.berkeleyvision.org.\n\n2We modi\ufb01ed the analysis software made available by Hoeim et al. [26] to work on ILSVRC-2013 detection\n\n7\n\n\f(a) Example Top Scoring False Positives: LSDA correctly localizes but incorrectly labels object\n\n(b) Classi\ufb01cation Network\n\n(c) LSDA Network\n\nFigure 5: We examine the top scoring false positives from LSDA. Many of our top scoring false\npositives come from confusion with other categories (a). (b-c) Comparison of error type breakdown\non the categories which have no training bounding boxes available (held-out categories). After\nadapting the network using our algorithm (LSDA), the percentage of false positive errors due to\nlocalization and background confusion is reduced (c) as compared to directly using the classi\ufb01cation\nnetwork in a detection framework (b).\n\nFigure 6: Example top detections from our 7604 category detector. Detections from the 200 cat-\negories that have bounding box training data available are shown in blue. Detections from the\nremaining 7404 categories for which only classi\ufb01cation training data is available are shown in red.\n\n5 Conclusion\n\nWe have presented an algorithm that is capable of transforming a classi\ufb01er into a detector. We\nuse CNN models to train both a classi\ufb01cation and a detection network. Our multi-stage algorithm\nuses corresponding classi\ufb01cation and detection data to learn the change from a classi\ufb01cation CNN\nnetwork to a detection CNN network, and applies that difference to future classi\ufb01ers for which there\nis no available detection data.\nWe show quantitatively that without seeing any bounding box annotated data, we can increase per-\nformance of a classi\ufb01cation network by 50% relative improvement using our adaptation algorithm.\nGiven the signi\ufb01cant improvement on the held out categories, our algorithm has the potential to\nenable detection of tens of thousands of categories. 
All that would be needed is to train a classi\ufb01ca-\ntion layer for the new categories and use our \ufb01ne-tuned detection model along with our output layer\nadaptation techniques to update the classi\ufb01cation parameters directly.\nOur approach signi\ufb01cantly reduces the overhead of producing a high quality detector. We hope that\nin doing so we will be able to minimize the gap between having strong large-scale classi\ufb01ers and\nstrong large-scale detectors. There is still a large gap to reach oracle (known bounding box labels)\nperformance. For future work we would like to explore multiple instance learning techniques to\ndiscover and mine patches for the categories that lack bounding box data.\n\n8\n\nmicrophone (sim): ov=0.00 1\u2212r=\u22123.00microphoneminiskirt (sim): ov=0.00 1\u2212r=\u22121.00miniskirtmotorcycle (sim): ov=0.00 1\u2212r=\u22126.00motorcyclemushroom (sim): ov=0.00 1\u2212r=\u22128.00mushroomnail (sim): ov=0.00 1\u2212r=\u22124.00naillaptop (sim): ov=0.00 1\u2212r=\u22123.00laptoplemon (sim): ov=0.00 1\u2212r=\u22125.00lemontotal false positivespercentage of each typeHeld\u2212out Categories 255010020040080016003200020406080100LocOthBGtotal false positivespercentage of each typeHeld\u2212out Categories 255010020040080016003200020406080100LocOthBGAmerican bison: 7.0taillight: 0.9wheel and axle: 1.0car: 6.0whippet: 2.0dog: 4.1sofa: 8.0\fReferences\n[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection\n\nand semantic segmentation. In In Proc. CVPR, 2014.\n\n[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object\n\nclasses (voc) challenge. International Journal of Computer Vision, 88(2):303\u2013338, June 2010.\n[3] A. Berg, J. Deng, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. 2012.\n[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classi\ufb01cation with deep convolutional neural\n\nnetworks. In Proc. NIPS, 2012.\n\n[5] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition,\n\nlocalization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.\n\n[6] K. Ali and K. Saenko. Con\ufb01dence-rated multiple instance boosting for object detection. In IEEE Confer-\n\nence on Computer Vision and Pattern Recognition, 2014.\n\n[7] H. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects\nwith minimal supervision. In Proceedings of the International Conference on Machine Learning (ICML),\n2014.\n\n[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convo-\n\nlutional Activation Feature for Generic Visual Recognition. In Proc. ICML, 2014.\n\n[9] Philipp Fischer, Alexey Dosovitskiy, and Thomas Brox. Descriptor matching with convolutional neural\n\nnetworks: a comparison to sift. ArXiv e-prints, abs/1405.5769, 2014.\n\n[10] D. G. Lowe. Distinctive image features from scale-invariant key points. IJCV, 2004.\n[11] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In In Proc. CVPR, 2005.\n[12] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using\n\nasymmetric kernel transforms. In Proc. CVPR, 2011.\n\n[13] J. Yang, R. Yan, and A. Hauptmann. Adapting SVM classi\ufb01ers to data with shifted distributions. In ICDM\n\nWorkshops, 2007.\n\n[14] Y. Aytar and A. Zisserman. 
Tabula rasa: Model transfer for object category detection. In Proc. ICCV,\n\n2011.\n\n[15] J. Hoffman, E. Rodner, J. Donahue, K. Saenko, and T. Darrell. Ef\ufb01cient learning of domain-invariant\n\nimage representations. In Proc. ICLR, 2013.\n\n[16] L. Duan, D. Xu, and Ivor W. Tsang. Learning with augmented features for heterogeneous domain adap-\n\ntation. In Proc. ICML, 2012.\n\n[17] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive svms. ACM\n\nMultimedia, 2007.\n\n[18] Y. Aytar and A. Zisserman. Tabula rasa: Model transfer for object category detection. In IEEE Interna-\n\ntional Conference on Computer Vision, 2011.\n\n[19] J. Donahue, J. Hoffman, E. Rodner, K. Saenko, and T. Darrell. Semi-supervised domain adaptation with\n\ninstance constraints. In Computer Vision and Pattern Recognition (CVPR), 2013.\n\n[20] J. Xu, S. Ramos, D. V\u00b4azquez, and A.M. L\u00b4opez. Domain adaptation of deformable part-based models.\n\nIEEE Trans. on Pattern Analysis and Machine Intelligence, In Press, 2014.\n\n[21] Y. Aytar and A. Zisserman. Enhancing exemplar svms using part level transfer regularization. In British\n\nMachine Vision Conference, 2012.\n\n[22] D. Goehring, J. Hoffman, E. Rodner, K. Saenko, and T. Darrell. Interactive adaptation of real-time object\n\ndetectors. In International Conference on Robotics and Automation (ICRA), 2014.\n\n[23] J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, and A.W.M. Smeulders. Selective search for object\n\nrecognition. International Journal of Computer Vision, 104(2):154\u2013171, 2013.\n\n[24] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio\nGuadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv\npreprint arXiv:1408.5093, 2014.\n\n[25] M. D. Smucker, J. Allan, and B. Carterette. A comparison of statistical signi\ufb01cance tests for information\n\nretrieval evaluation. In In Conference on Information and Knowledge Management, 2007.\n\n[26] D. Hoeim, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In In Proc. ECCV,\n\n2012.\n\n[27] P. Kr\u00a8ahenb\u00a8uhl and V. Koltun. Geodesic object proposals. In In Proc. ECCV, 2014.\n[28] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual\n\nrecognition. In In Proc. ECCV, 2014.\n\n9\n\n\f", "award": [], "sourceid": 1863, "authors": [{"given_name": "Judy", "family_name": "Hoffman", "institution": "UC Berkeley"}, {"given_name": "Sergio", "family_name": "Guadarrama", "institution": "UC Berkeley"}, {"given_name": "Eric", "family_name": "Tzeng", "institution": "UC Berkeley"}, {"given_name": "Ronghang", "family_name": "Hu", "institution": "Tsinghua University"}, {"given_name": "Jeff", "family_name": "Donahue", "institution": "UC Berkeley"}, {"given_name": "Ross", "family_name": "Girshick", "institution": "UC Berkeley"}, {"given_name": "Trevor", "family_name": "Darrell", "institution": "UC Berkeley"}, {"given_name": "Kate", "family_name": "Saenko", "institution": "UMass Lowell"}]}