{"title": "Object Recognition by Scene Alignment", "book": "Advances in Neural Information Processing Systems", "page_first": 1241, "page_last": 1248, "abstract": null, "full_text": "Object Recognition by Scene Alignment\n\nBryan C. Russell Antonio Torralba Ce Liu Rob Fergus William T. Freeman\n\nComputer Science and Arti\ufb01cial Intelligence Laboratory\n\nMassachusetts Institute of Technology\n\nCambrige, MA 02139 USA\n\n{brussell,torralba,celiu,fergus,billf}@csail.mit.edu\n\nAbstract\n\nCurrent object recognition systems can only recognize a limited number of object\ncategories; scaling up to many categories is the next challenge. We seek to build\na system to recognize and localize many different object categories in complex\nscenes. We achieve this through a simple approach: by matching the input im-\nage, in an appropriate representation, to images in a large training set of labeled\nimages. Due to regularities in object identities across similar scenes, the retrieved\nmatches provide hypotheses for object identities and locations. We build a prob-\nabilistic model to transfer the labels from the retrieval set to the input image. We\ndemonstrate the effectiveness of this approach and study algorithm component\ncontributions using held-out test sets from the LabelMe database.\n\n1\n\nIntroduction\n\nThe recognition of objects in a scene often consists of matching representations of image regions\nto an object model while rejecting background regions. Recent examples of this approach include\naligning pictorial cues [4], shape correspondence [1], and modeling the constellation of parts [5].\nOther models, exploiting knowledge of the scene context in which the objects reside, have proven\nsuccessful in boosting object recognition performance [18, 20, 15, 7, 13]. These methods model the\nrelationship between scenes and objects and allow information transfer across the two.\n\nHere, we exploit scene context using a different approach: we formulate the object detection prob-\nlem as one of aligning elements of the entire scene to a large database of labeled images. The\nbackground, instead of being treated as a set of outliers, is used to guide the detection process. Our\napproach relies on the observation that when we have a large enough database of labeled images, we\ncan \ufb01nd with high probability some images in the database that are very close to the query image\nin appearance, scene contents, and spatial arrangement [6, 19]. Since the images in the database\nare partially labeled, we can transfer the knowledge of the labeling to the query image. Figure 1\nillustrates this idea. With these assumptions, the problem of object detection in scenes becomes a\nproblem of aligning scenes. The main issues are: (1) Can we \ufb01nd a big enough dataset to span the\nrequired large number of scene con\ufb01gurations? (2) Given an input image, how do we \ufb01nd a set of\nimages that aligns well with the query image? (3) How do we transfer the knowledge about objects\ncontained in the labels?\n\nThe LabelMe dataset [14] is well-suited for this task, having a large number of images and labels\nspanning hundreds of object categories. 
Recent studies using non-parametric methods for computer vision and graphics [19, 6] show that when a large number of images are available, simple indexing techniques can be used to retrieve images with object arrangements similar to those of a query image.\n\nThe core part of our system is the transfer of labels from the images that best match the query image. We assume that there are commonalities amongst the labeled objects in the retrieved images and we cluster them to form candidate scenes. These scene clusters give hints as to what objects are depicted in the query image and their likely location. We describe a relatively simple generative model for determining which scene cluster best matches the query image and use this to detect objects.\n\nFigure 1 ((a) input image; (b) images with similar scene configuration; (c) output image with object labels transferred): Overview of our system. Given an input image, we search for images having a similar scene configuration in a large labeled database. The knowledge contained in the object labels for the best matching images is then transferred onto the input image to detect objects. Additional information, such as depth-ordering relationships between the objects, can also be transferred.\n\nFigure 2: Retrieval set images. Each of the two rows depicts an input image (on the left) and 30 images from the LabelMe dataset [14] that best match the input image using the gist feature [12] and L1 distance (the images are sorted by their distances in raster order). Notice that the retrieved images generally belong to similar scene categories. Also, the images contain mostly the same object categories, with the larger objects often matching in spatial location within the image. Many of the retrieved images share a similar geometric perspective.\n\nThe remaining sections are organized as follows: In Section 2, we describe our representation for scenes and objects. We formulate a model that integrates the information in the object labels with object detectors in Section 3. In Section 4, we extend this model to allow clustering of the retrieved images based on the object labels. We show experimental results of our system output in Section 5, and conclude in Section 6.\n\n2 Matching Scenes and Objects with the Gist Feature\n\nWe describe the gist feature [12], which is a low-dimensional representation of an image region and has been shown to achieve good performance for the scene recognition task when applied to an entire image. To construct the gist feature, an image region is passed through a Gabor filter bank comprising 4 scales and 8 orientations. The image region is divided into a 4x4 non-overlapping grid and the output energy of each filter is averaged within each grid cell. The resulting representation is a 4 \u00d7 8 \u00d7 16 = 512 dimensional vector. Note that the gist feature preserves spatial structure information and is similar to applying the SIFT descriptor [9] to the image region.\n
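The construction above can be summarized in a short sketch. The following is a minimal gist-like descriptor, assuming a grayscale image supplied as a NumPy array and using scikit-image's Gabor kernels; the specific filter frequencies, normalization, and preprocessing of [12] may differ, so treat this as illustrative rather than a faithful reimplementation.

```python
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel

def gist_descriptor(image, scales=4, orientations=8, grid=4):
    """Gist-like descriptor: mean Gabor energy on a grid x grid layout.

    image: 2-D float array (grayscale). Returns a vector of length
    scales * orientations * grid * grid (= 512 for the defaults).
    The octave-spaced frequencies below are an assumption, not the exact bank of [12].
    """
    h, w = image.shape
    features = []
    for s in range(scales):
        frequency = 0.25 / (2 ** s)          # assumed octave-spaced center frequencies
        for o in range(orientations):
            theta = np.pi * o / orientations
            kernel = gabor_kernel(frequency, theta=theta)
            # Filter energy = magnitude of the complex Gabor response.
            real = fftconvolve(image, np.real(kernel), mode='same')
            imag = fftconvolve(image, np.imag(kernel), mode='same')
            energy = np.sqrt(real ** 2 + imag ** 2)
            # Average the energy within each cell of a grid x grid partition.
            for gy in range(grid):
                for gx in range(grid):
                    cell = energy[gy * h // grid:(gy + 1) * h // grid,
                                  gx * w // grid:(gx + 1) * w // grid]
                    features.append(cell.mean())
    return np.array(features)
```

Retrieval then amounts to ranking the labeled images by the L1 distance between their descriptors and the descriptor of the query image.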
We consider the task of retrieving a set of images (which we refer to as the retrieval set) that closely matches the scene contents and geometrical layout of an input image. Figure 2 shows retrieval sets for two typical input images using the gist feature. We show the top 30 closest matching images from the LabelMe database based on the L1-norm distance, which is robust to outliers. Notice that the gist feature retrieves images that match the scene type of the input image. Furthermore, many of the objects depicted in the input image appear in the retrieval set, with the larger objects residing in approximately the same spatial location relative to the image. Also, the retrieval set has many images that share a similar geometric perspective. Of course, not every retrieved image matches well and we account for outliers in Section 4.\n\nWe evaluate the ability of the retrieval set to predict the presence of objects in the input image. For this, we found a retrieval set of 200 images and formed a normalized histogram (the histogram entries sum to one) of the object categories that were labeled. We compute performance for object categories with at least 200 training examples and that appear in at least 15 test images. We compute the area under the ROC curve for each object category. As a comparison, we evaluate the performance of an SVM applied to gist features by using the maximal score over a set of bounding boxes extracted from the image. The area under ROC performance of the retrieval set versus the SVM is shown in Figure 3 as a scatter plot, with each point corresponding to a tested object category. As a guide, a diagonal line is displayed; those points that reside above the diagonal indicate better SVM performance (and vice versa). Notice that the retrieval set predicts well the objects present in the input image and outperforms the detectors based on local appearance information (the SVM) for most object classes.\n\nFigure 3 (scatter plot; x-axis: retrieval set, y-axis: SVM (local appearance); plotted categories include screen, sidewalk, road, mouse, keyboard, head, pole, lamp, speaker, cup, mug, blind, book, bottle, chair, window, sky, paper, plant, door, phone, mousepad, table, motorbike, bookshelf, cabinet, car, tree, streetlight, and person): Evaluation of the goodness of the retrieval set by how well it predicts which objects are present in the input image. We build a simple classifier based on object counts in the retrieval set as provided by their associated LabelMe object labels. We compare this to detection based on local appearance alone using an SVM applied to bounding boxes in the input image (the maximal score is used). The area under the ROC curve is computed for many object categories for the two classifiers. Performance is shown as a scatter plot where each point represents an object category. Notice that the retrieval set predicts object presence well and in a majority of cases outperforms the SVM output, which is based only on local appearance.\n
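As a sketch of the evaluation just described: each test image is scored for a category by the normalized frequency of that category's labels in its retrieval set, and the scores are summarized by the area under the ROC curve. Function and variable names are ours, and the LabelMe-specific retrieval step is omitted.

```python
import numpy as np
from collections import Counter
from sklearn.metrics import roc_auc_score

def presence_score(retrieval_labels, category):
    """retrieval_labels: list (one entry per retrieved image) of lists of
    object-category names. Returns the normalized-histogram value for
    `category`, i.e. its share of all labels in the retrieval set."""
    counts = Counter(l for labels in retrieval_labels for l in labels)
    total = sum(counts.values())
    return counts[category] / total if total else 0.0

def retrieval_set_auc(test_retrievals, test_truth, category):
    """test_retrievals: list of retrieval-set label lists, one per test image.
    test_truth: binary list, 1 if the category is present in the test image."""
    scores = [presence_score(r, category) for r in test_retrievals]
    return roc_auc_score(test_truth, scores)
```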
3 Utilizing Retrieval Set Images for Object Detection\n\nIn Section 2, we observed that the set of labels corresponding to images that best match an input image predicts well the contents of the input image. In this section, we will describe a model that integrates local appearance with object presence and spatial likelihood information given by the object labels belonging to the retrieval set.\n\nWe wish to model the relationship between object categories o, their spatial location x within an image, and their appearance g. For a set of N images, each having Mi object proposals over L object categories, we assume a joint model that factorizes as follows:\n\np(o, x, g | \theta, \phi, \eta) = \prod_{i=1}^{N} \prod_{j=1}^{M_i} \sum_{h_{i,j}=0}^{1} p(o_{i,j} | h_{i,j}, \theta) \, p(x_{i,j} | o_{i,j}, h_{i,j}, \phi) \, p(g_{i,j} | o_{i,j}, h_{i,j}, \eta)    (1)\n\nThe joint model factorizes as a product of three terms: (i) p(oi,j | hi,j = m, \u03b8m), the likelihood of which object categories will appear in the image, (ii) p(xi,j | oi,j = l, hi,j = m, \u03c6m,l), the likely spatial locations of observing object category l in the image, and (iii) p(gi,j | oi,j = l, hi,j = m, \u03b7m,l), the appearance likelihood of object category l. We let hi,j indicate whether object category oi,j is actually present at location xi,j (hi,j = 1 for presence, hi,j = 0 for absence). Figure 4 depicts the above as a graphical model. We use plate notation, where the variable nodes inside a plate are duplicated based on the counts depicted in the top-left corner of the plate.\n\nFigure 4: Graphical model that integrates information about which objects are likely to be present in the image o, their appearance g, and their likely spatial location x. The parameters for object appearance \u03b7 are learned offline using positive and negative examples for each object class. The parameters for object presence likelihood \u03b8 and spatial location \u03c6 are learned online from the retrieval set. For all possible bounding boxes in the input image, we wish to infer h, which indicates whether an object is present or absent.\n\nWe instantiate the model as follows. The spatial locations of objects are parameterized as bounding boxes x_{i,j} = (c^x_{i,j}, c^y_{i,j}, c^w_{i,j}, c^h_{i,j}), where (c^x_{i,j}, c^y_{i,j}) is the centroid and (c^w_{i,j}, c^h_{i,j}) are the width and height (bounding boxes are extracted from object labels by tightly cropping the polygonal annotation). Each component of xi,j is normalized with respect to the image to lie in [0, 1]. We assume \u03b8m are multinomial parameters and \u03c6m,l = (\u00b5m,l, \u039bm,l) are Gaussian means and covariances over the bounding box parameters. Finally, we assume gi,j is the output of a trained SVM applied to a gist feature \u02dcgi,j. We let \u03b7m,l parameterize the logistic function (1 + exp(-\eta_{m,l} [1\; g_{i,j}]^T))^{-1}.\n\nThe parameters \u03b7m,l are learned offline by first training SVMs for each object class on the set of all labeled examples of object class l and a set of distractors. We then fit logistic functions to the positive and negative examples of each class. We learn the parameters \u03b8m and \u03c6m,l online using the object labels corresponding to the retrieval set. These are learned by simply counting the object class occurrences and fitting Gaussians to the bounding boxes corresponding to the object labels.\n\nFor the input image, we wish to infer the latent variables hi,j corresponding to a dense sampling of all possible bounding box locations xi,j and object classes oi,j using the learned parameters \u03b8m, \u03c6m,l, and \u03b7m,l. For this, we compute the posterior distribution p(hi,j = m | oi,j = l, xi,j, gi,j, \u03b8m, \u03c6m,l, \u03b7m,l), which is proportional to the product of the three learned distributions, for m = {0, 1}.\n
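The per-box scoring implied by Eq. (1) can be sketched as follows. This is an illustrative sketch, not the paper's code: the box parameterization follows the text, but the containers (theta as a category vector, phi as per-category Gaussian parameters, eta as per-(m, l) logistic parameters) and the uniform background terms for h = 0 are our own conventions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def logistic(params, g):
    """Logistic appearance likelihood with parameters (bias, slope)."""
    b, w = params
    return 1.0 / (1.0 + np.exp(-(b + w * g)))

def box_posterior(l, x, g, theta, phi, eta, xi=0.5):
    """Posterior probability that a candidate bounding box contains category l (h = 1).

    l     : category index
    x     : length-4 box (cx, cy, cw, ch), each normalized to [0, 1]
    g     : SVM score of the gist feature for category l at this box
    theta : length-L multinomial over categories, learned from the retrieval set
    phi   : dict mapping category -> (mean, covariance) of the box parameters
    eta   : dict mapping (m, category) -> (bias, slope) logistic parameters
    xi    : Bernoulli prior on h
    """
    L = len(theta)
    mean, cov = phi[l]
    # h = 1: category likelihood x Gaussian spatial term x appearance term.
    lik1 = xi * theta[l] * multivariate_normal.pdf(x, mean=mean, cov=cov) * logistic(eta[(1, l)], g)
    # h = 0: uniform over the L categories and over box parameters on [0, 1]^4
    # (density 1), with the logistic fit to the negative examples of category l.
    lik0 = (1.0 - xi) * (1.0 / L) * 1.0 * logistic(eta[(0, l)], g)
    return lik1 / (lik1 + lik0)
```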
The procedure outlined here allows for significant computational savings over naive application of an object detector. Without finding similar images that match the input scene configuration, we would need to apply an object detector densely across the entire image for all object categories. In contrast, our model can constrain which object categories to look for and where. More precisely, we only need to consider object categories with relatively high probability in the scene model and bounding boxes within the range of the likely search locations. These can be decided based on thresholds. Also note that the conditional independences implied by the graphical model allow us to fit the parameters from the retrieval set and train the object detectors separately.\n\nNote that for tractability, we assume Dirichlet and Normal-Inverse-Wishart conjugate prior distributions over \u03b8m and \u03c6m,l with hyperparameters \u03b2 and \u03b3 = (\u03ba, \u03d1, \u03bd, \u2206) (expected mean \u03d1, \u03ba pseudo-counts on the scale of the spatial observations, \u03bd degrees of freedom, and sample covariance \u2206). Furthermore, we assume a Bernoulli prior distribution over hi,j parameterized by \u03be = 0.5. We hand-tuned the remaining parameters in the model. For hi,j = 0, we assume the noninformative distributions oi,j \u223c Uniform(1/L) and each component of xi,j uniform on [0, 1].\n
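The online learning of \u03b8m and \u03c6m,l from the retrieval-set labels, together with the conjugate-prior smoothing above, might look roughly like the sketch below. This is a simplified MAP-style estimate rather than the exact conjugate posterior updates; the hyperparameter names follow the text, but the update formulas are our own approximation.

```python
import numpy as np
from collections import defaultdict

def fit_scene_model(retrieval_labels, retrieval_boxes, L, beta=0.1,
                    kappa=0.1, vartheta=0.5, nu=3.0, delta=0.01):
    """Fit theta (category frequencies) and phi (Gaussian box statistics)
    from the retrieval-set object labels, with conjugate-prior smoothing.

    retrieval_labels: list of category indices, one per labeled object
    retrieval_boxes : matching list of normalized (cx, cy, cw, ch) boxes
    Simplified sketch: Dirichlet-smoothed counts for theta, and Gaussian
    estimates for phi shrunk toward the prior mean/covariance.
    """
    # theta: Dirichlet-smoothed multinomial over the L categories.
    counts = np.full(L, beta)
    for l in retrieval_labels:
        counts[l] += 1
    theta = counts / counts.sum()

    # phi: per-category Gaussian over box parameters, regularized toward the
    # prior mean vartheta and prior covariance delta * I.
    by_cat = defaultdict(list)
    for l, box in zip(retrieval_labels, retrieval_boxes):
        by_cat[l].append(np.asarray(box, dtype=float))
    prior_mean = np.full(4, vartheta)
    prior_cov = np.eye(4) * delta
    phi = {}
    for l, boxes in by_cat.items():
        X = np.stack(boxes)
        n = len(X)
        mean = (kappa * prior_mean + X.sum(axis=0)) / (kappa + n)
        scatter = np.cov(X, rowvar=False) * (n - 1) if n > 1 else np.zeros((4, 4))
        cov = (nu * prior_cov + scatter) / (nu + n)
        phi[l] = (mean, cov)
    return theta, phi
```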
4 Clustering Retrieval Set Images for Robustness to Mismatches\n\nWhile many images in the retrieval set match the input image scene configuration and contents, there are also outliers. Typically, most of the labeled objects in the outlier images are not present in the input image or in the set of correctly matched retrieval images. In this section, we describe a process to organize the retrieval set images into consistent clusters based on the co-occurrence of the object labels within the images. The clusters will typically correspond to different scene types and/or viewpoints. The task is to then automatically choose the cluster of retrieval set images that will best assist us in detecting objects in the input image.\n\nWe augment the model of Section 3 by assigning each image to a latent cluster si. The cluster assignments are distributed according to the mixing weights \u03c0. We depict the model in Figure 5(a). Intuitively, the model finds clusters using the object labels oi,j and their spatial location xi,j within the retrieved set of images. To automatically infer the number of clusters, we use a Dirichlet Process prior on the mixing weights \u03c0 \u223c Stick(\u03b1), where Stick(\u03b1) is the stick-breaking process of Griffiths, Engen, and McCloskey [8, 11, 16] with concentration parameter \u03b1. In the Chinese restaurant analogy, the different clusters correspond to tables and the parameters for object presence \u03b8k and spatial location \u03c6k are the dishes served at a given table. An image (along with its object labels) corresponds to a single customer that is seated at a table.\n\nFigure 5: (a) Graphical model for clustering retrieval set images using their object labels. We extend the model of Figure 4 to allow each image to be assigned to a latent cluster si, which is drawn from mixing weights \u03c0. We use a Dirichlet process prior to automatically infer the number of clusters. We illustrate the clustering process for the retrieval set corresponding to the input image in (b). (c) Histogram of the number of images assigned to the five clusters with highest likelihood. (d) Montages of retrieval set images assigned to each cluster, along with their object labels (colors show spatial extent), shown in (e). (f) The likelihood of an object category being present in a given cluster (the top nine most likely objects are listed). (g) Spatial likelihoods for the objects listed in (f). Note that the montage cells are sorted in raster order.\n\nWe illustrate the clustering process for a retrieval set belonging to the input image in Figure 5(b). The five clusters with highest likelihood are visualized in the columns of Figure 5(d)-(g). Figure 5(d) shows montages of retrieval images with highest likelihood that were assigned to each cluster. The total number of retrieval images that were assigned to each cluster is shown as a histogram in Figure 5(c). The number of images assigned to each cluster is proportional to the cluster mixing weights, \u03c0. Figure 5(e) depicts the object labels that were provided for the images in Figure 5(d), with the colors showing the spatial extent of the object labels. Notice that the images and labels belonging to each cluster share approximately the same object categories and geometrical configuration. Also, the cluster that best matches the input image tends to have the highest number of retrieval images assigned to it. Figure 5(f) shows the likelihood of objects that appear in the cluster (the nine objects with highest likelihood are shown). This corresponds to \u03b8 in the model. Figure 5(g) depicts the spatial distribution of the object centroid within the cluster. The montage of nine cells corresponds to the nine objects listed in Figure 5(f), sorted in raster order. The spatial distributions illustrate \u03c6. Notice that typically at least one cluster predicts well the objects contained in the input image, in addition to their location, via the object likelihoods and spatial distributions.\n
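Before detailing the inference procedure, the clustering just described can be sketched as collapsed Gibbs sampling over Chinese-restaurant-process table assignments. This is a simplified version of the Rao-Blackwellized sampler discussed next: it clusters images by their object-label histograms only, omits the spatial term, and uses an approximate predictive likelihood, so it is illustrative rather than the paper's exact sampler.

```python
import numpy as np

def crp_gibbs(label_histograms, alpha=100.0, beta=0.1, iters=100, rng=None):
    """Chinese-restaurant-process clustering of retrieval-set images.

    label_histograms: (N, L) array of per-image object-label counts.
    Returns an array of cluster (table) assignments, one per image.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, L = label_histograms.shape
    z = np.zeros(N, dtype=int)                       # start with all images at one table
    counts = {0: label_histograms.sum(axis=0).astype(float)}
    sizes = {0: N}

    def log_pred(x, c):
        # Log-likelihood of histogram x under the cluster's smoothed label frequencies
        # (an approximation to the exact Dirichlet-multinomial predictive).
        a = c + beta
        return float(np.sum(x * np.log(a / a.sum())))

    for _ in range(iters):
        for i in range(N):
            x = label_histograms[i].astype(float)
            k = z[i]
            counts[k] -= x                           # remove image i from its table
            sizes[k] -= 1
            if sizes[k] == 0:
                del counts[k], sizes[k]
            tables = list(counts)
            logp = [np.log(sizes[t]) + log_pred(x, counts[t]) for t in tables]
            new_table = max(counts, default=-1) + 1  # CRP: option of opening a new table
            tables.append(new_table)
            logp.append(np.log(alpha) + log_pred(x, np.zeros(L)))
            logp = np.asarray(logp)
            p = np.exp(logp - logp.max())
            p /= p.sum()
            choice = tables[rng.choice(len(tables), p=p)]
            z[i] = choice
            counts[choice] = counts.get(choice, np.zeros(L)) + x
            sizes[choice] = sizes.get(choice, 0) + 1
    return z
```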
To learn \u03b8k and \u03c6k, we use a Rao-Blackwellized Gibbs sampler to draw samples from the posterior distribution over si given the object labels belonging to the set of retrieved images. We ran the Gibbs sampler for 100 iterations. Empirically, we observed relatively fast convergence to a stable solution. Note that improved performance may be achieved with variational inference for Dirichlet Processes [10, 17]. We manually tuned all hyperparameters using a validation set of images, with concentration parameter \u03b1 = 100 and spatial location parameters \u03ba = 0.1, \u03d1 = 0.5, \u03bd = 3, and \u2206 = 0.01 across all bounding box parameters (with the exception of \u2206 = 0.1 for the horizontal centroid location, which reflects less certainty a priori about the horizontal location of objects). We used a symmetric Dirichlet hyperparameter with \u03b2l = 0.1 across all object categories l.\n\nFor final object detection, we use the learned parameters \u03c0, \u03b8, and \u03c6 to infer hi,j. Since si and hi,j are latent random variables for the input image, we perform hard EM by marginalizing over hi,j to infer the best cluster si*. We then in turn fix si* and infer hi,j, as outlined in Section 3.\n\n5 Experimental Results\n\nIn this section we show qualitative and quantitative results for our model. We use a subset of the LabelMe dataset for our experiments, discarding spurious and nonlabeled images. The dataset is split into training and test sets. The training set has 15691 images and 105034 annotations. The test set has 560 images and 3571 annotations. The test set comprises images of street scenes and indoor office scenes. To avoid overfitting, we used street scene images that were photographed in a different city from the images in the training set. To overcome the diverse object labels provided by users of LabelMe, we used WordNet [3] to resolve synonyms. For object detection, we extracted 3809 bounding boxes per image. For the final detection results, we used non-maximal suppression, as sketched below.\n
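The paper does not spell out its suppression procedure, so the following is a standard greedy non-maximal suppression sketch under the usual intersection-over-union criterion; the 0.5 overlap threshold is our assumption.

```python
import numpy as np

def non_max_suppression(boxes, scores, overlap_threshold=0.5):
    """Greedy non-maximal suppression over candidate detections.

    boxes : (N, 4) array of (x1, y1, x2, y2) corners
    scores: (N,) detection confidences
    Returns indices of the boxes to keep, highest scores first.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top-scoring box with the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap the kept box by more than the threshold.
        order = order[1:][iou <= overlap_threshold]
    return keep
```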
Example object detections from our system are shown in Figure 6(b),(d),(e). Notice that our system can find many different objects embedded in different scene type configurations. When mistakes are made, the proposed object location typically makes sense within the scene. In Figure 6(c), we compare against a baseline object detector using only appearance information and trained with a linear kernel SVM. Thresholds for both detectors were set to yield a 0.5 false positive rate per image for each object category (\u223c1.3e-4 false positives per window). Notice that our system produces more detections and rejects objects that do not belong to the scene. In Figure 6(e), we show typical failures of the system, which usually occur when the retrieval set is not correct or an input image is outside of the training set.\n\nIn Figure 7, we show quantitative results for object detection for a number of object categories. We show ROC curves (plotted on log-log axes) for the local appearance detector, the detector from Section 3 (without clustering), and the full system with clustering. We scored detections using the PASCAL VOC 2006 criteria [2], where the outputs are sorted from most confident to least and the ratio of intersection area to union area is computed between an output bounding box and a ground-truth bounding box. If the ratio exceeds 0.5, then the output is deemed correct and the ground-truth label is removed. While this scoring criterion is good for some objects, other objects are not well represented by bounding boxes (e.g. buildings and sky).\n\nNotice that the detectors that take into account context typically outperform the detector using local appearance only. Also, clustering does as well as, and in some cases outperforms, no clustering. Finally, the overall system sometimes performs worse for indoor scenes. This is due to poor retrieval set matching, which causes a poor context model to be learned.\n\nFigure 6: (a) Input images. (b) Object detections from our system combining scene alignment with local detection. (c) Object detections using appearance information only with an SVM. Notice that our system detects more objects and rejects out-of-context objects. (d) More outputs from our system. Notice that many different object categories are detected across different scenes. (e) Failure cases for our system. These often occur when the retrieval set is incorrect.\n\nFigure 7 (panels show ROC curves for tree (531), building (547), person (113), sidewalk (196), car (138), road (232), sky (144), motorbike (40), screen (268), bookshelf (47), keyboard (154), and wall (69)): Comparison of the full system against the local appearance only detector (SVM). Detection rate for a number of object categories tested at a fixed false positive per window rate of 2e-04 (0.8 false positives per image per object class). The number of test examples appears in parentheses next to the category name. We plot performance for a number of classes for the baseline SVM object detector (blue), the detector of Section 3 using no clustering (red), and the full system (green). Notice that detectors taking into account context perform better in most cases than using local appearance alone. Also, clustering does as well as, and sometimes exceeds, no clustering. Notable exceptions are for some indoor object categories. This is due to poor retrieval set matching, which causes a poor context model to be learned.\n\n6 Conclusion\n\nWe presented a framework for object detection in scenes based on transferring knowledge about objects from a large labeled image database. We have shown that a relatively simple parametric model, trained on images loosely matching the spatial configuration of the input image, is capable of accurately inferring which objects are depicted in the input image along with their location. We showed that we can successfully detect a wide range of objects depicted in a variety of scene types.\n\n7 Acknowledgments\n\nThis work was supported by the National Science Foundation Grant No. 0413232, the National Geospatial-Intelligence Agency NEGI-1582-04-0004, and the Office of Naval Research MURI Grant N00014-06-1-0734.\n\nReferences\n\n[1] A. Berg, T. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondence. In CVPR, volume 1, pages 26\u201333, June 2005.\n\n[2] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC 2006) results. Technical report, September 2006. The PASCAL 2006 dataset can be downloaded at http://www.pascal-network.org/challenges/VOC/voc2006/.\n
[3] C. Fellbaum. WordNet: An Electronic Lexical Database. Bradford Books, 1998.\n\n[4] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. Intl. J. Computer Vision, 61(1), 2005.\n\n[5] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, 2003.\n\n[6] J. Hays and A. Efros. Scene completion using millions of photographs. In SIGGRAPH, 2007.\n\n[7] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. In CVPR, 2006.\n\n[8] H. Ishwaran and M. Zarepour. Exact and approximate sum-representations for the Dirichlet process. Canadian Journal of Statistics, 30:269\u2013283, 2002.\n\n[9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Intl. J. Computer Vision, 60(2):91\u2013110, 2004.\n\n[10] J. McAuliffe, D. Blei, and M. Jordan. Nonparametric empirical Bayes for the Dirichlet process mixture model. Statistics and Computing, 16:5\u201314, 2006.\n\n[11] R. M. Neal. Density modeling and clustering using Dirichlet diffusion trees. In Bayesian Statistics, 7:619\u2013629, 2003.\n\n[12] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. Intl. J. Computer Vision, 42(3):145\u2013175, 2001.\n\n[13] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In IEEE Intl. Conf. on Computer Vision, 2007.\n\n
[14] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. Technical Report AIM-2005-025, MIT AI Lab Memo, September 2005.\n\n[15] E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In IEEE Intl. Conf. on Computer Vision, 2005.\n\n[16] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2006.\n\n[17] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Info. Proc. Systems, 2006.\n\n[18] A. Torralba. Contextual priming for object detection. Intl. J. Computer Vision, 53(2):153\u2013167, 2003.\n\n[19] A. Torralba, R. Fergus, and W. T. Freeman. Tiny images. Technical Report AIM-2005-025, MIT AI Lab Memo, September 2005.\n\n[20] A. Torralba, K. Murphy, W. Freeman, and M. Rubin. Context-based vision system for place and object recognition. In Intl. Conf. Computer Vision, 2003.\n", "award": [], "sourceid": 778, "authors": [{"given_name": "Bryan", "family_name": "Russell", "institution": null}, {"given_name": "Antonio", "family_name": "Torralba", "institution": null}, {"given_name": "Ce", "family_name": "Liu", "institution": null}, {"given_name": "Rob", "family_name": "Fergus", "institution": null}, {"given_name": "William", "family_name": "Freeman", "institution": null}]}