{"title": "Im2Text: Describing Images Using 1 Million Captioned Photographs", "book": "Advances in Neural Information Processing Systems", "page_first": 1143, "page_last": 1151, "abstract": "We develop and demonstrate automatic image description methods using a large captioned photo collection.  One contribution is our technique for the automatic collection of this new dataset -- performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions.  Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning.", "full_text": "Im2Text: Describing Images Using 1 Million\n\nCaptioned Photographs\n\nVicente Ordonez\n\nGirish Kulkarni\n\nTamara L Berg\n\nStony Brook University\nStony Brook, NY 11794\n\n{vordonezroma or tlberg}@cs.stonybrook.edu\n\nAbstract\n\nWe develop and demonstrate automatic image description methods using a large\ncaptioned photo collection. One contribution is our technique for the automatic\ncollection of this new dataset \u2013 performing a huge number of Flickr queries and\nthen \ufb01ltering the noisy results down to 1 million images with associated visually\nrelevant captions. Such a collection allows us to approach the extremely chal-\nlenging problem of description generation using relatively simple non-parametric\nmethods and produces surprisingly effective results. We also develop methods in-\ncorporating many state of the art, but fairly noisy, estimates of image content to\nproduce even more pleasing results. 
Finally we introduce a new objective performance measure for image captioning.

1 Introduction

Producing a relevant and accurate caption for an arbitrary image is an extremely challenging problem, perhaps nearly as difficult as the underlying general image understanding task. However, there are already many images with relevant associated descriptive text available in the noisy vastness of the web. The key is to find the right images and make use of them in the right way! In this paper, we present a method to effectively skim the top of the image understanding problem to caption photographs by collecting and utilizing the large body of images on the internet with associated visually descriptive text. We follow in the footsteps of past work on internet vision that has demonstrated that big data can often make big problems – e.g. image localization [13], retrieving photos with specific content [27], or image parsing [26] – much more bite-sized and amenable to very simple non-parametric matching methods. In our case, with a large captioned photo collection we can generate image descriptions surprisingly well even with basic global image representations for retrieval and caption transfer. In addition, we show that it is possible to make use of large numbers of state-of-the-art, but fairly noisy, estimates of image content to produce more pleasing and relevant results.

People communicate through language, whether written or spoken. They often use this language to describe the visual world around them. Studying collections of existing natural image descriptions and how to compose descriptions for novel queries will help advance progress toward more complex human recognition goals, such as how to tell the story behind an image. These goals include determining what content people judge to be most important in images and what factors they use to construct natural language to describe imagery.
For example, when given a picture like that on the top row, middle column of figure 1, the user describes the girl, the dog, and their location, but selectively chooses not to describe the surrounding foliage and hut.

This link between visual importance and descriptions leads naturally to the problem of text summarization in natural language processing (NLP). In text summarization, the goal is to select or generate a summary for a document. Some of the most common and effective methods proposed for summarization rely on extractive summarization [25, 22, 28, 19, 23], where the most important or relevant sentence (or sentences) is selected from a document to serve as the document's summary. Often a variety of features related to document content [23], surface [25], events [19] or feature combinations [28] are used in the selection process to produce sentences that reflect the most significant concepts in the document.

Figure 1: SBU Captioned Photo Dataset: Photographs with user-associated captions from our web-scale captioned photo collection. We collect a large number of photos from Flickr and filter them to produce a data collection containing over 1 million well captioned pictures.

In our photo captioning problem, we would like to generate a caption for a query picture that summarizes the salient image content. We do this by considering a large relevant document set constructed from related image captions and then use extractive methods to select the best caption(s) for the image. In this way we implicitly make use of human judgments of content importance during description generation, by directly transferring human made annotations from one image to another.

This paper presents two extractive approaches for image description generation. The first uses global image representations to select relevant captions (Sec 3).
The second incorporates features derived from noisy estimates of image content (Sec 5). Of course, the first requirement for any extractive method is a document from which to extract. Therefore, to enable our approach we build a web-scale collection of images with associated descriptions (i.e. captions) to serve as our document for relevant caption extraction. A key factor to making such a collection effective is to filter it so that descriptions are likely to refer to visual content. Some small collections of captioned images have been created by hand in the past. The UIUC Pascal Sentence data set1 contains 1k images, each of which is associated with 5 human generated descriptions. The ImageClef2 image retrieval challenge contains 10k images with associated human descriptions. However neither of these collections is large enough to facilitate the reasonable image based matching necessary for our goals, as demonstrated by our experiments on captioning with varying collection size (Sec 3).
In addition this is the first – to our knowledge – attempt to mine the internet for general captioned images on a web scale!

In summary, our contributions are:

• A large novel data set containing images from the web with associated captions written by people, filtered so that the descriptions are likely to refer to visual content.
• A description generation method that utilizes global image representations to retrieve and transfer captions from our data set to a query image.
• A description generation method that utilizes both global representations and direct estimates of image content (objects, actions, stuff, attributes, and scenes) to produce relevant image descriptions.

Figure 1 example captions: "Man sits in a rusted car buried in the sand on Waitarere beach"; "Interior design of modern white and brown living room furniture against white wall with a lamp hanging."; "Emma in her hat looking super cute"; "Little girl and her dog in northern Thailand."

1 http://vision.cs.uiuc.edu/pascal-sentences/
2 http://www.imageclef.org/2011

1.1 Related Work

Studying the association between words and pictures has been explored in a variety of tasks, including: labeling faces in news photographs with associated captions [2], finding a correspondence between keywords and image regions [1, 6], or moving beyond objects to mid-level recognition elements such as attributes [16, 8, 17, 12].

Image description generation in particular has been studied in a few recent papers [9, 11, 15, 30]. Kulkarni et al [15] generate descriptions from scratch based on detected object, attribute, and prepositional relationships. This results in descriptions for images that are usually closely related to image content, but that are also often quite verbose and non-humanlike. Yao et al [30] look at the problem
of generating text using various hierarchical knowledge ontologies and with a human in the loop for image parsing (except in specialized circumstances). Feng and Lapata [11] generate captions for images using extractive and abstractive generation methods, but assume relevant documents are provided as input, whereas our generation method requires only an image as input.

Figure 2: System flow: 1) Input query image, 2) Candidate matched images retrieved from our web-scale captioned collection using global image representations, 3) High level information is extracted about image content including objects, attributes, actions, people, stuff, scenes, and tfidf weighting, 4) Images are re-ranked by combining all content estimates, 5) Top 4 resulting captions.

A recent approach from Farhadi et al [9] is the most relevant to ours. In this work the authors produce image descriptions via a retrieval method, by translating both images and text descriptions to a shared meaning space represented by a single <object, action, scene> tuple. A description for a query image is produced by retrieving whole image descriptions via this meaning space from a set of image descriptions (the UIUC Pascal Sentence data set). This results in descriptions that are very human – since they were written by humans – but which may not be relevant to the specific image content. This limited relevancy often occurs because of problems of sparsity, both in the data collection – 1000 images is too few to guarantee similar image matches – and in the representation – only a few categories for 3 types of image content are considered.

In contrast, we attack the caption generation problem for much more general images (images found via thousands of Flickr queries compared to 1000 images from Pascal) and a larger set of object categories (89 vs 20).
In addition to extending the object category list considered, we also include a wider variety of image content aspects, including: non-part based stuff categories, attributes of objects, person specific action models, and a larger number of common scene classes. We also generate our descriptions via an extractive method with access to a much larger and more general set of captioned photographs from the web (1 million vs 1 thousand).

2 Overview & Data Collection

Our captioning system proceeds as follows (see fig 2 for illustration): 1) a query image is input to the captioning system, 2) candidate match images are retrieved from our web-scale collection of captioned photographs using global image descriptors, 3) high level information related to image content, e.g. objects, scenes, etc., is extracted, 4) images in the match set are re-ranked based on image content, 5) the best caption(s) is returned for the query. Captions can also be generated after step 2 from descriptions associated with top globally matched images.

In the rest of the paper, we describe collecting a web-scale data set of captioned images from the internet (Sec 2.1), caption generation using a global representation (Sec 3), content estimation for various content types (Sec 4), and finally present an extension to our generation method that incorporates content estimates (Sec 5).

2.1 Building a Web-Scale Captioned Collection

One key contribution of our paper is a novel web-scale database of photographs with associated descriptive text. To enable effective captioning of novel images, this database must be good in two ways: 1) it must be large so that image based matches to a query are reasonably similar, 2) the captions associated with the database photographs must be visually relevant so that transferring captions between pictures is useful.
To achieve the first requirement we query Flickr using a huge number of pairs of query terms (objects, attributes, actions, stuff, and scenes). This produces a very large, but noisy, initial set of photographs with associated text. To achieve our second requirement we filter this set of photos so that the descriptions attached to a picture are relevant and visually descriptive. To encourage visual descriptiveness in our collection, we select only those images with descriptions of satisfactory length based on observed lengths in visual descriptions. We also enforce that retained descriptions contain at least 2 words belonging to our term lists and at least one prepositional word, e.g. "on", "under", which often indicate visible spatial relationships.

This results in a final collection of over 1 million images with associated text descriptions – the SBU Captioned Photo Dataset. These text descriptions generally function in a similar manner to image captions, and usually directly refer to some aspects of the visual image content (see fig 1 for examples). Hereafter, we will refer to this web based collection of captioned images as C.

Figure 2 example captions: "Across the street from Yannicks apartment. At night the headlight on the handlebars above the door lights up."; "The building in which I live. My window is on the right on the 4th floor"; "This is the car I was in after they had removed the roof and successfully removed me to the ambulance."; "I really like doors. I took this photo out of the car window while driving by a church in Pennsylvania."

Figure 3: Size Matters: Example matches to a query image for varying data set sizes.

Query Set: We randomly sample 500 images from our collection for evaluation of our generation methods (exs are shown in fig 1).
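The caption filter described above can be sketched as follows; the term list, length bounds, and preposition set here are illustrative assumptions, not the exact values used to build the collection.

```python
# Sketch of the caption filter (Sec 2.1). The term list, length bounds,
# and preposition set below are hypothetical stand-ins.
PREPOSITIONS = {"on", "under", "in", "at", "near", "behind"}   # assumed set
TERM_LIST = {"dog", "girl", "beach", "car", "tree", "sky"}     # assumed terms

def keep_caption(caption, min_len=4, max_len=30):
    """Keep captions of satisfactory length that contain at least 2
    term-list words and at least one prepositional word."""
    words = caption.lower().strip(".").split()  # simple whitespace tokenizer
    ok_length = min_len <= len(words) <= max_len
    ok_terms = sum(w in TERM_LIST for w in words) >= 2
    ok_prep = any(w in PREPOSITIONS for w in words)
    return ok_length and ok_terms and ok_prep

print(keep_caption("Little girl and her dog on the beach"))  # True
print(keep_caption("IMG_0042"))                              # False
```

In practice each test would run over the full Flickr download, with the real term lists covering all queried objects, attributes, actions, stuff, and scenes.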
As is usually the case with web photos, the photos in this set display a wide range of difficulty for visual recognition algorithms and captioning, from images that depict scenes (e.g. beaches), to images with relatively simple depictions (e.g. a horse in a field), to images with much more complex depictions (e.g. a boy handing out food to a group of people).

3 Global Description Generation

Internet vision papers have demonstrated that if your data set is large enough, some very challenging problems can be attacked with very simple matching methods [13, 27, 26]. In this spirit, we harness the power of web photo collections in a non-parametric approach. Given a query image, Iq, our goal is to generate a relevant description. We achieve this by computing the global similarity of a query image to our large web-collection of captioned images, C. We find the closest matching image (or images) and simply transfer over the description from the matching image to the query image. We also collect the 100 most similar images to a query – our matched set of images Im ∈ M – for use in our content based description generation method (Sec 5).

For image comparison we utilize two image descriptors. The first descriptor is the well known gist feature, a global image descriptor related to perceptual dimensions – naturalness, roughness, ruggedness, etc – of scenes. The second descriptor is also a global image descriptor, computed by resizing the image into a "tiny image", essentially a thumbnail of size 32x32. This helps us match not only scene structure, but also the overall color of images. To find visually relevant images we compute the similarity of the query image to images in C using a sum of gist similarity and tiny image color similarity (equally weighted).

Results – Size Matters!
Our global caption generation method is illustrated in the first 2 panes and the first 2 resulting captions of Fig 2. This simple method often performs surprisingly well. As reflected in past work [13, 27], image retrieval from small collections often produces spurious matches. This can be seen in Fig 3, where increasing data set size has a significant effect on the quality of retrieved global matches. Quantitative results also reflect this (see Table 1).

4 Image Content Estimation

Given an initial matched set of images Im ∈ M based on global descriptor similarity, we would like to re-rank the selected captions by incorporating estimates of image content. For a query image, Iq, and images in its matched set we extract and compare 5 kinds of image content:

• Objects (e.g. cats or hats), with shape, attributes, and actions – sec 4.1
• Stuff (e.g. grass or water) – sec 4.2
• People (e.g. man), with actions – sec 4.3
• Scenes (e.g. pasture or kitchen) – sec 4.4
• TFIDF weights (text or detector based) – sec 4.5

Each type of content is used to compute the similarity between matched images (and captions) and the query image. We then rank the matched images (and captions) according to each content measure and combine their results into an overall relevancy ranking (Sec 5).

4.1 Objects

Detection & Actions: Object detection methods have improved significantly in the last few years, demonstrating reasonable performance for a small number of object categories [7], or as a mid-level representation for scene recognition [20]. Running detectors on general web images, however, still produces quite noisy results, usually in the form of a large number of false positive detections.
As the number of object detectors increases this becomes even more of an obstacle to content prediction. However, we propose that if we have some prior knowledge about the content of an image, then we can utilize even these imperfect detectors. In our web collection, C, there are strong indicators of content in the form of caption words – if an object is described in the text associated with an image then it is likely to be depicted. Therefore, for the images, Im ∈ M, in our matched set we run only those detectors for objects (or stuff) that are mentioned in the associated caption. In addition, we also include synonyms and hyponyms for better content coverage, e.g. "dalmatian" triggers the "dog" detector. This produces pleasingly accurate detection results. For a query image we can essentially perform detection verification against the relatively clean matched image detections.

Specifically, we use mixtures of multi-scale deformable part detectors [10] to detect a wide variety of objects – 89 object categories selected to cover a reasonable range of common objects. These categories include the 20 Pascal categories, 49 of the most common object categories with reasonably effective detectors from Object Bank [20], and 20 additional common object categories.

For the 8 animate object categories in our list (e.g. cat, cow, duck) we find that detection performance can be improved significantly by training action specific detectors, for example "dog sitting" vs "dog running". This also aids similarity computation between a query and a matched image because objects can be matched at an action level. Our object action detectors are trained using the standard object detector with pose specific training data.

Representation: We represent and compare object detections using 2 kinds of features, shape and appearance.
To represent object shape we use a histogram of HoG [4] visual words, computed at intervals of 8 pixels and quantized into 1000 visual words. These are accumulated into a spatial pyramid histogram [18]. We also use an attribute representation to characterize object appearance. We use the attribute list from our previous work [15], which covers 21 visual aspects describing color (e.g. blue), texture (e.g. striped), material (e.g. wooden), general appearance (e.g. rusty), and shape (e.g. rectangular). Training images for the attribute classifiers come from Flickr, Google, the attribute dataset provided by Farhadi et al [8], and ImageNet [5]. An RBF kernel SVM is used to learn a classifier for each attribute term. Then appearance characteristics are represented as a vector of attribute responses to allow for generalization.

If we have detected an object category, c, in a query image window, Oq, and a matched image window, Om, then we compute the probability of an object match as:

P(Oq, Om) = e^(-Do(Oq, Om))

where Do(Oq, Om) is the Euclidean distance between the object (shape or attribute) vector in the query detection window and the matched detection window.

4.2 Stuff

In addition to objects, people often describe the stuff present in images, e.g. "grass". Because these categories are more amorphous and do not display defined parts, we use a region based classification method for detection. We train linear SVMs on the low level region features of [8] and histograms of Geometric Context output probability maps [14] to recognize: sky, road, building, tree, water, and grass stuff categories. While the low level features are useful for discriminating stuff by their appearance, the scene layout maps introduce a soft preference for certain spatial locations dependent on stuff type.
Training images and bounding boxes are taken from ImageNet and evaluated at test time on a coarsely sampled grid of overlapping square regions over whole images. Pixels in any region with a classification probability above a fixed threshold are treated as detections, and the max probability for a region is used as the potential value.

Figure 4: Results: Some good captions selected by our system for query images.

If we have detected a stuff category, s, in a query image region, Sq, and a matched image region, Sm, then we compute the probability of a stuff match as:

P(Sq, Sm) = P(Sq = s) * P(Sm = s)

where P(Sq = s) is the SVM probability of the stuff region detection in the query image and P(Sm = s) is the SVM probability of the stuff region detection in the matched image.

4.3 People & Actions

People often take pictures of people, making "person" the most commonly depicted object category in captioned images. We utilize effective recent work on pedestrian detectors to detect and represent people in our images. In particular, we make use of detectors from Bourdev et al [3] which learn poselets – parts that are tightly clustered in configuration and appearance space – from a large number of 2d annotated regions on person images in a max-margin framework. To represent activities, we use follow-on work from Maji et al [21] which classifies actions using the poselet activation vector. This has been shown to produce accurate activity classifiers for the 9 actions in the PASCAL VOC 2010 static image action classification challenge [7].
We use the outputs of these 9 classifiers as our action representation vector, to allow for generalization to other similar activities.

If we have detected a person, Pq, in a query image, and a person, Pm, in a matched image, we compute the probability that the people share the same action (pose) as:

P(Pq, Pm) = e^(-Dp(Pq, Pm))

where Dp(Pq, Pm) is the Euclidean distance between the person action vector in the query detection and the person action vector in the matched detection.

4.4 Scenes

The last commonly described kind of image content relates to the general scene where an image was captured. This often occurs when examining captioned photographs of vacation snapshots or general outdoor settings, e.g. "my dog at the beach". To recognize scene types we train discriminative multi-kernel classifiers using the large-scale SUN scene recognition data base and code [29]. We select 23 common scene categories for our representation, including indoor (e.g. kitchen), outdoor (e.g. beach), manmade (e.g. highway), and natural (e.g. pasture) settings.
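The object, people, and scene match scores all share the same exponential-of-distance form. A minimal sketch, with short toy vectors standing in for the actual high-dimensional shape, attribute, action, or scene descriptors:

```python
import math

def match_probability(vec_q, vec_m):
    """P(q, m) = exp(-D(q, m)), where D is the Euclidean distance between
    the query and matched-image content descriptors (toy stand-ins here)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_q, vec_m)))
    return math.exp(-dist)

# Identical descriptors score 1; the score decays toward 0 with distance.
print(match_probability([0.2, 0.5], [0.2, 0.5]))  # 1.0
print(match_probability([0.2, 0.5], [0.9, 0.1]))  # < 1.0
```

The same function covers Do, Dp, and Dl above; only the descriptor being compared changes.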
Again here we represent the scene descriptor as a vector of scene responses for generalization.

If a scene location, Lm, is mentioned in a matched image, then we compare the scene representation between our matched image and our query image, Lq, as:

P(Lq, Lm) = e^(-Dl(Lq, Lm))

where Dl(Lq, Lm) is the Euclidean distance between the scene vector computed on the query image and the scene vector computed on the matched image.

Figure 4 example captions: "Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind."; "Strange cloud formation literally flowing through the sky like a river in relation to the other clouds out there."; "Fresh fruit and vegetables at the market in Port Louis Mauritius."; "Clock tower against the sky."; "Tree with red leaves in the field in autumn."; "One monkey on the tree in the Ourika Valley Morocco"; "A female mallard duck in the lake at Luukki Espoo"; "The river running through town I cross over this to get to the train"; "Street dog in Lijiang"; "The sun was coming through the trees while I was sitting in my chair by the river"

Figure 5: Funny Results: Some particularly funny or poetic results.

4.5 TFIDF Measures

For a query image, Iq, we wish to select the best caption from the matched set, Im ∈ M. For all of the content measures described so far, we have computed the similarity of the query image content to the content of each matched image independently. We would also like to use information from the entire matched set of images and associated captions to predict importance. To reflect this, we calculate TFIDF on our matched sets. This is computed as usual as a product of term frequency (tf) and inverse document frequency (idf).
We calculate this weighting both in the standard sense for matched caption document words and for detection category frequencies (to compensate for more prolific object detectors):

tfidf_{i,j} = (n_{i,j} / Σ_k n_{k,j}) * log(|D| / |{j : t_i ∈ d_j}|)

We define our matched set of captions (images for detector based tfidf) to be our document, j, and compute the tfidf score where n_{i,j} represents the frequency of term i in the matched set of captions (number of detections for detector based tfidf). The inverse document frequency is computed as the log of the number of documents |D| divided by the number of documents containing the term i (documents with detections of type i for detector based tfidf).

5 Content Based Description Generation

For a query image, Iq, with global descriptor based matched images, Im ∈ M, we want to re-rank the matched images according to the similarity of their content to the query. We perform this re-ranking individually for each of our content measures: object shape, object attributes, people actions, stuff classification, and scene type (Sec 4). We then combine these individual rankings into a final combined ranking in two ways. The first method trains a linear regression model of feature ranks against BLEU scores. The second method divides our training set into two classes, positive images consisting of the top 50% of the training set by BLEU score, and negative images from the bottom 50%. A linear SVM is trained on this data with feature ranks as input. For both methods we perform 5 fold cross validation with a split of 400 training images and 100 test images to get average performance and standard deviation. For a novel query image, we return the captions from the top ranked image(s) as our result.

For an example matched caption like "The little boy sat in the grass with a ball", several types of content will be used to score the goodness of the caption.
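The tf-idf weighting of Sec 4.5 can be sketched directly from the formula above; here each "document" is simply a list of caption words, a simplification of the matched-set construction.

```python
import math

def tfidf(term, doc, corpus):
    """tfidf_{i,j} = (n_{i,j} / sum_k n_{k,j}) * log(|D| / |{j : t_i in d_j}|).
    `doc` is a list of words; `corpus` is a list of such documents.
    Assumes `term` appears in at least one document of the corpus."""
    tf = doc.count(term) / len(doc)
    n_containing = sum(term in d for d in corpus)
    idf = math.log(len(corpus) / n_containing)
    return tf * idf

# Toy corpus: "beach" appears in 1 of 3 documents, so it is weighted
# more heavily than the more common "dog".
docs = [["dog", "on", "beach"], ["dog", "in", "grass"], ["sunset", "sky"]]
score = tfidf("beach", docs[0], docs)
```

The detector-based variant replaces word counts with detection counts per category, as described above.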
This will be computed based on words in the caption for which we have trained content models. For example, for the word "ball" both the object shape and attributes will be used to compute the best similarity between a ball detection in the query image and a ball detection in the matched image. For the word "boy" an action descriptor will be used to compare the activity in which the boy is occupied between the query and the matched image. For the word "grass" stuff classifications will be used to compare detections between the query and the matched image. For each word in the caption tfidf overlap (sum of tfidf scores for the caption) is also used, as well as detector based tfidf for those words referring to objects. In the event that multiple objects (or stuff, people, or scenes) are mentioned in a matched image caption, the object (or stuff, people, or scene) based similarity measures will be a sum over the set of described terms. For the case where a matched image caption contains a word, but there is no corresponding detection in the query image, the similarity is not incorporated.

Figure 5 example captions: "I tried to cross the street to get in my car but you can see that I failed LOL."; "The tower is the highest building in Hong Kong."; "the water the boat was in"; "girl in a box that is a train"; "water under the bridge"; "small dog in the grass"; "walking the dog in the primeval forest"; "check out the face on the kid in the black hat he looks so enthused"; "shadows in the blue sky"

Results & Evaluation: Our content based captioning method often produces reasonable results (exs are shown in Fig 4). Usually results describe the main subject of the photograph (e.g. "Street dog in Lijiang", "One monkey on the tree in the Ourika Valley Morocco"). Sometimes they describe the depiction extremely well (e.g.
"Strange cloud formation literally flowing through the sky like a river...", "Clock tower against the sky"). Sometimes we even produce good descriptions of attributes (e.g. "Tree with red leaves in the field in autumn"). Other captions can be quite poetic (Fig 5) – a picture of a derelict boat captioned "The water the boat was in", a picture of monstrous tree roots captioned "Walking the dog in the primeval forest". Other times the results are quite funny. A picture of a flimsy wooden structure says, "The tower is the highest building in Hong Kong". Once in a while they are spookily apropos. A picture of a boy in a black bandana is described as "Check out the face on the kid in the black hat. He looks so enthused." – and he doesn't.

We also perform two quantitative evaluations. Several methods have been proposed to evaluate captioning [15, 9], including direct user ratings of relevance and BLEU score [24]. User rating tends to suffer from user variance as ratings are inherently subjective. The BLEU score on the other hand provides a simple objective measure based on n-gram precision. As noted in past work [15], BLEU is perhaps not an ideal measure due to large variance in human descriptions (human-human BLEU scores hover around 0.5 [15]). Nevertheless, we report it for comparison.

Method                                          BLEU
Global Matching (1k)                            0.0774 ± 0.0059
Global Matching (10k)                           0.0909 ± 0.0070
Global Matching (100k)                          0.0917 ± 0.0101
Global Matching (1 million)                     0.1177 ± 0.0099
Global + Content Matching (linear regression)   0.1215 ± 0.0071
Global + Content Matching (linear SVM)          0.1259 ± 0.0060

Table 1: Automatic Evaluation: BLEU score measured at 1.

As can be seen in Table 1, data set size has a significant effect on BLEU score; more data provides more similar and relevant matched images (and captions).
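The rank-combination step of Sec 5 reduces to a linear scoring over per-content rank features. A minimal sketch; the content types are those of Sec 4, but the weights below are illustrative stand-ins, not the values learned by the regression or SVM:

```python
# Sketch of combining per-content rank features into a final caption
# ranking (Sec 5). Weights would come from the linear regression / SVM
# trained against BLEU; the values here are assumed for illustration.
CONTENT_TYPES = ["shape", "attribute", "action", "stuff", "scene", "tfidf"]
WEIGHTS = [0.3, 0.1, 0.2, 0.15, 0.15, 0.1]  # assumed, not learned values

def combined_score(rank_features):
    """Lower ranks are better, so negate the weighted sum of ranks."""
    return -sum(w * r for w, r in zip(WEIGHTS, rank_features))

# Each candidate caption carries one rank per content type.
candidates = {
    "Street dog in Lijiang":   [1, 3, 2, 1, 2, 1],
    "Shadows in the blue sky": [5, 2, 6, 4, 1, 3],
}
best = max(candidates, key=lambda c: combined_score(candidates[c]))
print(best)  # Street dog in Lijiang
```

The linear SVM variant learns the weight vector from the top/bottom BLEU-split training classes; scoring at test time has the same dot-product form.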
Local content matching also improves the BLEU score somewhat over purely global matching.

In addition, we propose a new evaluation task where a user is presented with two photographs and one caption. The user must assign the caption to the more relevant image (care is taken to remove biases due to placement). For evaluation we use a query image and a caption generated by our method. The other image in the evaluation task is selected at random from the web-collection. This provides an objective and useful measure of caption relevance. As a sanity check of our evaluation measure, we also evaluate how well a user can discriminate between the original ground truth image that a caption was written about and a random image. We perform this evaluation on 100 images from our web-collection using Amazon’s Mechanical Turk service, and find that users are able to select the ground truth image 96% of the time. This demonstrates that the task is reasonable and that descriptions from our collection tend to be fairly visually specific and relevant. Considering the top retrieved caption produced by our final method – global plus local content matching with a linear SVM classifier – we find that users are able to select the correct image 66.7% of the time. Because the top caption is not always visually relevant to the query image even when the method is capturing some information, we also perform an evaluation considering the top 4 captions produced by our method. In this case, the best caption out of the top 4 is correctly selected 92.7% of the time. This demonstrates the strength of our content-based method at producing relevant captions for images.

6 Conclusion

We have described an effective caption generation method for general web images. This method relies on collecting and filtering a large data set of images from the internet to produce a novel web-scale captioned photo collection.
We present two variations on our approach, one that uses only global image descriptors to compose captions, and one that incorporates estimates of image content for caption generation.

References

[1] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.
[2] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, E. Learned-Miller, Y. Teh, and D. Forsyth. Names and faces. In CVPR, 2004.
[3] L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting people using mutually consistent poselet activations. In ECCV, 2010.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[6] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. Object recognition as machine translation. In ECCV, 2002.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html.
[8] A. Farhadi, I. Endres, D. Hoiem, and D. A. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[9] A. Farhadi, M. Hejrati, A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. A. Forsyth. Every picture tells a story: generating sentences for images. In ECCV, 2010.
[10] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.
[11] Y. Feng and M. Lapata. How many words is a picture worth? Automatic caption generation for news images. In ACL, pages 1239–1249, 2010.
[12] V. Ferrari and A. Zisserman.
Learning visual attributes. In NIPS, 2007.
[13] J. Hays and A. A. Efros. im2gps: estimating geographic information from a single image. In CVPR, 2008.
[14] D. Hoiem, A. A. Efros, and M. Hebert. Recovering surface layout from an image. Int. J. Comput. Vision, 75:151–172, October 2007.
[15] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Babytalk: Understanding and generating simple image descriptions. In CVPR, 2011.
[16] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, 2009.
[17] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[18] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching. In CVPR, June 2006.
[19] W. Li, W. Xu, M. Wu, C. Yuan, and Q. Lu. Extractive summarization using inter- and intra-event relevance. In Int Conf on Computational Linguistics, 2006.
[20] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010.
[21] S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In CVPR, 2011.
[22] R. Mihalcea. Language independent extractive summarization. In National Conference on Artificial Intelligence, pages 1688–1689, 2005.
[23] A. Nenkova, L. Vanderwende, and K. McKeown. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In SIGIR, 2006.
[24] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation.
In ACL, pages 311–318, 2002.
[25] D. R. Radev and T. Allison. MEAD – a platform for multidocument multilingual text summarization. In Int Conf on Language Resources and Evaluation, 2004.
[26] J. Tighe and S. Lazebnik. Superparsing: Scalable nonparametric image parsing with superpixels. In ECCV, 2010.
[27] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI, 30, 2008.
[28] K.-F. Wong, M. Wu, and W. Li. Extractive summarization using supervised and semi-supervised learning. In International Conference on Computational Linguistics, pages 985–992, 2008.
[29] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[30] B. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proc. IEEE, 98(8), 2010.