{"title": "Unsupervised Learning of Visual Sense Models for Polysemous Words", "book": "Advances in Neural Information Processing Systems", "page_first": 1393, "page_last": 1400, "abstract": "Polysemy is a problem for methods that exploit image search engines to build object category models. Existing unsupervised approaches do not take word sense into consideration. We propose a new method that uses a dictionary to learn models of visual word sense from a large collection of unlabeled web data. The use of LDA to discover a latent sense space makes the model robust despite the very limited nature of dictionary definitions. The definitions are used to learn a distribution in the latent space that best represents a sense. The algorithm then uses the text surrounding image links to retrieve images with high probability of a particular dictionary sense. An object classifier is trained on the resulting sense-specific images. We evaluate our method on a dataset obtained by searching the web for polysemous words. Category classification experiments show that our dictionary-based approach outperforms baseline methods.", "full_text": "Unsupervised Learning of Visual Sense Models for\n\nPolysemous Words\n\nKate Saenko\nMIT CSAIL\n\nCambridge, MA\n\nTrevor Darrell\n\nUC Berkeley EECS / ICSI\n\nBerkeley, CA\n\nsaenko@csail.mit.edu\n\ntrevor@eecs.berkeley.edu\n\nAbstract\n\nPolysemy is a problem for methods that exploit image search engines to build ob-\nject category models. Existing unsupervised approaches do not take word sense\ninto consideration. We propose a new method that uses a dictionary to learn mod-\nels of visual word sense from a large collection of unlabeled web data. The use\nof LDA to discover a latent sense space makes the model robust despite the very\nlimited nature of dictionary de\ufb01nitions. The de\ufb01nitions are used to learn a distri-\nbution in the latent space that best represents a sense. 
The algorithm then uses the\ntext surrounding image links to retrieve images with high probability of a particu-\nlar dictionary sense. An object classi\ufb01er is trained on the resulting sense-speci\ufb01c\nimages. We evaluate our method on a dataset obtained by searching the web for\npolysemous words. Category classi\ufb01cation experiments show that our dictionary-\nbased approach outperforms baseline methods.\n\n1 Introduction\n\nWe address the problem of unsupervised learning of object classi\ufb01ers for visually polysemous words.\nVisual polysemy means that a word has several dictionary senses that are visually distinct. Web\nimages are a rich and free resource compared to traditional human-labeled object datasets. Potential\ntraining data for arbitrary objects can be easily obtained from image search engines like Yahoo or\nGoogle. The drawback is that multiple word meanings often lead to mixed results, especially for\npolysemous words. For example, the query \u201cmouse\u201d returns multiple senses on the \ufb01rst page of\nresults: \u201ccomputer\u201d mouse, \u201canimal\u201d mouse, and \u201cMickey Mouse\u201d (see Figure 1.) The dataset thus\nobtained suffers from low precision of any particular visual sense.\n\nSome existing approaches attempt to \ufb01lter out unrelated images, but do not directly address poly-\nsemy. One approach involves bootstrapping object classi\ufb01ers from labeled image data [9], others\ncluster the unlabeled images into coherent components [6],[2]. However, most rely on a labeled seed\nset of inlier-sense images to initialize bootstrapping or to select the right cluster. The unsupervised\napproach of [12] bootstraps an SVM from the top-ranked images returned by a search engine, with\nthe assumption that they have higher precision for the category. 
However, for polysemous words, the top-ranked results are likely to include several senses.

We propose a fully unsupervised method that specifically takes word sense into account. The only input to our algorithm is a list of words (such as all English nouns) and their dictionary entries. Our method is multimodal, using both web search images and the text surrounding them in the document in which they are embedded. The key idea is to learn a text model of the word sense, using an electronic dictionary such as Wordnet together with a large amount of unlabeled text. The model is then used to retrieve images of a specific sense from the mixed-sense search results. One application is an image search filter that automatically groups results by word sense for easier navigation. However, our main focus in this paper is on using the re-ranked images as training data for an object classifier. The resulting classifier can predict not only the English word that best describes an input image, but also the correct sense of that word.

Figure 1: Which sense of “mouse”? Mixed-sense images returned from an image keyword search.

A human operator can often refine the search by using more sense-specific queries, for example, “computer mouse” instead of “mouse”. We explore a simple method that does this automatically by generating sense-specific search terms from entries in Wordnet (see Section 2.3). However, this method must rely on one- to three-word combinations and is therefore brittle. Many of the generated search terms are too unnatural to retrieve any results, e.g., “percoid bass”. Some retrieve many unrelated images, such as the term “ticker” used as an alternative to “watch”.
We regard this method as a baseline to our main approach, which overcomes these issues by learning a model of each sense from a large amount of text obtained by searching the web. Web text is more natural and is a closer match to the text surrounding web images than dictionary entries, which allows us to learn more robust models. Each dictionary sense is represented in the latent space of hidden “topics” learned empirically for the polysemous word.

To evaluate our algorithm, we collect a dataset by searching the Yahoo Search engine for five polysemous words: “bass”, “face”, “mouse”, “speaker” and “watch”. Each of these words has anywhere from three to thirteen noun senses. Experimental evaluation on this dataset includes both retrieval and classification of unseen images into specific visual senses.

2 Model

The inspiration for our method comes from the fact that text surrounding web images indexed by a polysemous keyword can be a rich source of information about the sense of that word. The main idea is to learn a probabilistic model of each sense, as defined by entries in a dictionary (in our case, Wordnet), from a large amount of unlabeled text. The use of a dictionary is key because it frees us from needing a labeled set of images to learn the visual sense model.

Since this paper is concerned with objects rather than actions, we restrict ourselves to entries for nouns. Like standard word sense disambiguation (WSD) methods, we make a one-sense-per-document assumption [14], and rely on words co-occurring with the image in the HTML document to indicate that sense. Our method consists of three steps: 1) discovering latent dimensions in text associated with a keyword, 2) learning probabilistic models of dictionary senses in that latent space, and 3) using the text-based sense models to construct sense-specific image classifiers.
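The three steps can be sketched end-to-end with toy stand-ins. Everything below is a simplification for illustration only: the paper's actual step 1 is LDA, and steps 2 and 3 use the probabilistic sense model of Section 2.2.

```python
# Toy skeleton of the three-step method. Each function body is a
# stand-in, not the paper's implementation.
from collections import Counter

def step1_latent_space(text_docs, K):
    # Stand-in for LDA topic discovery: treat the K most frequent
    # corpus words as one-word "topics".
    counts = Counter(w for doc in text_docs for w in doc)
    return [w for w, _ in counts.most_common(K)]

def step2_sense_weights(definition_words, topics):
    # Stand-in for learning P(s|z): weight each topic by whether it
    # appears in the dictionary definition of the sense.
    return [1.0 if t in definition_words else 0.0 for t in topics]

def step3_rank_contexts(image_contexts, weights, topics):
    # Stand-in for ranking images by P(s|d): score each image's text
    # context against the sense weights; the top-ranked images would
    # then train the object classifier.
    def score(ctx):
        return sum(w for t, w in zip(topics, weights) if t in ctx)
    return sorted(image_contexts, key=score, reverse=True)

# Tiny worked example with two senses of "bass".
docs = [["fishing", "lake", "bass"], ["guitar", "bass", "amp"],
        ["lake", "boat"], ["guitar", "chord"]]
topics = step1_latent_space(docs, K=4)
fish_weights = step2_sense_weights({"fishing", "lake", "fish"}, topics)
ranked = step3_rank_contexts(docs, fish_weights, topics)
# The fishing-related context ranks first for the "fish" sense.
```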
We will now describe each step in detail.

2.1 Latent Text Space

Unlike words in text commonly used in WSD, image links are not guaranteed to be surrounded by grammatical prose. This makes it difficult to extract structured features such as part-of-speech tags. We therefore take a bag-of-words approach, using all available words near the image link to evaluate the probability of the sense. The first idea is to use a large collection of such bags-of-words to learn coherent dimensions which align with different senses or uses of the word.

We could use one of several existing techniques to discover latent dimensions in documents consisting of bags-of-words. We choose to use Latent Dirichlet Allocation, or LDA, as introduced by Blei et al. [4]. LDA discovers hidden topics, i.e., distributions over discrete observations (such as words), in the data. Each document is modeled as a mixture of topics z ∈ {1, ..., K}. A given collection of M documents, each containing a bag of N_d words, is assumed to be generated by the following process: First, we sample the parameters φ_j of a multinomial distribution over words from a Dirichlet prior with parameter β for each topic j = 1, ..., K. Then, for each document d, we sample the parameters θ_d of a multinomial distribution over topics from a Dirichlet prior with parameter α. Finally, for each word token i, we choose a topic z_i from the multinomial θ_d, and then choose a word w_i from the multinomial φ_{z_i}. The probability of generating a document is defined as

$$P(w_1, \ldots, w_{N_d} \mid \phi, \theta_d) = \prod_{i=1}^{N_d} \sum_{z=1}^{K} P(w_i \mid z, \phi)\, P(z \mid \theta_d) \qquad (1)$$

Our initial approach was to learn hidden topics using LDA directly on the words surrounding the images. However, while the resulting topics were often aligned along sense boundaries, the approach suffered from over-fitting, due to the irregular quality and low quantity of the data.
Often, the only clue to the image's sense is a short text fragment, such as “fishing with friends” for an image returned for the query “bass”. To alleviate the over-fitting problem, we instead create an additional dataset of text-only web pages returned from regular web search. We then learn an LDA model on this dataset and use the resulting distributions to train a model of the dictionary senses, described next.

2.2 Dictionary Sense Model

We use the limited text available in the Wordnet entries to relate dictionary senses to the topics formed above. For example, sense 1 of “bass” contains the definition “the lowest part of the musical range.” To these words we also add the synonyms (e.g., “pitch”), the hyponyms, if they exist, and the first-level hypernyms (e.g., “sound property”). We denote the bag-of-words extracted from such a dictionary entry for sense s as e_s = w_1, w_2, ..., w_{E_s}, where E_s is the number of words in the bag. The model is trained as follows: Given a query word with sense s ∈ {1, 2, ..., S}, we define the likelihood of a particular sense given the topic j as

$$P(s \mid z = j) \equiv \frac{1}{E_s} \sum_{i=1}^{E_s} P(w_i \mid z = j), \qquad (2)$$

or the average likelihood of words in the definition. For a web image with an associated text document d = w_1, w_2, ..., w_D, the model computes the probability of a particular sense as

$$P(s \mid d) = \sum_{j=1}^{K} P(s \mid z = j)\, P(z = j \mid d). \qquad (3)$$

The above requires the distribution of LDA topics in the text context, P(z|d), which we compute by marginalizing across words and using Bayes' rule:

$$P(z = j \mid d) = \sum_{i=1}^{D} P(z = j \mid w_i) = \sum_{i=1}^{D} \frac{P(w_i \mid z = j)\, P(z = j)}{P(w_i)}, \qquad (4)$$

and also normalizing for the length of the text context. Finally, we define the probability of a particular dictionary sense given the image to be equal to P(s|d).
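To make Equations 2-4 concrete, the sense probability can be sketched with toy numbers; the topic distributions below are invented for illustration, whereas in the actual system they come from the LDA model trained on the text-only dataset.

```python
# Toy sketch of Equations 2-4. The topic-word distributions and topic
# prior are made-up values, not learned parameters.
vocab = ["fish", "lake", "guitar", "amp"]
p_w_given_z = [[0.4, 0.4, 0.1, 0.1],   # P(w | z=0): a "fishing" topic
               [0.1, 0.1, 0.4, 0.4]]   # P(w | z=1): a "music" topic
p_z = [0.5, 0.5]
K = len(p_z)

def p_word(w):
    # Marginal P(w) = sum_j P(w | z=j) P(z=j).
    i = vocab.index(w)
    return sum(p_w_given_z[j][i] * p_z[j] for j in range(K))

def sense_given_topic(definition, j):
    # Eq. 2: P(s | z=j) is the average P(w_i | z=j) over definition words.
    return sum(p_w_given_z[j][vocab.index(w)] for w in definition) / len(definition)

def topic_given_doc(context):
    # Eq. 4: sum the Bayes-inverted P(z=j | w_i) over the context words,
    # then normalize over topics.
    raw = [sum(p_w_given_z[j][vocab.index(w)] * p_z[j] / p_word(w)
               for w in context) for j in range(K)]
    total = sum(raw)
    return [r / total for r in raw]

def sense_given_doc(definition, context):
    # Eq. 3: P(s | d) = sum_j P(s | z=j) P(z=j | d).
    pzd = topic_given_doc(context)
    return sum(sense_given_topic(definition, j) * pzd[j] for j in range(K))

fish_def = ["fish", "lake"]
p_fishing = sense_given_doc(fish_def, ["fish", "lake"])    # ≈ 0.34
p_music = sense_given_doc(fish_def, ["guitar", "amp"])     # ≈ 0.16
```

A fishing-related context scores higher under the fish-sense definition than a music-related one, which is exactly the ordering the re-ranking step exploits.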
Thus, our model is able to assign sense probabilities to images returned from the search engine, which in turn allows us to group the images according to sense.

2.3 Visual Sense Model

The last step of our algorithm uses the sense model learned in the first two steps to generate training data for an image-based classifier. The choice of classifier is not a crucial part of the algorithm. We choose to use a discriminative classifier, in particular, a support vector machine (SVM), because of its ability to generalize well in high-dimensional spaces without requiring a lot of training data.

Table 1: Dataset description: sizes of the three datasets, and distribution of ground truth sense labels in the keyword dataset (positive = good; negative = partial or unrelated).

category   text-only   sense-term   keyword   positive   negative
Bass          984         357          678       146        532
Face          961         798          756       130        626
Mouse         987         726          768       198        570
Speaker       984        2270          660       235        425
Watch         936        2373          777       512        265

For each particular sense s, the model re-ranks the images according to the probability of that sense, and selects the N highest-ranked examples as positive training data for the SVM. The negative training data is drawn from a “background” class, which in our case is the union of all other objects that we are asked to classify. We represent images as histograms of visual words, which are obtained by detecting local interest points and vector-quantizing their descriptors using a fixed visual vocabulary.

We compare our model with a simple baseline method that attempts to refine the search by automatically generating search terms from the dictionary entry. Experimentally, we found that queries consisting of more than about three terms returned very few images.
Consequently, the terms are generated by appending the polysemous word to its synonyms and first-level hypernyms. For example, sense 4 of “mouse” has synonym “computer mouse” and hypernym “electronic device”, which produces the terms “computer mouse” and “mouse electronic device”. An SVM classifier is then trained on the returned images.

3 Datasets

To train and evaluate the outlined algorithms, we use three datasets: image search results using the given keyword, image search results using sense-specific search terms, and text search results using the given keyword.

The first dataset was collected automatically by issuing queries to the Yahoo Image Search™ website and downloading the returned images and HTML web pages. The keywords used were: “bass”, “face”, “mouse”, “speaker” and “watch”. In the results, “bass” can refer to a fish or a musical term, as in “bass guitar”; “face” has a multitude of meanings, as in “human face”, “animal face”, “mountain face”, etc.; “speaker” can refer to audio speakers or human speakers; “watch” can mean a timepiece, the act of watching, as in “hurricane watch”, or the action, as in “watch out!” Samples that had dead page links and/or corrupted images were removed from the dataset.

The images were labeled by a human annotator with one sense per keyword. The annotator labeled the presence of the following senses: “bass” as in fish, “face” as in a human face, “mouse” as in computer mouse, “speaker” as in an audio output device, and “watch” as in a timepiece. The annotator saw only the images, and not the text or the dictionary definitions. The labels used were 0: unrelated, 1: partial, or 2: good.
Images where the object was too small or occluded were labeled partial. For evaluation, we used only good labels as positive, and grouped partial and unrelated images into the negative class. The labels were only used in testing, and not in training. The second image search dataset was collected in a similar manner but using the generated sense-specific search terms. The third, text-only dataset was collected via regular web search for the original keywords. Neither of these two datasets was labeled. Table 1 shows the sizes of the datasets and the distribution of labels.

4 Features

When extracting words from web pages, all HTML tags are removed, and the remaining text is tokenized. A standard stop-word list of common English words, plus a few domain-specific words like “jpg”, is applied, followed by a Porter stemmer [11]. Words that appear only once, as well as the actual word used as the query, are pruned. To extract text context words for an image, the image link is located automatically in the corresponding HTML page. All word tokens in a 100-token window surrounding the location of the image link are extracted. The text vocabulary size used for the sense model ranges between 12K and 20K words for different keywords.

To extract image features, all images are resized to 300 pixels in width and converted to grayscale. Two types of local feature points are detected in the image: edge features [6] and scale-invariant salient points. In our experiments, we found that using both types of points boosts classification performance relative to using just one type. To detect edge points, we first perform Canny edge detection, and then sample a fixed number of points along the edges from a distribution proportional to edge strength. The scales of the local regions around points are sampled uniformly from the range of 10-50 pixels.
To detect scale-invariant salient points, we use the Harris-Laplace [10] detector\nwith the lowest strength threshold set to 10. Altogether, 400 edge points and approximately the\nsame number of Harris-Laplace points are detected per image. A 128-dimensional SIFT descriptor\nis used to describe the patch surrounding each interest point. After extracting a bag of interest point\ndescriptors for each image, vector quantization is performed. A codebook of size 800 is constructed\nby k-means clustering a randomly chosen subset of the database (300 images per keyword), and\nall images are converted to histograms over the resulting visual words. To be precise, the \u201cvisual\nwords\u201d are the cluster centers (codewords) of the codebook. No spatial information is included in\nthe image representation, but rather it is treated as a bag-of-words.\n\n5 Experiments\n\n5.1 Re-ranking Image Search Results\n\nIn the \ufb01rst set of experiments, we evaluate how well our text-based sense model can distinguish\nbetween images depicting the correct visual sense and all the other senses. We train a separate LDA\nmodel for each keyword on the text-only dataset, setting the number of topics K to 8 in each case.\nAlthough this number is roughly equal to the average number of senses for the given keywords, we\ndo not expect nor require each topic to align with one particular sense. In fact, multiple topics can\nrepresent the same sense. Rather, we treat K as the dimensionality of the latent space that the model\nuses to represent senses. While our intuition is that it should be on the order of the number of senses,\nit can also be set automatically by cross-validation. In our initial experiments, different values of K\ndid not signi\ufb01cantly alter the results.\n\nTo perform inference in LDA, a number of approximate inference algorithms can be applied. 
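As a toy illustration of this family of algorithms, a collapsed Gibbs sampler for LDA can be sketched as follows. This is a minimal stand-alone sketch, not the toolbox implementation used in the experiments below.

```python
import random

def lda_gibbs(docs, K, V, alpha, beta, iters=300, seed=0):
    # Collapsed Gibbs sampling for LDA: repeatedly resample each token's
    # topic from P(z_i=j | z_-i, w), which is proportional to
    #   (n_wj + beta) / (n_j + V*beta) * (n_dj + alpha).
    rng = random.Random(seed)
    n_wj = [[0] * K for _ in range(V)]   # word-topic counts
    n_j = [0] * K                        # tokens per topic
    n_dj = [[0] * K for _ in docs]       # doc-topic counts
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            j = z[d][i]
            n_wj[w][j] += 1; n_j[j] += 1; n_dj[d][j] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                n_wj[w][j] -= 1; n_j[j] -= 1; n_dj[d][j] -= 1
                weights = [(n_wj[w][k] + beta) / (n_j[k] + V * beta)
                           * (n_dj[d][k] + alpha) for k in range(K)]
                j = rng.choices(range(K), weights=weights)[0]
                z[d][i] = j
                n_wj[w][j] += 1; n_j[j] += 1; n_dj[d][j] += 1
    # Point estimate of the topic-word distributions P(w | z=j).
    return [[(n_wj[w][j] + beta) / (n_j[j] + V * beta) for w in range(V)]
            for j in range(K)]

# Toy corpus: word ids {0,1} and {2,3} form two obvious "topics",
# which the sampler should largely recover.
docs = [[0, 1, 0, 1]] * 5 + [[2, 3, 2, 3]] * 5
phi = lda_gibbs(docs, K=2, V=4, alpha=0.5, beta=0.01)
```

On this toy corpus the sampler typically concentrates each topic's mass on one of the two word pairs, mirroring how topics can align with word senses in the real data.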
We\nuse a Gibbs sampling approach of [7], implemented in the Matlab Topic Modeling Toolbox [13].\nWe used symmetric Dirichlet priors with scalar hyperparameters \u03b1 = 50/K and \u03b2 = 0.01, which\nhave the effect of smoothing the empirical topic distribution, and 1000 iterations of Gibbs sampling.\n\nThe LDA model provides us with topic distributions P (w|z) and P (z). We complete training the\nmodel by computing P (s|z) for each sense s in Wordnet, as in Equation 2. We train a separate model\nfor each keyword. We then compute P (s|d) for all text contexts d associated with images in the\nkeyword dataset, using Equation 3, and rank the corresponding images according to the probability\nof each sense. Since we only have ground truth labels for a single sense per keyword (see Section\n3), we evaluate the retrieval performance for that particular ground truth sense. Figure 2 shows\nthe resulting ROCs for each keyword, computed by thresholding P (s|d). For example, the \ufb01rst\nplot shows ROCs obtained by the eight models corresponding to each of the senses of the keyword\n\u201cbass\u201d. The thick blue curve is the ROC obtained by the original Yahoo retrieval order. The other\nthick curves show the dictionary sense models that correspond to the ground truth sense (a \ufb01sh). The\nresults demonstrate that we are able to learn a useful sense model that retrieves far more positive-\nclass images than the original search engine order. This is important in order for the \ufb01rst step of\nour method to be able to improve the precision of training data used in the second step. Note that,\nfor some keywords, there are multiple dictionary de\ufb01nitions that are dif\ufb01cult to distinguish visually,\nfor example, \u201chuman face\u201d and \u201cfacial expression\u201d. 
In our evaluation, we did not make such fine-grained distinctions, but simply chose the sense that applied most generally.

In interactive applications, the human user can specify the intended sense of the word by providing an extra keyword, such as by saying or typing “bass fish”. The correct dictionary sense can then be selected by evaluating the probability of the extra keyword under each sense model, and choosing the highest-scoring one.

Figure 2: Retrieval of the ground truth sense from keyword search results. Thick blue lines are the ROCs for the original Yahoo search ranks. Other thick lines are the ROCs obtained by our dictionary model for the true senses, and thin lines are the ROCs obtained for the other senses.

5.2 Classifying Unseen Images

The goal of the second set of experiments is to evaluate the dictionary-based object classifier. We train a classifier for the object corresponding to the ground-truth sense of each polysemous keyword in our data. The classifiers are binary, assigning a positive label to the correct sense and a negative label to incorrect senses and all other objects. The top N unlabeled images ranked by the sense model are selected as positive training images. The unlabeled pool used in our model consists of both the keyword and the sense-term datasets. N negative images are chosen at random from positive data for all other keywords. A binary SVM with an RBF kernel is trained on the image features, with the C and γ parameters chosen by four-fold cross-validation. The baseline search-terms algorithm that we compare against is trained on a random sample of N images from the sense-term dataset.
Figure 3: Classification accuracy for the search-terms baseline (terms) and our dictionary model (dict): (a) 1-SENSE test set, (b) MIX-SENSE test set, (c) 1-SENSE average vs. N.

Recall that this dataset was collected by simply searching with word combinations extracted from the target sense definition. Training on the first N images returned by Yahoo did not qualitatively change the results.

We evaluate the method on two test cases. In the first case, the negative class consists of only the ground-truth senses of the other objects. We refer to this as the 1-SENSE test set. In the second case, the negative class also includes other senses of the given keyword. For example, we test detection of “computer mouse” among other keyword objects as well as “animal mouse”, “Mickey Mouse” and other senses returned by the search, including unrelated images. We refer to this as the MIX-SENSE test set. Figure 3 compares the classification accuracy of our classifier to the baseline search-terms classifier. Average accuracy across ten trials with different random splits into train and test sets is shown for each object. Figure 3(a) shows results on 1-SENSE and 3(b) on MIX-SENSE, with N equal to 250. Figure 3(c) shows 1-SENSE results averaged over the categories, at different numbers of training images N. In both test cases, our dictionary method significantly improves on the baseline algorithm. As the per-object results show, we do much better for three of the five objects, and comparably for the other two.
One explanation why we do not see a large improvement in the latter cases is that the automatically generated sense-specific search terms happened to return relatively high-precision images. However, in the other three cases, the term generation fails while our model is still able to capture the dictionary sense.

6 Related Work

A complete review of WSD work is beyond the scope of the present paper. Yarowsky [14] proposed an unsupervised WSD method, and suggested the use of dictionary definitions as an initial seed.

Several approaches to building object models using image search results have been proposed, although none have specifically addressed polysemous words. Fei-Fei et al. [9] bootstrap object classifiers from existing labeled image data. Fergus et al. [6] cluster in the image domain and use a small validation set to select a single positive component. Schroff et al. [12] incorporate text features (such as whether the keyword appears in the URL) and use them to re-rank the images before training the image model. However, the text ranker is category-independent and does not learn which words are predictive of a specific sense. Berg et al. [2] discover topics using LDA in the text domain, and then use them to cluster the images. However, their method requires manual intervention by the user to sort the topics into positive and negative for each category. The combination of image and text features is used in some web retrieval methods (e.g., [5]); however, our work is focused not on instance-based image retrieval, but on category-level modeling.

A related problem is modeling images annotated with words, such as the caption “sky, airplane”, which are assigned by a human labeler. Barnard et al. [1] use visual features to help disambiguate word senses in such loosely labeled data.
Models of annotated images assume that there is a correspondence between each image region and a word in the caption (e.g., Corr-LDA [3]). Such models predict words, which serve as category labels, based on image content. In contrast, our model predicts a category label based on all of the words in the web image's text context. In general, a text context word does not necessarily have a corresponding visual region, and vice versa.

In work closely related to Corr-LDA, a People-LDA [8] model is used to guide topic formation in news photos and captions, using a specialized face recognizer. The caption data is less constrained than annotations, including non-category words, but still far more constrained than text contexts.

7 Conclusion

We introduced a model that uses a dictionary and text contexts of web images to disambiguate image senses. To the best of our knowledge, it is the first use of a dictionary in either web-based image retrieval or classifier learning. Our approach harnesses the large amount of unlabeled text available through keyword search on the web in conjunction with the dictionary entries to learn a generative model of sense. Our sense model is purely unsupervised, and is appropriate for web images. The use of LDA to discover a latent sense space makes the model robust despite the very limited nature of dictionary definitions. The definition text is used to learn a distribution over the empirical text topics that best represents the sense. As a final step, a discriminative classifier is trained on the re-ranked mixed-sense images that can predict the correct sense for novel images.

We evaluated our model on a large dataset of over 10,000 images consisting of search results for five polysemous words. Experiments included retrieval of the ground truth sense and classification of unseen images.
On the retrieval task, our dictionary model improved on the baseline search engine precision by re-ranking the images according to sense probability. On the classification task, our method outperformed a baseline method that attempts to refine the search by generating sense-specific search terms from Wordnet entries. Classification also improved when the test objects included the other senses of the keyword, making distinctions such as “loudspeaker” vs. “invited speaker”. Of course, we would not expect the dictionary senses to always produce accurate visual models, as many senses do not refer to objects (e.g., “bass voice”). Future work will include annotating the data with more senses to further explore the “visualness” of some of them.

References

[1] K. Barnard, K. Yanai, M. Johnson, and P. Gabbur. Cross modal disambiguation. In Toward Category-Level Object Recognition, J. Ponce, M. Hebert, C. Schmidt, eds., Springer-Verlag LNCS Vol. 4170, 2006.

[2] T. Berg and D. Forsyth. Animals on the web. In Proc. CVPR, 2006.

[3] D. Blei and M. Jordan. Modeling annotated data. In Proc. International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127-134. ACM Press, 2003.

[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. J. Machine Learning Research, 3:993-1022, Jan 2003.

[5] Z. Chen, L. Wenyin, F. Zhang, and M. Li. Web mining for web image retrieval. J. of the American Society for Information Science and Technology, 51(10), pages 831-839, 2001.

[6] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In Proc. ICCV, 2005.

[7] T. Griffiths and M. Steyvers. Finding scientific topics. In Proc. of the National Academy of Sciences, 101 (suppl. 1), pages 5228-5235, 2004.

[8] V. Jain, E. Learned-Miller, and A. McCallum.
People-LDA: Anchoring Topics to People using Face Recognition. In Proc. ICCV, 2007.

[9] J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: automatic Object Picture collecTion via Incremental MOdel Learning. In Proc. CVPR, 2007.

[10] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 2004.

[11] M. Porter. An algorithm for suffix stripping. Program, 14(3), pages 130-137, 1980.

[12] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. In Proc. ICCV, 2007.

[13] M. Steyvers and T. Griffiths. Matlab Topic Modeling Toolbox. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

[14] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL, 1995.
", "award": [], "sourceid": 710, "authors": [{"given_name": "Kate", "family_name": "Saenko", "institution": null}, {"given_name": "Trevor", "family_name": "Darrell", "institution": null}]}