{"title": "A Discriminative Latent Model of Image Region and Object Tag Correspondence", "book": "Advances in Neural Information Processing Systems", "page_first": 2397, "page_last": 2405, "abstract": "We propose a discriminative latent model for annotating images with unaligned object-level textual annotations. Instead of using the bag-of-words image representation currently popular in the computer vision community, our model explicitly captures more intricate relationships underlying visual and textual information. In particular, we model the mapping that translates image regions to annotations. This mapping allows us to relate image regions to their corresponding annotation terms. We also model the overall scene label as latent information. This allows us to cluster test images. Our training data consist of images and their associated annotations. But we do not have access to the ground-truth region-to-annotation mapping or the overall scene label. We develop a novel variant of the latent SVM framework to model them as latent variables. Our experimental results demonstrate the effectiveness of the proposed model compared with other baseline methods.", "full_text": "A Discriminative Latent Model of Image Region and Object Tag Correspondence\n\nYang Wang\u2217\n\nDepartment of Computer Science\n\nUniversity of Illinois at Urbana-Champaign\n\nyangwang@uiuc.edu\n\nGreg Mori\n\nSchool of Computing Science\n\nSimon Fraser University\n\nmori@cs.sfu.ca\n\nAbstract\n\nWe propose a discriminative latent model for annotating images with unaligned\nobject-level textual annotations. Instead of using the bag-of-words image representation\ncurrently popular in the computer vision community, our model explicitly\ncaptures more intricate relationships underlying visual and textual information.\nIn particular, we model the mapping that translates image regions to annotations.\nThis mapping allows us to relate image regions to their corresponding\nannotation terms. 
We also model the overall scene label as latent information.\nThis allows us to cluster test images. Our training data consist of images and their\nassociated annotations. But we do not have access to the ground-truth region-to-annotation\nmapping or the overall scene label. We develop a novel variant of\nthe latent SVM framework to model them as latent variables. Our experimental\nresults demonstrate the effectiveness of the proposed model compared with other\nbaseline methods.\n\n1 Introduction\n\nImage understanding is a central problem in computer vision that has been extensively studied\nthrough a variety of tasks. Some previous work focuses on classifying an image with\na single label [6]. Others go beyond single labels and assign a list of annotations to an image\n[1, 10, 21]. Recently, efforts have been made to combine various tasks (e.g. classification, annotation,\nsegmentation) to achieve a more complete understanding of an image [11, 12]. In this\npaper, we consider the problem of image understanding with unaligned textual annotations. In\nparticular, we focus on the scenario where the annotations represent the names of the objects present\nin an image. The input to our learning algorithm is a set of images with unaligned textual annotations\n(object names). Our goal is to learn a model to predict the annotation (i.e. object names) for a new\nimage. As a by-product, our model also roughly localizes the image regions corresponding to the\nannotation (see Fig. 1). The main contribution of this paper is the development of a model that\nincorporates this object annotation to image region correspondence in a discriminative framework.\n\nIn the computer vision literature, there has been a lot of work on exploiting images and their associated\ntextual information. Barnard et al. [1] predict words associated with whole images or regions\nby learning a joint distribution of image regions and words. Berg et al. 
[3] learn to name faces appearing\nin news pictures by learning a probabilistic model of face appearances, names, and textual\ncontexts. Wang et al. [21] use a learned bag-of-words topic model to simultaneously classify and\nannotate images. Loeff et al. [13] discover scenes by exploiting the correlation between images\nand their annotations. Some recent work towards total scene understanding [11, 12] tries to build\nsophisticated generative models that jointly perform several tasks, e.g. scene classification, object\nrecognition, image annotation, and image segmentation.\n\n\u2217Work done while the author was with Simon Fraser University.\n\n[Figure 1 image omitted; annotation terms shown in the figure: mallet, athlete, horse, ground, tree.]\n\nFigure 1: Our goal is to learn a model using images and their associated unaligned textual object annotations\n(a) as the training data. Given a new image (b), we can use the model to predict its textual annotations and\nroughly localize image regions corresponding to each of the annotation terms (c).\n\nMost of the previous work uses fairly crude \u201cbag-of-words\u201d models, treating image features (extracted\nfrom either segmented regions or local interest points) and textual annotations as unordered\nentities and looking at their co-occurrence statistics. Very little work explicitly models more detailed\nrelationships between image regions and annotations that are obvious to humans. For example, if\nan image is over-segmented into a large number of segments, each segment typically corresponds\nto at most one object. However, most of the previous work ignores this constraint and allows\nan image region to be used as evidence to explain different objects mentioned by the annotations.\nIn this paper, we present a discriminative latent model that captures image regions, textual annotations,\nmappings between visual and textual information, and overall scene labels in a more explicit\nmanner. 
Some work [1, 3] tries to incorporate the mapping information into a generative model.\nHowever, due to the limitations of the machine learning tools used in those works, they did not properly\nenforce the aforementioned constraint on how image regions are mapped to annotations. There\nis also work [2] on augmenting training data with this mapping information, but it is unclear how it\ncan be generalized to test data. With the recent advancement in learning with complex structured\ndata [7, 18, 21, 25], we believe now is the time to revisit this line of ideas and examine other\nmodeling tools.\n\nThe work by Socher et al. [17] is the most relevant to ours. In that work, they learn to annotate\nand segment images by mapping image regions and textual words to a latent meaning space using\ncontext and adjective features. There are important distinctions between our work and [17]. First of\nall, the input to [17] is a set of images (a handful of which are manually labeled) of a single sport\ncategory, and a collection of news articles for that sport. The news articles are generic for that sport,\nand the images are not the news photographs directly associated with those news articles. Although\nthey have experimented with applying their model to image collections with mixed sport categories,\ntheir method seems to work better with single sport category training. In contrast, the input to our\nlearning problem is a set of images from several sport categories, together with their associated\ntextual annotations. We treat the sport category as a latent variable (we call it the scene label) and\nimplicitly infer it during learning.\n\n2 Model\n\nWe propose a discriminative latent model that jointly captures the relationships between image segments,\ntextual annotations, region-text correspondence, and overall image visual scene labels. Of\ncourse, only the image segments and textual annotations are observed on training data. All the other\ninformation (e.g. 
scene labels, the mapping between regions and annotations) are treated as latent\nvariables in the model. A graphical illustration of our model is shown in Fig. 2.\nThe input to our learning module is a set of \u27e8x, y\u27e9 pairs where x denotes an image, and y denotes the\nannotation associated with this image. We partition the image into R regions using the segmentation\nalgorithm in [8], i.e. x = [x1, x2, ..., xR]. For each image region xi, we extract four types of visual\nfeatures (see [14]): shape, texture, color, and location. Each of these feature types is vector quantized\nto obtain codewords for this feature type. Following [17], we use 20, 25, 40, 8 codewords for each\nof the four feature types, respectively. In the end, each region xi is represented as a 4-dimensional\nvector xi = (xi1, xi2, xi3, xi4), where each xic is the corresponding codeword of the c-th feature\ntype for this region.\nThe annotation y of an image is represented as a binary vector y = (y1, y2, ..., yV ), where V is the\ntotal number of possible annotation terms. As a terminological convention, we use \u201cannotation\u201d to\ndenote the vector y and \u201cannotation term\u201d to denote each component yj of the vector. An annotation\nterm yj is \u201cactive\u201d (yj = 1) if it is associated with this image, and is \u201cinactive\u201d (yj = 0) otherwise.\nWe further assume the number of regions of an image is larger than or equal to the number of active\nannotation terms for an image, i.e. $R \\geq \\sum_{j=1}^{V} y_j$. In this work, we assume there are no visually\nirrelevant annotation terms (e.g. \u201cwind\u201d), and there are no annotation terms (e.g. \u201cpeople\u201d and\n\u201cathlete\u201d) of an image that refer to the same concept. Both assumptions can be satisfied by pre-processing the\nannotation terms with WordNet (see [17]).\n\n[Figure 2 image omitted; it shows the image regions and an annotation vector with car = 0, dog = 0, horse = 1, athlete = 1.]\n\nFigure 2: Graphical illustration of our model. An input image is segmented into several regions. The annotation\nof the image is represented as a 0-1 vector indicating the presence/absence of each possible annotation term.\nOur model captures the unobserved mapping that translates image regions to annotation terms associated with\nthe image (e.g. horse, athlete). For annotation terms not associated with the image (e.g. car, dog), there are no\nmapped image regions. Our model also captures the relationship between the unobserved scene label (e.g. polo)\nand image regions/annotations.\n\n
Given an image x and its annotation y, we assume there is an underlying unobserved many-to-one\nmapping which maps the R image regions to the active annotation terms. We require the\nmapping to satisfy the following conditions: (i) each image region is mapped to at most one annotation\nterm. This condition will ensure that an image region is not used to explain two different\nannotations; (ii) an active annotation term has one or more image regions mapped to it. This condition\nwill make sure that if an annotation term (say \u201cbuilding\u201d) is assigned to an image, there is\nat least one image region supporting this annotation term; (iii) an inactive annotation term has no\nimage regions mapped to it. This condition will guarantee there are no image regions supporting an\ninactive annotation term.\nMore formally, we introduce a matrix z = {zij : 1 \u2264 i \u2264 R, 1 \u2264 j \u2264 V } defined in the following\nto represent this mapping for an image with R regions:\n\n$$z_{ij} = \\begin{cases} 1 & \\text{if the } i\\text{-th image region is mapped to the } j\\text{-th annotation term} \\\\ 0 & \\text{otherwise} \\end{cases} \\qquad (1)$$\n\nWe use Y to denote the domain of all possible assignments of y. For a fixed annotation y, we use\nZ(y) to denote the set of all possible many-to-one mappings that satisfy the conditions (i,ii,iii). 
It is\neasy to verify that any z \u2208 Z(y) can be represented using the following three sets of constraints:\n\n$$\\sum_{j} z_{ij} \\leq 1, \\ \\forall i; \\qquad \\max_{i} z_{ij} = y_j, \\ \\forall j; \\qquad z_{ij} \\in \\{0, 1\\}, \\ \\forall i, \\forall j \\qquad (2)$$\n\nFor a given image, we also assume a discrete unobserved \u201cscene\u201d label s which takes its value\nbetween 1 and S. We introduce the scene label to capture the fact that the annotations of images are\ntypically well clustered according to their underlying scenes. For example, an image of a \u201csailing\u201d\nscene tends to have annotation terms like \u201cathlete\u201d, \u201csailboat\u201d, \u201cwater\u201d, etc. However, it is not\nstraightforward to define the vocabulary to label scenes [13]. In our work, we treat the scene label as a latent\nvariable (hence we do not need its ground-truth label or even a vocabulary for defining it) and let\nthe learning algorithm automatically figure out what constitutes a scene. As we will demonstrate\nin the experiments, the \u201cscenes\u201d learned by our model on a collection of sport images do match our\nintuitions, e.g. they roughly correspond to different sport categories in the data.\n\nInspired by the latent SVM [7, 25], we measure the compatibility between an image x and an\nannotation y using the following scoring function:\n\n$$f_{\\theta}(x, y) = \\max_{s \\in S} \\max_{z \\in Z(y)} \\theta^{\\top} \\Phi(x, y, z, s) \\qquad (3)$$\n\nwhere \u03b8 are the model parameters and \u03a6(x, y, z, s) is a feature vector defined on x, y, z and s. 
The\nmodel parameters have three parts \u03b8 = {\u03b1, \u03b2, \u03b3}, and \u03b8\u22a4 \u00b7 \u03a6(x, y, z, s) is defined as:\n\n$$\\theta^{\\top} \\Phi(x, y, z, s) = \\alpha^{\\top} \\phi(x, z) + \\beta^{\\top} \\psi(x, s) + \\gamma^{\\top} \\varphi(y, s) \\qquad (4)$$\n\nThe details of each of the terms in (4) are described in the following.\nRegion-Annotation Matching Potential \u03b1\u22a4\u03c6(x, z): This potential function measures the compatibility\nof mapping image regions to their corresponding annotation terms. Recall an image region\nxi consists of codewords from four different feature types xi = (xi1, xi2, xi3, xi4). Let Nc\n(c = 1, 2, 3, 4) denote the number of codewords of feature type c. The parameters \u03b1 consist of four\ncomponents $\\alpha = \\{\\alpha^c\\}_{c=1}^{4}$ corresponding to each of the four feature types. Each $\\alpha^c$ is a matrix of\nsize Nc \u00d7 V , where an entry $\\alpha^{c}_{w,j}$ can be interpreted as the compatibility between the codeword w\n(1 \u2264 w \u2264 Nc) of feature type c and the annotation term j (1 \u2264 j \u2264 V ). The potential function is\nwritten as:\n\n$$\\alpha^{\\top} \\phi(x, z) = \\sum_{c=1}^{4} \\sum_{i=1}^{R} \\sum_{j=1}^{V} \\alpha^{c}_{x_{ic}, j} \\cdot z_{ij} = \\sum_{c=1}^{4} \\sum_{i=1}^{R} \\sum_{w=1}^{N_c} \\sum_{j=1}^{V} \\alpha^{c}_{w,j} \\cdot 1(x_{ic} = w) \\cdot z_{ij} \\qquad (5)$$\n\nwhere 1(\u00b7) is the indicator function. Note that the definition of this potential function does not\ninvolve y since y is implicitly determined by z, i.e. yj = maxi zij.\nImage-Scene Potential \u03b2\u22a4\u03c8(x, s): This potential function measures the compatibility between an\nimage x and a scene label s. Similarly, the parameters \u03b2 consist of four parts $\\beta = \\{\\beta^c\\}_{c=1}^{4}$ corresponding\nto the four feature types, where an entry $\\beta^{c}_{w,s}$ is the compatibility between the codeword\nw of type c and the scene label s. 
This potential function is written as:\n\n$$\\beta^{\\top} \\psi(x, s) = \\sum_{c=1}^{4} \\sum_{i=1}^{R} \\beta^{c}_{x_{ic}, s} = \\sum_{c=1}^{4} \\sum_{i=1}^{R} \\sum_{w=1}^{N_c} \\sum_{t=1}^{S} \\beta^{c}_{w,t} \\cdot 1(x_{ic} = w) \\cdot 1(s = t) \\qquad (6)$$\n\nAnnotation-Scene Potential \u03b3\u22a4\u03d5(y, s): This potential function measures the compatibility between\nan annotation y and a scene label s. The parameters \u03b3 consist of S components $\\gamma = \\{\\gamma^t\\}_{t=1}^{S}$\ncorresponding to each of the scene labels. Each component $\\gamma^t$ is a V \u00d7 2 matrix, where $\\gamma^{t}_{j,1}$ is the\ncompatibility of setting yj = 1 for the scene label t, and $\\gamma^{t}_{j,0}$ is the compatibility of setting yj = 0\nfor the scene label t. This potential function is written as:\n\n$$\\gamma^{\\top} \\varphi(y, s) = \\sum_{j=1}^{V} \\gamma^{s}_{j, y_j} = \\sum_{j=1}^{V} \\sum_{t=1}^{S} \\left( \\gamma^{t}_{j,0} \\cdot 1(y_j = 0) \\cdot 1(s = t) + \\gamma^{t}_{j,1} \\cdot 1(y_j = 1) \\cdot 1(s = t) \\right) \\qquad (7a)$$\n\n$$= \\sum_{j=1}^{V} \\sum_{t=1}^{S} \\left( \\gamma^{t}_{j,0} \\cdot (1 - y_j) \\cdot 1(s = t) + \\gamma^{t}_{j,1} \\cdot y_j \\cdot 1(s = t) \\right) \\qquad (7b)$$\n\nThe equivalence of (7a) and (7b) is due to 1(yj = 0) \u2261 1 \u2212 yj and 1(yj = 1) \u2261 yj for yj \u2208 {0, 1},\nwhich are easy to verify.\n\n3 Inference\n\nGiven the model parameters \u03b8 = {\u03b1, \u03b2, \u03b3}, the inference problem is to find the best annotation y\u2217\nfor a new image x, i.e. y\u2217 = arg maxy f\u03b8(x, y). 
The inference requires solving the following\noptimization problem:\n\n$$\\max_{y \\in Y} f_{\\theta}(x, y) = \\max_{s \\in S} \\max_{y \\in Y} \\max_{z \\in Z(y)} \\theta^{\\top} \\Phi(x, y, z, s) \\qquad (8)$$\n\nSince we can enumerate all the possible values of the scene label s, the main difficulty of solving\n(8) is the inner maximization over y and z for a fixed s, i.e.:\n\n$$\\max_{y \\in Y} \\max_{z \\in Z(y)} \\theta^{\\top} \\Phi(x, y, z, s) \\qquad (9)$$\n\nIn the following, we develop a method for solving (9) based on linear programming (LP) relaxation. To\nformulate the problem as an LP, we first define the following:\n\n$$a_{ij} = \\sum_{c=1}^{4} \\sum_{w=1}^{N_c} \\alpha^{c}_{w,j} \\, 1(x_{ic} = w), \\ \\forall i, \\forall j; \\qquad b_j = \\gamma^{s}_{j,1} - \\gamma^{s}_{j,0}, \\ \\forall j \\qquad (10)$$\n\nThen it is easy to verify that the optimization problem in (9) can be equivalently written as (the\nconstant in the objective not involving y or z is omitted):\n\n$$\\max_{y, z} \\sum_{i,j} a_{ij} z_{ij} + \\sum_{j} b_j y_j \\quad \\text{s.t.} \\ \\sum_{j} z_{ij} \\leq 1, \\ \\max_{i} z_{ij} = y_j, \\ z_{ij} \\in \\{0, 1\\}, \\ \\forall i, \\forall j \\qquad (11)$$\n\nThe optimization problem (11) is not convex. But we can relax its constraints to make it an LP. First\nwe reformulate (11) as an integer linear program (ILP):\n\n$$\\max_{y, z} \\sum_{i,j} a_{ij} z_{ij} + \\sum_{j} b_j y_j \\quad \\text{s.t.} \\ \\sum_{j} z_{ij} \\leq 1, \\ z_{ij} \\leq y_j \\leq \\sum_{i} z_{ij}, \\ z_{ij} \\in \\{0, 1\\}, \\ y_j \\in \\{0, 1\\}, \\ \\forall i, \\forall j \\qquad (12)$$\n\nIt is easy to verify that (11) and (12) are equivalent. Of course, (12) still has the integrality constraint\nzij \u2208 {0, 1}, which makes the optimization problem NP-hard. 
So we further relax the values of zij and yj\nto real values in the range [0, 1].\nPutting everything together, the LP relaxation of (11) can be written as:\n\n$$\\max_{y, z} \\sum_{i,j} a_{ij} z_{ij} + \\sum_{j} b_j y_j \\quad \\text{s.t.} \\ \\sum_{j} z_{ij} \\leq 1, \\ z_{ij} \\leq y_j \\leq \\sum_{i} z_{ij}, \\ 0 \\leq z_{ij} \\leq 1, \\ 0 \\leq y_j \\leq 1, \\ \\forall i, \\forall j \\qquad (13)$$\n\nAfter solving (13) with any LP solver, we round zij to the closest integer and obtain yj as yj =\nmaxi zij.\n\n4 Learning\n\nWe now describe how to learn the model parameters \u03b8 from a set of N training examples $\\langle x^n, y^n \\rangle$\n(n = 1, 2, ..., N). Note that the training data only contain images and their annotations. We do not\nhave the ground-truth scene label s or the mapping z for any of the training images, so we have to\ntreat them as latent variables during learning.\n\nWe adopt the latent SVM (LSVM) framework [7, 25] for learning. LSVMs extend the popular\nstructural SVMs [18, 19] to handle latent variables during training. LSVMs and their variants have\nbeen successfully applied in several computer vision applications, e.g. object detection [7, 20],\nhuman action recognition [22, 16], human-object interaction [4], objects and attributes [23], human\nposes and actions [24], group activity recognition [9], etc.\nThe latent SVM learns the model parameters \u03b8 by solving the following optimization problem:\n\n$$\\min_{\\theta} \\frac{1}{2} \\|\\theta\\|^2 + C \\sum_{n=1}^{N} \\xi_n \\quad \\text{s.t.} \\ f_{\\theta}(x^n, y^n) - f_{\\theta}(x^n, y) \\geq \\Delta(y, y^n) - \\xi_n, \\ \\forall n, \\forall y \\qquad (14)$$\n\nwhere $\\Delta(y, y^n)$ is a loss function measuring the cost incurred by predicting y when the ground-truth\nannotation is $y^n$. 
We use a simple Hamming loss which decomposes as $\\Delta(y, y^n) = \\sum_{j=1}^{V} \\ell(y_j, y^n_j)$,\nwhere $\\ell(y_j, y^n_j)$ is 1 if $y_j \\neq y^n_j$ and 0 otherwise. Note that our loss function\nonly involves the annotation y, because this is the only ground-truth label we have access to.\n\nThe problem in (14) can be equivalently written as an unconstrained problem:\n\n$$\\min_{\\theta} \\frac{1}{2} \\|\\theta\\|^2 + C \\sum_{n=1}^{N} (L_n - R_n), \\quad \\text{where} \\ L_n = \\max_{y} \\left( \\Delta(y, y^n) + f_{\\theta}(x^n, y) \\right), \\ R_n = f_{\\theta}(x^n, y^n) \\qquad (15)$$\n\nWe use the non-convex bundle optimization in [5] to solve (15). In a nutshell, the algorithm iteratively\nbuilds an increasingly accurate piecewise quadratic approximation to the objective function.\nDuring each iteration, a new linear cutting plane is found via a subgradient of the objective function\nand added to the piecewise quadratic approximation. The key to applying this algorithm to solve\n(15) is computing the two subgradients \u2202\u03b8Ln and \u2202\u03b8Rn for a particular \u03b8, which we describe in\ndetail below.\nFirst we describe how to compute \u2202\u03b8Ln. Let $(y^*, z^*, s^*)$ be the solution to the following optimization\nproblem (called loss-augmented inference in the structural SVM literature):\n\n$$\\max_{s} \\max_{y} \\max_{z \\in Z(y)} \\Delta(y, y^n) + \\theta^{\\top} \\Phi(x^n, y, z, s) \\qquad (16)$$\n\nThen it is easy to show that a subgradient \u2202\u03b8Ln can be calculated as $\\partial_{\\theta} L_n = \\Phi(x^n, y^*, z^*, s^*)$.\nThe loss-augmented inference problem in (16) is similar to the inference problem in (8), except for\nan additional term $\\Delta(y, y^n)$. We can modify the LP relaxation method in Sec. 3 to solve (16) for a\nfixed s (and enumerate s to get the final solution). 
First of all, it is easy to verify that $\\ell(y_j, y^n_j)$ can\nbe re-formulated as:\n\n$$\\ell(y_j, y^n_j) \\equiv \\begin{cases} 1 - y_j & \\text{if } y^n_j = 1 \\\\ y_j & \\text{if } y^n_j = 0 \\end{cases} \\qquad (17)$$\n\nUsing (17), it is easy to show that if we re-define bj as below, the ILP in (12) will solve the loss-augmented\ninference (16) for a fixed s:\n\n$$b_j = \\begin{cases} \\gamma^{s}_{j,1} - \\gamma^{s}_{j,0} - 1 & \\text{if } y^n_j = 1 \\\\ \\gamma^{s}_{j,1} - \\gamma^{s}_{j,0} + 1 & \\text{if } y^n_j = 0 \\end{cases} \\qquad (18)$$\n\nSimilarly, we can relax the problem to an LP using the same method in Sec. 3.\nNow we describe how to compute \u2202\u03b8Rn. Let $(z^{\\star}, s^{\\star})$ be the solution to the following optimization\nproblem: $\\max_{s} \\max_{z \\in Z(y^n)} \\theta^{\\top} \\Phi(x^n, y^n, z, s)$. Then it can be shown that a subgradient \u2202\u03b8Rn can be\ncalculated as $\\partial_{\\theta} R_n = \\Phi(x^n, y^n, z^{\\star}, s^{\\star})$. For a fixed s, it is easy to show that the maximization over\nz can be solved by the following ILP, obtained from (12) by fixing y = $y^n$:\n\n$$\\max_{z} \\sum_{i,j} a_{ij} z_{ij} \\quad \\text{s.t.} \\ \\sum_{j} z_{ij} \\leq 1, \\ \\forall i; \\ z_{ij} \\leq y^n_j \\leq \\sum_{i} z_{ij}, \\ z_{ij} \\in \\{0, 1\\}, \\ \\forall i, \\forall j \\qquad (19)$$\n\nSimilarly, we can solve (19) via LP relaxation by replacing the integrality constraint zij \u2208 {0, 1} with\na linear constraint 0 \u2264 zij \u2264 1.\n\n5 Experiments\n\nWe test our model on the UIUC sport dataset [11]. It contains images collected from eight sport\nclasses: badminton, bocce, croquet, polo, rock climbing, rowing, sailing and snowboarding. Each\nimage is annotated with a set of tags denoting the objects in it. We remove annotation terms occurring\nfewer than three times. We randomly choose half of the data as the test set. From the other half,\nwe randomly select 50 images from each class to form the validation set. 
The remaining data are\nused as the training set.\n\nWe feed the training images and associated annotations (but not the ground-truth sport category\nlabels) to our learning algorithm and set the number of latent scene labels to be eight (i.e. the\nnumber of sport classes). We initialize the parameters of our model as follows. First we cluster the\ntraining images into eight clusters using the following method. For each training image, we construct\na feature vector from the visual information of the image itself and the textual information of its\nannotation. The visual information is simply the concatenation of visual word counts from all the\nregions in the image (normalized between 0 and 1), i.e. the dimensionality of the visual feature is\n$\\sum_{c=1}^{4} N_c$. The textual information is the 0-1 vector of the annotation, i.e. the dimensionality is V .\nWe then run k-means clustering based on the combined visual and textual features to cluster training\nimages into eight clusters. We use the cluster membership of each training image as the initial\nguess of the scene label s (which we call pseudo-scene label). We then initialize the parameters\n\u03b2 by examining the co-occurrence counts of visual words and pseudo-scene labels on the training\ndata. Similarly, we initialize the parameters \u03b3 by the co-occurrence counts of annotation terms and\npseudo-scene labels. The parameters \u03b1 are initialized by the co-occurrence counts of visual words\nand annotation terms with the mapping constraints ignored.\n\nWe compare our model with a baseline method which is a set of linear SVMs separately trained\nfor predicting the 0/1 output of each annotation term based on the feature vector from the visual\ninformation. Following [21], we use the F-measure to measure the annotation performance. The\ncomparison is shown in Table 1(a). Our model outperforms the baseline SVM method (F-measure 0.4552 versus 0.4112). We also list\nthe published result of [21] in the table. 
However, it is important to remember that it is not directly\ncomparable to other numbers in Table 1(a), since [21] uses different image features and different\nsubsets of the dataset unspecified in the paper. We visualize some results on the test data in Fig. 5.\n\nThe scene labels s produced by our model for the test images can be considered as a clustering of\nthe scenes in those images. We can measure the quality of the scene clustering by comparing with\nthe ground-truth scene labels (i.e. sport categories) of the test images. For comparison, we consider\nthree baseline algorithms. The first baseline algorithm is to run k-means clustering on the test data\nbased on the visual features. However, the comparison to this baseline algorithm is not completely\nfair, since the baseline does not exploit any information from the annotations on the training data.\nSo we define two other baseline algorithms that use this extra information.\n\nFigure 3: Visualization of \u03b3 parameters. Each plot corresponds to a scene label s; we show the weights of the top\nfive components of $\\gamma^{s}_{j,1}$ over all j \u2208 {1..V } (y-axis) and the corresponding annotation terms (x-axis).\n\n[Figure 4 image omitted; panels: athlete, ceiling, floor, grass, rowboat, sailboat, sky, sun, tree, water.]\n\nFigure 4: Visualization of the \u201cposition\u201d components of the \u03b1 parameters for some annotation terms. Bright\nareas correspond to high values.\n\nFor the second baseline algorithm (which we call pseudo-label+SVM), we run k-means clustering\non both training and validation data. We use both visual features and textual features for the clustering.\nAfter running k-means clustering, we assign a pseudo-label to each image in the training\nor validation set by its cluster membership. Then we train a multi-class SVM based on the visual\nfeatures of the training images and their pseudo-labels. 
The parameters of the SVM classifier are\nchosen by validating on the validation images (visual features only) with their pseudo-labels. For a\ntest image, we use the trained SVM classifier to assign a pseudo-label based on the visual feature of\nthis image. The predicted pseudo-labels of test images serve as a clustering of those images.\n\nFor the third baseline algorithm (which we call pseudo-annotation+K-means), we first train separate\nSVM classifiers to predict the annotation from the visual feature, using the ground-truth annotations\nof the validation set to choose the free parameters in SVM classifiers. For a set of test images,\nwe use the trained SVM classifiers to predict their associated annotations (which we call pseudo-annotations).\nThen we run k-means to cluster those test images based on both visual features and\ntextual features. The textual features are obtained from the pseudo-annotations.\n\nWe use the normalized mutual information (NMI) [15] to quantitatively measure the clustering results.\nLet \u2126 = {\u03c91, \u03c92, ..., \u03c9K} be a set of clusters, and D = {d1, d2, ..., dK} be the set of ground-truth\ncategories. The NMI is defined as $\\mathrm{NMI}(\\Omega, D) = \\frac{I(\\Omega; D)}{[H(\\Omega) + H(D)]/2}$, where I(\u00b7) and H(\u00b7) are the\nmutual information and the entropy, respectively. NMI attains its minimum of 0 when the clustering is random\nwith respect to the ground truth. Higher NMI means better clustering results. The comparison is\nshown in Table 1(b). Our model outperforms all three baseline methods on the scene clustering task (NMI 0.5295 versus 0.4134, 0.3267 and 0.2227).\n\nWe can visualize some of the parameters to get insights about the learned model. For a particular\nscene label s, the parameter $\\gamma^{s}_{j,1}$ measures the compatibility of setting the j-th annotation term\nactive for the scene label s. We sort the annotation terms according to $\\gamma^{s}_{j,1}$. 
In Fig. 3, we visualize\nthe top five annotation terms for each of the eight possible values of s. Intuitively, these eight scene\nclusters obtained from our model seem to match well to the eight different sport categories of this\ndataset. We also visualize the \u201cposition\u201d (i.e. c = 4) components of the \u03b1 parameters (Fig. 4) for\nseveral annotation terms as follows. For a particular annotation term j, we find the most preferred\n\u201cposition\u201d visual word w\u2217 for this annotation term by $w^* = \\arg\\max_{w} \\alpha^{4}_{w,j}$. The cluster center of\nthe visual word w\u2217 defines an 8 \u00d7 8 position mask of image locations (see [14]), which is visualized\nin Fig. 4. We can see that the learned \u03b1 parameters make intuitive sense, e.g. \u201cwater\u201d is preferred at\nthe bottom of the image, while \u201csky\u201d is preferred at the top of the image.\n\n(a) Image annotation\nmethod | F-measure\nour approach | 0.4552\nSVM | 0.4112\n[21] | 0.3500\n\n(b) Scene clustering\nmethod | NMI\nour approach | 0.5295\npseudo-label + SVM | 0.4134\npseudo-annotation + K-means | 0.3267\nK-means | 0.2227\n\nTable 1: Comparison of image annotation (a) and scene clustering (b). The number of clusters is set to be eight\nfor all methods. See the text for more descriptions.\n\nFigure 5: (Best viewed in color) Results of annotation and segmentation on the UIUC sport dataset. Different\nannotation terms are shown in different colors. Image regions mapped to an annotation term are overlaid with\nthe color corresponding to that annotation term.\n\n6 Conclusion\n\nWe have presented a discriminatively trained latent model for capturing the relationships among\nimage regions, textual annotations, and overall scenes. Our ultimate goal is to achieve total scene\nunderstanding from cheaply available Internet data. Although most previous work in scene understanding\nfocuses on generative probabilistic models (e.g. 
[1, 3, 11, 12, 21]), this paper offers an\nalternative path towards this goal via a discriminative framework. We believe discriminative methods\noffer a complementary advantage over generative ones. Certain relationships (e.g. the mapping\nbetween image regions and annotation terms) are hard to model, hence largely ignored in the generative\napproaches. But those relationships are easy to incorporate in a max-margin discriminative\napproach like ours.\nIn this work we have provided evidence that modeling these relationships can improve image annotation.\nOur work provides a general solution that can be broadly applied in other applications\ninvolving mapping relationships, e.g. YouTube videos with annotations, movie clips with captions,\nface detection with person names, etc. There are many open issues to address in future research:\n(1) extending our model to handle a richer set of annotation terms (nouns, verbs, adjectives, etc.) by\nmodifying the many-to-one correspondence assumption; (2) exploring the use of this model with\nnoisier annotation data (e.g. raw Flickr or YouTube tags); (3) exploiting the linguistic structure of\ntags.\n\nReferences\n\n[1] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107\u20131135, 2003.\n\n[2] K. Barnard and Q. Fan. Reducing correspondence ambiguity in loosely labeled training data. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.\n\n[3] T. L. Berg, A. C. Berg, J. Edwards, and D. Forsyth. Who\u2019s in the picture. In Advances in Neural Information Processing Systems, volume 17, pages 137\u2013144. MIT Press, 2004.\n\n[4] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for static human-object interactions. In Workshop on Structured Models in Computer Vision, 2010.\n\n[5] T.-M.-T. Do and T. Artieres. 
Large margin training for hidden Markov models with partially observed states. In International Conference on Machine Learning, 2009.\n\n[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303\u2013338, 2010.\n\n[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.\n\n[8] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 2004.\n\n[9] T. Lan, Y. Wang, W. Yang, and G. Mori. Beyond actions: Discriminative models for contextual group activities. In Advances in Neural Information Processing Systems. MIT Press, 2010.\n\n[10] J. Li and J. Z. Wang. Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1075\u20131088, September 2003.\n\n[11] L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.\n\n[12] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.\n\n[13] N. Loeff and A. Farhadi. Scene discovery by matrix factorization. In European Conference on Computer Vision, 2008.\n\n[14] T. Malisiewicz and A. A. Efros. Recognition by association via learning per-exemplar distances. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.\n\n[15] C. D. Manning. Introduction to Information Retrieval. Cambridge University Press, 2008.\n\n[16] J. C. Niebles, C.-W. 
Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In European Conference on Computer Vision, 2010.\n\n[17] R. Socher and L. Fei-Fei. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.\n\n[18] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2004.\n\n[19] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453\u20131484, 2005.\n\n[20] A. Vedaldi and A. Zisserman. Structured output regression for detection with partial truncation. In Advances in Neural Information Processing Systems. MIT Press, 2009.\n\n[21] C. Wang, D. Blei, and L. Fei-Fei. Simultaneous image classification and annotation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.\n\n[22] Y. Wang and G. Mori. Max-margin hidden conditional random fields for human action recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.\n\n[23] Y. Wang and G. Mori. A discriminative latent model of object classes and attributes. In European Conference on Computer Vision, 2010.\n\n[24] W. Yang, Y. Wang, and G. Mori. Recognizing human actions from still images with latent poses. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.\n\n[25] C.-N. Yu and T. Joachims. Learning structural SVMs with latent variables. In International Conference on Machine Learning, 2009.\n", "award": [], "sourceid": 97, "authors": [{"given_name": "Yang", "family_name": "Wang", "institution": null}, {"given_name": "Greg", "family_name": "Mori", "institution": null}]}