{"title": "Learning Visual Attributes", "book": "Advances in Neural Information Processing Systems", "page_first": 433, "page_last": 440, "abstract": null, "full_text": "Learning Visual Attributes\n\nVittorio Ferrari \u2217\n\nUniversity of Oxford (UK)\n\nAndrew Zisserman\n\nUniversity of Oxford (UK)\n\nAbstract\n\nWe present a probabilistic generative model of visual attributes, together with an ef\ufb01cient\nlearning algorithm. Attributes are visual qualities of objects, such as \u2018red\u2019, \u2018striped\u2019, or\n\u2018spotted\u2019. The model sees attributes as patterns of image segments, repeatedly sharing some\ncharacteristic properties. These can be any combination of appearance, shape, or the layout\nof segments within the pattern. Moreover, attributes with general appearance are taken\ninto account, such as the pattern of alternation of any two colors which is characteristic\nfor stripes. To enable learning from unsegmented training images, the model is learnt\ndiscriminatively, by optimizing a likelihood ratio.\nAs demonstrated in the experimental evaluation, our model can learn in a weakly supervised\nsetting and encompasses a broad range of attributes. We show that attributes can be learnt\nstarting from a text query to Google image search, and can then be used to recognize the\nattribute and determine its spatial extent in novel real-world images.\n\n1 Introduction\nIn recent years, the recognition of object categories has become a major focus of computer vision and\nhas shown substantial progress, partly thanks to the adoption of techniques from machine learning\nand the development of better probabilistic representations [1, 3]. The goal has been to recognize\nobject categories, such as a \u2018car\u2019, \u2018cow\u2019 or \u2018shirt\u2019. However, an object also has many other qualities\napart from its category. A car can be red, a shirt striped, a ball round, and a building tall. 
These visual attributes are important for understanding object appearance and for describing objects to other people. Figure 1 shows examples of such attributes. Automatic learning and recognition of attributes can complement category-level recognition and therefore improve the degree to which machines perceive visual objects. Attributes also open the door to appealing applications, such as more specific queries in image search engines (e.g. a spotted skirt, rather than just any skirt). Moreover, as different object categories often have attributes in common, modeling them explicitly allows part of the learning task to be shared amongst categories, or allows previously learnt knowledge about an attribute to be transferred to a novel category. This may reduce the total number of training images needed and improve robustness. For example, learning the variability of zebra stripes under non-rigid deformations tells us a lot about the corresponding variability in striped shirts.\n\nIn this paper we propose a probabilistic generative model of visual attributes, and a procedure for learning its parameters from real-world images. When presented with a novel image, our method infers whether it contains the learnt attribute and determines the region it covers. The proposed model encompasses a broad range of attributes, from simple colors such as 'red' or 'green' to complex patterns such as 'striped' or 'checked'. Both the appearance and the shape of pattern elements (e.g. a single stripe) are explicitly modeled, along with their layout within the overall pattern (e.g. adjacent stripes are parallel). This enables our model to cover attributes defined by appearance ('red'), by shape ('round'), or by both (the black-and-white stripes of zebras). 
Furthermore, the model takes into account attributes with general appearance, such as stripes, which are characterized by a pattern of alternation ABAB of any two colors A and B, rather than by a specific combination of colors.\n\nSince appearance, shape, and layout are modeled explicitly, the learning algorithm gains an understanding of the nature of the attribute. As another attractive feature, our method can learn in a weakly supervised setting, given images labeled only by the presence or absence of the attribute,\n\n*This research was supported by the EU project CLASS. The authors thank Dr. Josef Sivic for fruitful discussions and helpful comments on this paper.\n\n(Figure 1 panel labels: unary - red, round; binary - black/white stripes, generic stripes)\n\nFigure 1: Examples of different kinds of attributes. On the left we show two simple attributes, whose characteristic properties are captured by individual image segments (appearance for red, shape for round). On the right we show more complex attributes, whose basic element is a pair of segments.\n\nwithout indication of the image region it covers. The presence/absence labels can be noisy, as the training method can tolerate a considerable number of mislabeled images. This enables attributes to be learnt directly from a text specification by collecting training images using a web image search engine, such as Google-images, and querying on the name of the attribute.\n\nOur approach is inspired by the ideas of Jojic and Caspi [4], where patterns have constant appearance within an image, but are free to change to another appearance in other images. We also follow the generative approach to learning a model from a set of images used by many authors, for example LOCUS [10]. Our parameter learning is discriminative - the benefits of this have been shown before, for example for training the constellation model of [3]. 
In terms of functionality, the closest works to ours are those on the analysis of regular textures [5, 6]. However, they work with textures covering the entire image and focus on finding distinctive appearance descriptors. In contrast, here textures are attributes of objects, and therefore appear in complex images containing many other elements. Very few previous works appeared in this setting [7, 11]. The approach of [7] focuses on colors only, while in [11] attributes are limited to individual regions. Our method also encompasses patterns defined by pairs of regions, allowing us to capture more complex attributes. Moreover, we take up the additional challenge of learning the pattern geometry.\n\nBefore describing the generative model in section 3, in the next section we briefly introduce image segments, the elementary units of measurement observed in the model.\n\n2 Image segments - basic visual representation\n\nThe basic units in our attribute model are image segments extracted using the algorithm of [2]. Each segment has a uniform appearance, which can be either a color or a simple texture (e.g. sand, grain). Figure 2a shows a few segments from a typical image.\n\nInspired by the success of simple patches as a basis for appearance descriptors [8, 9], we randomly sample a large number of 5 x 5 pixel patches from all training images and cluster them using k-means [8]. The resulting cluster centers form a codebook of patch types. Every pixel is soft-assigned to the patch types. A segment is then represented as a normalized histogram over the patch types of the pixels it contains. By clustering the segment histograms from the training images we obtain a codebook A of appearances (figure 2b). 
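As a concrete illustration, the patch-type clustering and per-segment histograms just described can be sketched as follows. This is a toy numpy version under stated assumptions: a minimal k-means stands in for [8], pixels are hard-assigned to patch types where the paper soft-assigns, and all function names and array shapes are ours:

```python
import numpy as np

# Toy sketch of section 2: cluster flattened 5x5 patches into 'patch types'
# with a minimal k-means, then describe a segment as a normalized histogram
# over the patch types of its pixels. Shapes and names are illustrative.
def build_patch_codebook(patches, k=32, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = patches[rng.choice(len(patches), k, replace=False)].copy()
    for _ in range(iters):
        # assign every patch to its nearest center, then recompute centers
        d = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = patches[labels == j].mean(0)
    return centers

def segment_descriptor(patches_in_segment, centers):
    # hard-assign each pixel's patch to a patch type (the paper soft-assigns)
    d = ((patches_in_segment[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(1), minlength=len(centers)).astype(float)
    return hist / hist.sum()  # normalized histogram over patch types
```

Clustering the resulting segment descriptors once more would then yield the appearance codebook A of the paper.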
Each entry in the codebook is a prototype segment descriptor, representing the appearance of a subset of the segments from the training set.\n\nEach segment s is then assigned the appearance a ∈ A with the smallest Bhattacharyya distance to the histogram of s. In addition to appearance, various geometric properties of a segment are measured, summarizing its shape. In our current implementation, these are: curvedness, compactness, elongation (figure 2c), fractal dimension and area relative to the image. We also compute two properties of pairs of segments: relative orientation and relative area (figure 2d).\n\nFigure 2: Image segments as visual features. a) An image with a few segments overlaid, including two pairs of adjacent segments on a striped region. b) Each row is an entry from the appearance codebook A (i.e. one appearance; only 4 out of 32 are shown). The three most frequent patch types for each appearance are displayed. Two segments from the stripes are assigned to the white and black appearance respectively (arrows). c) Geometric properties of a segment: curvedness, which is the ratio between the number of contour points C with curvature above a threshold and the total perimeter P; compactness; and elongation, which is the ratio between the minor and major moments of inertia. d) Relative geometric properties of a pair of segments: relative area and relative orientation. Notice how these measures are not symmetric (e.g. relative area is the area of the first segment w.r.t. the second).\n\n3 Generative models for visual attributes\n\nFigure 1 shows various kinds of attributes. Simple attributes are entirely characterized by properties of a single segment (unary attributes). Some unary attributes are defined by their appearance, such as colors (e.g. 
red, green) and basic textures (e.g. sand, grainy). Other unary attributes are defined by a segment shape (e.g. round). All red segments have similar appearance, regardless of shape, while all round segments have similar shape, regardless of appearance. More complex attributes have a basic element composed of two segments (binary attributes). One example is the black/white stripes of a zebra, which are composed of pairs of segments sharing similar appearance and shape across all images. Moreover, the layout of the two segments is characteristic as well: they are adjacent, nearly parallel, and have comparable area. Going yet further, a general stripe pattern can have any appearance (e.g. blue/white stripes, red/yellow stripes). However, the pairs of segments forming a stripe pattern in one particular image must have the same appearance. Hence, a characteristic of general stripes is a pattern of alternation ABABAB. In this case, appearance is common within an image, but not across images.\n\nThe attribute models we present in this section encompass all aspects discussed above. Essentially, attributes are found as patterns of repeated segments, or pairs of segments, sharing some properties (geometric and/or appearance and/or layout).\n\n3.1 Image likelihood\n\nWe start by describing how the model M explains a whole image I. An image I is represented by a set of segments {s}. A latent variable f is associated with each segment, taking the value f = 1 for a foreground segment, and f = 0 for a background segment. Foreground segments are those on the image area covered by the attribute. We collect f for all segments of I into the vector F. An image has a foreground appearance a, shared by all the foreground segments it contains. The likelihood of an image is\n\np(I|M; F, a) = ∏_{x∈I} p(x|M; F, a)    (1)\n\nwhere x is a pixel, and M are the model parameters. 
These include α ⊂ A, the set of appearances allowed by the model, from which a is taken. The other parameters are used to explain segments and are discussed below. The probability of pixels is uniform within a segment, and independent across segments:\n\np(x|M; F, a) = p(s_x|M; f, a)    (2)\n\nwith s_x the segment containing x. Hence, the image likelihood can be expressed as a product over the probability of each segment s, counted by its area N_s (i.e. the number of pixels it contains)\n\np(I|M; F, a) = ∏_{x∈I} p(s_x|M; f, a) = ∏_{s∈I} p(s|M; f, a)^{N_s}    (3)\n\nFigure 3: a) Graphical model for unary attributes. D is the number of images in the dataset, S_i is the number of segments in image i, and G is the total number of geometric properties considered (both active and inactive). b) Graphical model for binary attributes. c is a pair of segments. Φ_1, Φ_2 are the geometric distributions for each segment in a pair. Ψ are relative geometric distributions (i.e. they measure properties between two segments in a pair, such as relative orientation), and there are R of them in total (active and inactive). δ is the adjacency model parameter. It tells whether only adjacent pairs of segments are considered (so p(c|δ = 1) is one iff c is a pair of adjacent segments).\n\nNote that F and a are latent variables associated with a particular image, so there is a different F and a for each image. In contrast, a single model M is used to explain all images.\n\n3.2 Unary attributes\n\nSegments are the only observed variables in the unary model. A segment s = (s_a, {s_g^j}) is defined by its appearance s_a and shape, captured by a set of geometric measurements {s_g^j}, such as elongation and curvedness. 
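For concreteness, two of these geometric measurements can be sketched as follows. This is a simplified numpy version operating on a binary segment mask; elongation follows the moment-of-inertia definition of figure 2c, while the paper does not spell out its compactness formula, so the isoperimetric form below is our assumption:

```python
import numpy as np

# Illustrative shape measurements for a segment given as a boolean mask:
# elongation = minor/major moment-of-inertia ratio (low = elongated, as in
# figure 2c), and a standard isoperimetric compactness (our assumption).
def segment_geometry(mask):
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs - xs.mean(), ys - ys.mean()])
    evals = np.sort(np.linalg.eigvalsh(np.cov(coords)))
    elongation = evals[0] / evals[1]
    # perimeter: foreground pixels with at least one background 4-neighbour
    p = np.pad(mask, 1)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    perimeter = (mask & ~interior).sum()
    compactness = 4 * np.pi * mask.sum() / max(perimeter, 1) ** 2
    return elongation, compactness
```

A long thin stripe segment yields an elongation close to 0, a square segment close to 1.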
The graphical model in figure 3a illustrates the conditional probability of image segments\n\np(s|M; f, a) = { p(s_a|a) · ∏_j p(s_g^j|Φ^j)^{v^j}  if f = 1;  β  if f = 0 }    (4)\n\nThe likelihood for a segment depends on the model parameters M = (α, β, {λ^j}), which specify a visual attribute. For each geometric property λ^j = (Φ^j, v^j), the model defines its distribution Φ^j over the foreground segments and whether the property is active or not (v^j = 1 or 0). Active properties are relevant for the attribute (e.g. elongation is relevant for stripes, while orientation is not) and contribute substantially to its likelihood in (4). Inactive properties instead have no impact on the likelihood (exponentiation by 0). It is the task of the learning stage to determine which properties are active and their foreground distribution.\n\nThe factor p(s_a|a) = [s_a = a] is 1 for segments having the foreground appearance a for this image, and 0 otherwise (thus it acts as a selector). The scalar value β represents a simple background model: all segments assigned to the background have likelihood β. During inference and learning we want to maximize the likelihood of an image given the model over F, which is achieved by setting f to foreground when the f = 1 case of equation (4) is greater than β.\n\nAs an example, we give the ideal model parameters for the attribute 'red'. α contains the red appearance only. β is some low value, corresponding to how likely it is for non-red segments to be assigned the red appearance. No geometric property {λ^j} is active (i.e. all v^j = 0).\n\n3.3 Binary attributes\n\nThe basic element of binary attributes is a pair of segments. In this section we extend the unary model to describe pairs of segments. 
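Before moving to pairs, the unary segment likelihood of equation (4), and the foreground/background decision it induces, can be sketched in code. This is a toy version under stated assumptions: the geometric distributions Φ^j are reduced to hypothetical lookup tables, and all names are ours:

```python
# Toy version of equation (4): a segment is explained as foreground with
# probability [s_a = a] * prod_j p(s_g^j | Phi^j)^{v^j}, or as background
# with probability beta. model['props'] pairs a lookup table (a stand-in
# for the distribution Phi^j) with its activation flag v^j.
def unary_segment_likelihood(seg_appearance, seg_geometry, model, image_appearance):
    fg = 1.0 if seg_appearance == image_appearance else 0.0  # selector [s_a = a]
    for value, (dist, active) in zip(seg_geometry, model['props']):
        if active:  # inactive properties have no impact (exponent 0)
            fg *= dist.get(value, 1e-6)
    # f is set to foreground exactly when the f = 1 case beats beta
    return (fg, 1) if fg > model['beta'] else (model['beta'], 0)
```

With the ideal 'red' model of the running example (no active geometric property), a red segment scores 1 and is labeled foreground, while any other segment falls back to β.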
In addition to duplicating the unary appearance and geometric properties, the extended model includes pairwise properties which do not apply to individual segments. In the graphical model of figure 3b, these are relative geometric properties γ (area, orientation) and adjacency δ, which together specify the layout of the attribute. For example, the orientation of a segment with respect to the other can capture the parallelism of subsequent stripe segments. Adjacency expresses whether the two segments in the pair are adjacent (like in stripes) or not (like the maple leaf and the stripes in the Canadian flag). We consider two segments adjacent if they share part of the boundary. A pattern characterized by adjacent segments is more distinctive, as it is less likely to occur accidentally in a negative image.\n\nSegment likelihood. An image is represented by a set of segments {s}, and the set of all possible pairs of segments {c}. The image likelihood p(I|M; F, a) remains as defined in equation (3), but now a = (a1, a2) specifies two foreground appearances, one for each segment in the pair. The likelihood of a segment s is now defined as the maximum over all pairs containing it\n\np(s|M; f, a) = { max_{c|s∈c} p(c|M, a)  if f = 1;  β  if f = 0 }    (5)\n\nPair likelihood. The observed variables in our model are segments s and pairs of segments c. A pair c = (s1, s2, {c_r^k}) is defined by two segments s1, s2 and their relative geometric measurements {c_r^k} (relative orientation and relative area in our implementation). The likelihood of a pair given the model is\n\np(c|M, a) = p(s_{1,a}, s_{2,a}|a) · ∏_j p(s_{1,g}^j|Φ_1^j)^{v_1^j} · p(s_{2,g}^j|Φ_2^j)^{v_2^j} · ∏_k p(c_r^k|Ψ^k)^{v_r^k} · p(c|δ)    (6)\n\nwhere the three groups of factors capture appearance, shape, and layout respectively. The binary model parameters M = (α, β, δ, {λ_1^j}, {λ_2^j}, {γ^k}) control the behavior of the pair likelihood. The two sets of λ_i^j = (Φ_i^j, v_i^j) are analogous to their counterparts in the unary model, and define the geometric distributions and their associated activation states for each segment in the pair respectively. The layout part of the model captures the interaction between the two segments in the pair. For each relative geometric property γ^k = (Ψ^k, v_r^k) the model gives its distribution Ψ^k over pairs of foreground segments and its activation state v_r^k. The model parameter δ determines whether the pattern is composed of pairs of adjacent segments (δ = 1) or just any pair of segments (δ = 0). The factor p(c|δ) is defined as 0 iff δ = 1 and the segments in c are not adjacent, while it is 1 in all other cases (so, when δ = 1, p(c|δ) acts as a pair selector). The appearance factor p(s_{1,a}, s_{2,a}|a) = [s_{1,a} = a1 ∧ s_{2,a} = a2] is 1 when the two segments have the foreground appearances a = (a1, a2) for this image.\n\nAs an example, the model for a general stripe pattern is as follows. α = (A, A) contains all pairs of appearances from A. The geometric properties λ_1^elong, λ_1^curv are active (v_1^j = 1) and their distributions Φ_1^j peaked at high elongation and low curvedness. The corresponding properties {λ_2^j} have similar values. The layout parameters are δ = 1, and γ^rel_area, γ^rel_orient are active and peaked at 0 (expressing that the two segments are parallel and have the same area). Finally, β is a value very close to 0, as the probability of a random segment under this complex model is very low.\n\n4 Learning the model\n\nImage Likelihood. 
The image likelihood defined in (3) depends on the foreground/background labels F and on the foreground appearance a. Computing the complete likelihood, given only the model M, involves maximizing over the appearances a ∈ α allowed by the model, and over F:\n\np(I|M) = max_{a∈α} max_F p(I|M; F, a)    (7)\n\nThe maximization over F is easily achieved by setting each f to the greater of the two cases in equation (4) (equation (5) for a binary model). The maximization over a requires trying out all allowed appearances α. This is computationally inexpensive, as typically there are about 32 entries in the appearance codebook.\n\nTraining data. We learn the model parameters in a weakly supervised setting. The training data consists of positive images I+ = {I_+^i} and negative images I- = {I_-^i}. While many of the positive images contain examples of the attribute to be learnt (figure 4), a considerable proportion don't. Conversely, some of the negative images do contain the attribute. Hence, we must operate under a weak assumption: the attribute occurs more frequently in positive training images than in negative ones. Moreover, only the (unreliable) image label is given, not the location of the attribute in the image. As demonstrated in section 5, our approach is able to learn from this noisy training data.\n\nAlthough our attribute models are generative, learning them in a discriminative fashion greatly helps given the challenges posed by the weakly supervised setting. For example, in figure 4 most of the overall surface for images labeled 'red' is actually white. Hence, a maximum likelihood estimator over the positive training set alone would learn white, not red. A discriminative approach instead\n\n(Figure 4 panels: positive training images; negative training images)\n\nFigure 4: Advantages of discriminative training. The task is to learn the attribute 'red'. 
Although the most frequent color in the positive training images is white, white is also common across the negative set.\n\nnotices that white occurs frequently also on the negative set, and hence correctly picks up red, as it is most discriminative for the positive set. Formally, the task of learning is to determine the model parameters M that maximize the likelihood ratio\n\np(I+|M) / p(I-|M) = ∏_{I_+^i ∈ I+} p(I_+^i|M) / ∏_{I_-^i ∈ I-} p(I_-^i|M)    (8)\n\nLearning procedure. The parameters of the binary model are M = (α, β, δ, {λ_1^j}, {λ_2^j}, {γ^k}), as defined in the previous sections. Since the binary model is a superset of the unary one, we only explain here how to learn the binary case. The procedure for the unary model is derived analogously. In our implementation, α can contain either a single appearance, or all appearances in the codebook A. The former case covers attributes such as colors, or patterns with specific colors (such as zebra stripes). The latter case covers generic patterns, as it allows each image to pick a different appearance a ∈ α, while at the same time it properly constrains all segments/pairs within an image to share the same appearance (e.g. subsequent pairs of stripe segments have the same appearance, forming a pattern of alternation ABABAB). Because of this definition, α can take on (1 + |A|)^2/2 different values (sets of appearances). As typically a codebook of |A| ≤ 32 appearances is sufficient to model the data, we can afford exhaustive search over all possible values of α. The same goes for δ, which can only take on two values.\n\nGiven a fixed α and δ, the learning task reduces to estimating the background probability β, and the geometric properties {λ_1^j}, {λ_2^j}, {γ^k}. 
To achieve this, we need to determine the latent variable F for each training image, as it is necessary for estimating the geometric distributions over the foreground segments. These are in turn necessary for estimating β. Given β and the geometric properties we can estimate F (equation (6)). This particular circular dependence in the structure of our model suggests a relatively simple and computationally cheap approximate optimization algorithm:\n\n1. For each I ∈ {I+ ∪ I-}, estimate an initial F and a via equation (7), using an initial β = 0.01, and no geometry (i.e. all activation variables set to 0).\n\n2. Estimate all geometric distributions Φ_1^j, Φ_2^j, Ψ^k over the foreground segments/pairs from all images, according to the initial estimates {F}.\n\n3. Estimate β and the geometric activations v iteratively:\n\n(a) Update β as the average probability of segments from I-. This is obtained using the foreground expression of (5) for all segments of I-.\n\n(b) Activate the geometric property which most increases the likelihood ratio (8) (i.e. set the corresponding v to 1). Stop iterating when no property increases (8).\n\n4. The above steps already yield a reasonable estimate of all model parameters. We use it as initialization for the following EM-like iteration, which refines β and Φ_1^j, Φ_2^j, Ψ^k:\n\n(a) Update {F} given the current β and geometric properties (set each f to maximize (5)).\n\n(b) Update Φ_1^j, Φ_2^j, Ψ^k given the current {F}.\n\n(c) Update β over I- using the current Φ_1^j, Φ_2^j, Ψ^k.\n\nThe algorithm is repeated over all possible α and δ, and the model maximizing (8) is selected. Notice how β is continuously re-estimated as more geometric properties are added. 
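A drastically simplified sketch of step 3, the alternation between re-estimating β and greedily activating properties, may help make the procedure concrete. In this toy version, segments are reduced to dicts of property values with a fixed appearance factor, and the distributions Φ to density-like lookup tables (values may exceed 1, as for a peaked continuous density); all names are hypothetical:

```python
import math

# Toy sketch of step 3: alternately (a) re-estimate beta as the average
# probability of negative segments, and (b) activate the geometric property
# that most increases the likelihood ratio (8). A segment is a dict with a
# base appearance likelihood 'app' plus one value per property; props maps
# property names to density-like foreground tables standing in for the Phi's.
def learn_toy_model(pos_segments, neg_segments, props):
    beta, active = 0.01, []

    def fg_lik(seg, act):
        lik = seg['app']
        for name in act:
            lik *= props[name].get(seg[name], 1e-6)
        return lik

    def log_ratio(act, b):
        # each segment contributes the greater of its foreground and
        # background explanations, mirroring equation (4)
        lp = sum(math.log(max(fg_lik(s, act), b)) for s in pos_segments)
        ln = sum(math.log(max(fg_lik(s, act), b)) for s in neg_segments)
        return lp - ln

    while True:
        beta = sum(fg_lik(s, active) for s in neg_segments) / len(neg_segments)
        candidates = [p for p in props if p not in active]
        if not candidates:
            break
        best = max(candidates, key=lambda p: log_ratio(active + [p], beta))
        if log_ratio(active + [best], beta) <= log_ratio(active, beta):
            break  # no property increases (8): stop
        active.append(best)
    return beta, active
```

On toy data where positive segments are elongated and negative ones are not, this activates only the discriminative property and drives β down as the model sharpens.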
This implicitly offers to the selector the probability of an average negative segment under the current model as an up-to-date baseline for comparison. It prevents the model from overspecializing, as it pushes it to only pick up properties which distinguish positive segments/pairs from negative ones.\n\n(Figure 5 panels - columns: Segment 1, Segment 2, Layout; axes: elongation, curvedness, area, compactness, relative orientation, relative area)\n\nFigure 5: a) color models learnt for red, green, blue, and yellow. For each, the three most frequent patch types are displayed. Notice how each model covers different shades of a color. b+c) geometric properties of the learned models for stripes (b) and dots (c). Both models are binary, have general appearance, i.e. α = (A, A), and adjacent segments, i.e. δ = 1. The figure shows the geometric distributions for the activated geometric properties. Lower elongation values indicate more elongated segments. A blank slot means the property is not active for that attribute. See main text for discussion.\n\nOne last, implicit, parameter is the model complexity: is the attribute unary or binary? This is tackled through model selection: we learn the best unary and binary models independently, and then select the one with the highest likelihood ratio. The comparison is meaningful because image likelihood is measured in the same way in both unary and binary cases (i.e. as the product over the segment probabilities, equation (3)).\n\n5 Experimental results\n\nLearning. We present results on learning four colors (red, green, blue, and yellow) and three patterns (stripes, dots, and checkerboard). 
The positive training set for a color consists of the 14 images in the first page returned by Google-images when queried by the color name. The proportion of positive images unrelated to the color varies between 21% and 36%, depending on the color (e.g. figure 4). The negative training set for a color contains all positive images for the other colors. Our approach delivers an excellent performance. In all cases, the correct model is returned: unary, no active geometric property, and the correct color as a specific appearance (figure 5a).\n\nStripes are learnt from 74 images collected from Google-images using 'striped', 'stripe', 'stripes' as queries. 20% of them don't contain stripes. The positive training set for dots contains 35 images, 29% of them without dots, collected from textile vendors' websites and Google-images (keywords 'dots', 'dot', 'polka dots'). For both attributes, the 56 images for colors act as the negative training set. As shown in figure 5, the learnt models capture well the nature of these attributes. Both stripes and dots are learnt as binary and with general appearance, while they differ substantially in their geometric properties. Stripes are learnt as elongated, rather straight pairs of segments, with largely the same properties for the two segments in a pair. Their layout is meaningful as well: adjacent, nearly parallel, and with similar area. In contrast, dots are learnt as small, unelongated, rather curved segments, embedded within a much larger segment. This can be seen in the distribution of the area of the first segment, the dot, relative to the area of the second segment, the 'background' on which the dots lie. The background segments have a very curved, zigzagging outline, because they circumvent several dots. In contrast to stripes, the two segments that form this dotted pattern are not symmetric in their properties. 
This characteristic is modeled well by our approach, confirming its flexibility. We also train a model from the first 22 Google-images for the query 'checkerboard', 68% of which show a black/white checkerboard. The learnt model is binary, with one segment for a black square and the other for an adjacent white square, demonstrating the learning algorithm correctly infers both models with specific and generic appearance, adapting to the training data.\n\nRecognition. Once a model is learnt, it can be used to recognize whether a novel image contains the attribute, by computing the likelihood (7). Moreover, the area covered by the attribute is localized by the segments with f = 1 (figure 6). We report results for red, yellow, stripes, and dots. All test images are downloaded from Yahoo-images, Google-images, and Flickr. There are 45 (red), 39 (yellow), 106 (stripes), 50 (dots) positive test images. In general, the object carrying the attribute stands against a background, and often there are other objects in the image, making the localization task non-trivial. Moreover, the images exhibit extreme variability: there are paintings as well as photographs, stripes appear in any orientation, scale, and appearance, and they are often deformed\n\nFigure 6: Recognition results. Top row: red (left) and yellow (right). Middle rows: stripes. Bottom row: dots. We give a few example test images and the corresponding localizations produced by the learned models. Segments are colored according to their foreground likelihood, using matlab's jet colormap (from dark blue to green to yellow to red to dark red). Segments deemed not to belong to the attribute are not shown (black). In the case of dots, notice how the pattern is formed by the dots themselves and by the uniform area on which they lie. The ROC plots show the image classification performance for each attribute. 
The two lower curves in the stripes plot correspond to a model without layout, and to a model without either layout or geometry, respectively. Both curves are substantially lower, confirming the usefulness of the layout and shape components of the model.\n\n(human body poses, animals, etc.). The same goes for dots, which can vary in thickness, spacing, and so on. Each positive set is coupled with a negative one, in which the attribute doesn't appear, composed of 50 images from the Caltech-101 'Things' set [12]. Because these negative images are rich in colors, textures and structure, they pose a considerable challenge for the classification task.\n\nAs can be seen in figure 6, our method achieves accurate localizations of the region covered by the attribute. The behavior on stripe patterns composed of more than two appearances is particularly interesting (the trousers in the rightmost example). The model explains them as disjoint groups of binary stripes, with the two appearances which cover the largest image area. In terms of recognizing whether an image contains the attribute, the method performs very well for red and yellow, with ROC equal-error rates above 90%. Performance is convincing also for stripes and dots, especially since these attributes have generic appearance, and hence must be recognized based only on geometry and layout. In contrast, colors enjoy a very distinctive, specific appearance.\n\nReferences\n\n[1] N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR, 2005.\n[2] P. Felzenszwalb and D. Huttenlocher, Efficient Graph-Based Image Segmentation, IJCV, (50):2, 2004.\n[3] R. Fergus, P. Perona, and A. Zisserman, Object Class Recognition by Unsupervised Scale-Invariant Learning, CVPR, 2003.\n[4] N. Jojic and Y. Caspi, Capturing Image Structure with Probabilistic Index Maps, CVPR, 2004.\n[5] S. Lazebnik, C. Schmid, and J. Ponce, A Sparse Texture Representation Using Local Affine Regions, PAMI, (27):8, 2005.\n[6] Y. Liu, Y. Tsin, and W. Lin, The Promise and Perils of Near-Regular Texture, IJCV, (62):1, 2005.\n[7] J. Van de Weijer, C. Schmid, and J. Verbeek, Learning Color Names from Real-World Images, CVPR, 2007.\n[8] M. Varma and A. Zisserman, Texture Classification: Are Filter Banks Necessary?, CVPR, 2003.\n[9] J. Winn, A. Criminisi, and T. Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV, 2005.\n[10] J. Winn and N. Jojic, LOCUS: Learning Object Classes with Unsupervised Segmentation, ICCV, 2005.\n[11] K. Yanai and K. Barnard, Image Region Entropy: A Measure of “Visualness” of Web Images Associated with One Concept, ACM Multimedia, 2005.\n[12] Caltech 101 dataset: www.vision.caltech.edu/Image Datasets/Caltech101/Caltech101.html\n", "award": [], "sourceid": 546, "authors": [{"given_name": "Vittorio", "family_name": "Ferrari", "institution": null}, {"given_name": "Andrew", "family_name": "Zisserman", "institution": null}]}