{"title": "Who\u2019s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation", "book": "Advances in Neural Information Processing Systems", "page_first": 1168, "page_last": 1176, "abstract": "Given a corpus of news items consisting of images accompanied by text captions, we want to find out ``whos doing what, i.e. associate names and action verbs in the captions to the face and body pose of the persons in the images. We present a joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus. These models can then be used to recognize people and actions in novel images without captions. We demonstrate experimentally that our joint `face and pose model solves the correspondence problem better than earlier models covering only the face, and that it can perform recognition of new uncaptioned images.", "full_text": "Who\u2019s Doing What: Joint Modeling of Names and\nVerbs for Simultaneous Face and Pose Annotation\n\nLuo Jie\n\nIdiap and EPF Lausanne\n\njluo@idiap.ch\n\nBarbara Caputo\n\nIdiap Research Institute\nbcaputo@idiap.ch\n\nVittorio Ferrari\n\nETH Zurich\n\nferrari@vision.ee.ethz.ch\n\nAbstract\n\nGiven a corpus of news items consisting of images accompanied by text captions,\nwe want to \ufb01nd out \u201cwho\u2019s doing what\u201d, i.e. associate names and action verbs in\nthe captions to the face and body pose of the persons in the images. We present\na joint model for simultaneously solving the image-caption correspondences and\nlearning visual appearance models for the face and pose classes occurring in the\ncorpus. These models can then be used to recognize people and actions in novel\nimages without captions. We demonstrate experimentally that our joint \u2018face and\npose\u2019 model solves the correspondence problem better than earlier models cover-\ning only the face, and that it can perform recognition of new uncaptioned images.\n\nIntroduction\n\n1\nA huge amount of images with accompanying text captions are available on the Internet. Websites\nselling various items such as houses and clothing provide photographs of their products along with\nconcise descriptions. Online newspapers 1 have pictures illustrating events and comment them in\nthe caption. These news websites are very popular because people are interested in other people,\nespecially if they are famous (\ufb01gure 1). Exploiting the associations between images and text hidden\nin this wealth of data can lead to a virtually in\ufb01nite source of annotations from which to learn visual\nmodels without explicit manual intervention.\n\nThe learned models could then be used in a variety of Computer Vision applications, including face\nrecognition, image search engines, and to annotate new images for which no caption is available.\nMoreover, recovering image-text associations is useful for auto-annotating a closed corpus of data,\ne.g. for users of news website to see \u201cwho\u2019s in the picture\u201d [6], or to search for images where a\ncertain person does a certain thing.\n\nPrevious works on news items has focused on associating names in the captions to faces in the im-\nages [5, 6, 16, 21]. This is dif\ufb01cult due to the correspondence ambiguity problem: multiple persons\nappear in the image and the caption. Moreover, persons in the image are not always mentioned in the\ncaption, and not all names in the caption appear in the image. 
These techniques tackle the correspondence problem by exploiting the fact that different images show different combinations of persons. As a result, they work well for frequently occurring persons (typical for famous people) appearing in datasets with thousands of news items.

In this paper we propose to go beyond the above works by modeling both names and action verbs jointly. These correspond to faces and body poses in the images (figure 3). The connections between the subject (name) and the verb in a caption can be found by well-established language analysis techniques [1, 8]. Essentially, by considering the subject-verb language construct, we generalize the "who's in the picture" line of work to "who's doing what". We present a new generative model where the observed variables are names and verbs in the caption as well as detected persons in the image. The image-caption correspondences are carried by latent variables, while the visual appearances of the face and pose classes corresponding to different names and verbs are model parameters. During learning, we simultaneously solve for the correspondences and learn the appearance models.

^1 www.daylife.com, news.yahoo.com, news.google.com

(a) Four sets ... Roger Federer prepares to hit a backhand in a quarter-final match with Andy Roddick at the US Open.

(b) US Democratic presidential candidate Senator Barack Obama waves to supporters together with his wife Michelle Obama standing beside him at his North Carolina and Indiana primary election night rally in Raleigh.

Figure 1: Examples of image-caption pairs in our dataset. The face and upper body of the persons in the image are marked by bounding-boxes. We stress that a caption might contain names and/or verbs not visible in the image, and vice versa.

In our joint model, the correspondence ambiguity is reduced because the face and pose information help each other. For example, in figure 1b, knowing what 'waves' means would reveal which of the two imaged persons is Obama. The other way around, knowing who Obama is would deliver a visual example for the 'waving' pose.

We show experimentally that (i) our joint 'face and pose' model solves the correspondence problem better than simpler models covering either faces or poses alone; (ii) the learned model can be used to effectively annotate new images with or without captions; (iii) our model with faces alone performs better than existing face-only methods based on Gaussian mixture appearance models.

Related works. This paper is most closely related to the works on associating names and faces discussed above. There also exist works on associating nouns to image regions [2, 3, 10], starting from images annotated with a list of nouns indicating the objects they contain (typical datasets contain natural scenes and objects such as 'water' and 'tiger'). A recent work in this line is that of Gupta and Davis [17], who model prepositions in addition to nouns (e.g. 'bear in water', 'car on street'). To the best of our knowledge, ours is the first work on jointly modeling names and verbs.

2 Generative model for faces and body poses

The news item corpus used to train our face and pose model consists of still images of person(s) performing some action(s). Each image is annotated with a caption describing "who's doing what" in the image (figure 1). Some names from the caption might not appear in the image, and vice-versa some imaged persons might not be mentioned in the caption. The basic units in our model are persons in the image, consisting of their face and upper body. Our system automatically detects them by bounding-boxes in the image using a face detector [23] and an upper-body detector [14]. In the rest of the paper, we say "person" to indicate a detected face and the upper body associated with it (including false positive detections). A face and an upper body are considered to belong to the same person if the face lies near the center of the upper-body bounding-box (this rule is sketched in code below). For each person, we obtain a pose estimate using [11] (figure 3(right)). In addition to these image features, we use a language parser [1, 8] to extract a set of name-verb pairs from each caption. Our goals are to: (i) associate the persons in the images to the name-verb pairs in the captions, and (ii) learn visual appearance models corresponding to names and verbs. These can then be used for recognition on new images with or without captions. Learning in our model can be seen as a constrained clustering problem [4, 24, 25].
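For concreteness, here is a minimal Python sketch of the face-to-upper-body association rule above. The (x, y, width, height) box format, the tolerance value and all function names are our own illustrative assumptions; the paper only specifies that the face must lie near the center of the upper-body bounding-box.

def box_center(box):
    """Return the center of an (x, y, w, h) bounding-box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def associate_faces_to_bodies(face_boxes, body_boxes, tol=0.25):
    """Pair each face with at most one upper body.

    A pair is accepted when the face center falls within a central
    region of the body box (a fraction `tol` of the body size around
    its center). Returns a list of (face_index, body_index_or_None);
    a None entry marks a face with no associated upper body.
    """
    pairs, used_bodies = [], set()
    for i, fbox in enumerate(face_boxes):
        fx, fy = box_center(fbox)
        match = None
        for j, bbox in enumerate(body_boxes):
            if j in used_bodies:
                continue
            bx, by = box_center(bbox)
            _, _, bw, bh = bbox
            if abs(fx - bx) < tol * bw and abs(fy - by) < tol * bh:
                match = j
                used_bodies.add(j)
                break
        pairs.append((i, match))
    return pairs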
2.1 Generative model

We start by describing how our generative model explains the image-caption data (figure 2). The notation is summarized in Table I. Suppose we have a collection of documents D = {D^1, ..., D^M}, with each document D^i consisting of an image I^i and its caption C^i. These captions implicitly provide the labels of the person(s)' name(s) and pose(s) in the corresponding images. For each caption C^i, we consider only the name-verb pairs n^i returned by a language parser [1, 8] and ignore other words. We make the same assumption as for the name-face problem [5, 6, 16, 21] that the labels can only come from the name-verb pairs in the captions or null (for persons not mentioned in the caption). Based on this, we generate the set of all possible assignments A^i from the n^i in C^i (see section 2.4 for details). Hence, we replace the captions by the sets of possible assignments A = {A^1, ..., A^M}.

Table I: The mathematical notation used in the paper

  D = {D^i}_{i=1}^M = {I^i, C^i}_{i=1}^M : the document collection
  M : number of documents (image-caption pairs) in D
  P^i : number of detected persons in image I^i
  I^{i,p} = (I^{i,p}_face, I^{i,p}_pose) : the p-th person in image I^i
  W^i : number of name-verb pairs in caption C^i
  Y = {Y^1, ..., Y^M} : latent variables encoding the true assignments
  Y^i = (y^{i,1}, ..., y^{i,P^i}) : y^{i,p} is the assignment of the p-th person in the i-th image
  A^i = {a^i_1, ..., a^i_{L^i}} : set of possible assignments for document i
  L^i : number of possible assignments for document D^i
  a^i_l = {a^{i,1}_l, ..., a^{i,P^i}_l} : the l-th assignment, where a^{i,p}_l is the label for the p-th person
  Θ = (θ_name, θ_verb) : appearance models for the face and pose classes
  U / V : number of different names / verbs
  θ_name = (θ^1_name, ..., θ^U_name, β_name) ; θ_verb = (θ^1_verb, ..., θ^V_verb, β_verb)
  θ^k : set of class representative vectors for class k ; μ^k_r : a representative vector for class k
  θ^u_name = {μ^{u,1}_face, ..., μ^{u,R_u}_face} ; θ^v_verb = {μ^{v,1}_pose, ..., μ^{v,R_v}_pose}

[Figure 2: Graphical plate representation of the generative model.]
Let Y = {Y^1, ..., Y^M} be latent variables encoding the true assignments (i.e. name/verb labels for the faces/poses), and Y^i = (y^{i,1}, ..., y^{i,P^i}) be the assignment for the P^i persons in the i-th image. Each y^{i,p} = (y^{i,p}_face, y^{i,p}_pose) is a pair of indices defining the assignment of a person's face to a name and of her pose to a verb. These take on values from the set of name indices {1, ..., U, null} and verb indices {1, ..., V, null}. U/V is the number of different names/verbs over all the captions, and null represents unknown names/verbs and false positive person detections.

Document collection likelihood. Assuming independence between documents, the likelihood of the whole document collection is

  P(I, Y, A \mid \Theta) = \prod_{i=1}^{M} P(I^i, Y^i, A^i \mid \Theta) = \prod_{i=1}^{M} P(I^i \mid Y^i, A^i, \Theta)\, P(Y^i \mid A^i, \Theta)\, P(A^i \mid \Theta)    (1)

where Θ are the model parameters explaining the visual appearance of the persons' faces and poses in the images. The image appearance I^i is independent of A^i given Y^i, and neither Y^i nor A^i depends on Θ, so equation (1) can be written as \prod_i P(I^i \mid Y^i, \Theta)\, P(Y^i \mid A^i)\, P(A^i). The goal of learning is to find the parameters Θ and the labels Y that maximize the likelihood. Below we focus on P(I^i | Y^i, Θ), and then define P(Y^i | A^i) and P(A^i) in section 2.4.

Image likelihood. The basic image units in our model are persons. Assuming independence between the multiple persons in an image, the likelihood of an image can be expressed as the product over the likelihoods of each person:

  P(I^i \mid Y^i, \Theta) = \prod_{I^{i,p} \in I^i} P(I^{i,p} \mid y^{i,p}, \Theta)    (2)

where y^{i,p} defines the name-verb indices of the p-th person in the image. A person I^{i,p} = (I^{i,p}_face, I^{i,p}_pose) is represented by the appearance of her face I^{i,p}_face and pose I^{i,p}_pose. Assuming independence between the face and pose appearance of a person, the conditional probability for the appearance of the p-th person in image I^i given the latent variable y^{i,p} is:

  P(I^{i,p} \mid y^{i,p}, \Theta) = P(I^{i,p}_{face} \mid y^{i,p}_{face}, \theta_{name})\; P(I^{i,p}_{pose} \mid y^{i,p}_{pose}, \theta_{verb})    (3)

where Θ = (θ_name, θ_verb) are the appearance models associated with the various names and verbs. Each θ^v_verb in θ_verb = (θ^1_verb, ..., θ^V_verb, β_verb) is a set of representative vectors modeling the variability within the pose class corresponding to a verb v. For example, the verb "serve" in tennis could correspond to different poses such as holding the ball on the racket, tossing the ball and hitting it. Analogously, θ^u_name models the variability within the face class corresponding to a name u.

2.2 Face and pose descriptors and similarity measures

After detecting faces in the images with the multi-view algorithm [23], we use [12] to detect nine distinctive feature points within the face bounding-box (figure 3(left)). Each feature is represented by a SIFT descriptor [18], and their concatenation gives the overall descriptor vector for the face. We use the cosine as a naturally normalized similarity measure between two face descriptors: sim_face(a, b) = a^T b / (\|a\| \|b\|). The distance between two faces is dist_face(a, b) = 1 - sim_face(a, b).
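A minimal sketch of the face similarity just defined, assuming a face is already encoded as the concatenated SIFT descriptor vector (the helper names are ours):

import numpy as np

def sim_face(a, b):
    """Cosine similarity between two face descriptor vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dist_face(a, b):
    """Face distance used in the paper: 1 - cosine similarity."""
    return 1.0 - sim_face(a, b)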
We use [14] to detect upper bodies and [11] to estimate their pose. A pose E consists of a distribution over the position (x, y and orientation) of each of 6 body parts (head, torso, upper/lower left/right arms). The pose estimator factors out variations due to clothing and background, so E conveys purely the spatial arrangement of body parts. We derive three relatively low-dimensional pose descriptors from E, as proposed in [13]. These descriptors represent pose in different ways, such as the relative position between pairs of body parts, and part-specific soft-segmentations of the image (i.e. the probability of each pixel to belong to a part). We refer to [13, 11] for more details and for the similarity measure associated with each descriptor. We normalize the range of each similarity to [0, 1], and denote their average as sim_pose(a, b). The final distance between two poses a, b used in the rest of this paper is dist_pose(a, b) = 1 - sim_pose(a, b).

[Figure 3: Example images with facial features and pose estimates superimposed. Left: facial features (left and right corners of each eye, two nostrils, tip of the nose, and the left and right corners of the mouth) located using [12] in the detected face bounding-box. Right: example estimated poses corresponding to the verbs "hit backhand", "shake hands" and "hold". Red indicates torso, blue upper arms, green lower arms and head. Brighter pixels are more likely to belong to a part. Color planes are added up, so that yellow indicates overlap between lower arm and torso, purple between upper arm and torso, and so on (best viewed in color).]

2.3 Appearance model

The appearance model for a pose class (corresponding to a verb) is defined as:

  P(I^{i,p}_{pose} \mid y^{i,p}_{pose}, \theta_{verb}) = \sum_{k \in \{1,\dots,V,\,null\}} \delta(y^{i,p}_{pose}, k)\; P(I^{i,p}_{pose} \mid \theta^k_{verb})    (4)

where θ^k_verb are the parameters of the k-th pose class (or β_verb if k = null). The indicator function δ(y^{i,p}_pose, k) = 1 if y^{i,p}_pose = k, and δ(y^{i,p}_pose, k) = 0 otherwise. We only explain here the model for a pose class, as the face model is derived analogously.

How to model the conditional probability P(I^{i,p}_pose | θ^k_verb) is a key ingredient for the success of our approach. Some previous works on names-faces used a Gaussian mixture model [6, 21]: each name is associated with a Gaussian density, plus an additional Gaussian to model the null class. Using functions of the exponential family like a Gaussian simplifies computations. However, a Gaussian may restrict the representative power of the appearance model. Problems such as face and pose recognition are particularly challenging because they involve complex non-Gaussian multimodal distributions. Figure 3(right) shows a few examples of the variance within the pose class for a verb. Moreover, we cannot easily employ the existing pose similarity measures [13] in such a model. Therefore, we represent the conditional probability using an exemplar-based likelihood function:

  P(I^{i,p}_{pose} \mid \theta^k_{verb}) =
  \begin{cases}
    \frac{1}{Z_{\theta_{verb}}}\, e^{-d_{pose}(I^{i,p}_{pose},\, \theta^k_{verb})} & \text{if } k \in \{\text{known verbs}\} \\
    \frac{1}{Z_{\theta_{verb}}}\, e^{-\beta_{verb}} & \text{if } k = \text{null}
  \end{cases}    (5)

where Z_{θ_verb} is the normalizer and d_pose is the distance between the pose descriptor I^{i,p}_pose and its closest class representative vector μ^k_r ∈ θ^k_verb = {μ^{k,1}_pose, ..., μ^{k,R_k}_pose}, where R_k is the number of representative poses for verb k. The likelihood depends on the model parameters θ^k_verb and on the distance function d_pose. The scalar β_verb represents the null model: poses assigned to null have likelihood (1/Z_{θ_verb}) e^{-β_verb}. It is important to have this null model, as some detected persons might not correspond to any verb in the caption, or they might be false detections.
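The following sketch illustrates equation (5) up to the constant normalizer Z. The stand-in distance is a plain cosine distance on a single descriptor vector rather than the paper's average over three pose descriptor similarities; all names are illustrative assumptions.

import numpy as np

def dist_pose(a, b):
    # Stand-in for the paper's pose distance (1 - sim_pose); here a
    # simple cosine distance on one descriptor vector, for brevity.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pose_log_likelihood(x, theta_verb, k, beta_verb):
    """Log of equation (5), up to the constant normalizer Z.

    theta_verb maps each verb class k to its list of representative
    vectors mu^{k,r}; k is None for the null class."""
    if k is None:
        return -beta_verb
    # distance to the closest class representative vector
    return -min(dist_pose(x, mu) for mu in theta_verb[k])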
By generalizing the similarity measure sim_pose(a, b) as a kernel product K(a, b) = φ(a) · φ(b), the distance from a vector a to the sample center vector μ^k_r can be written as in the weighted kernel k-means method [9]:

  \Big\| \phi(a) - \frac{\sum_{b \in \pi^k_r} w(b)\,\phi(b)}{\sum_{b \in \pi^k_r} w(b)} \Big\|^2
  = K(a,a) - \frac{2\sum_{b \in \pi^k_r} w(b)\,K(a,b)}{\sum_{b \in \pi^k_r} w(b)}
  + \frac{\sum_{b,d \in \pi^k_r} w(b)\,w(d)\,K(b,d)}{\big(\sum_{b \in \pi^k_r} w(b)\big)^2}    (6)

The center vector μ^k_r is defined as \big(\sum_{b \in \pi^k_r} w(b)\,\phi(b)\big) / \big(\sum_{b \in \pi^k_r} w(b)\big), where π^k_r is the cluster of vectors assigned to μ^k_r, and w(b) is the weight of each point b, representing the likelihood that b belongs to the class of μ^k_r (as in equation (11)). This formulation can be considered a modified version of the k-means clustering algorithm [19]. The number of centers R_k can vary for different verbs, depending on the distribution of the data and the number of samples. As we are interested only in computing the distance between μ^k_r and each data point, and not in the explicit value of μ^k_r, the only term that needs to be computed in equation (6) is the second one (the third is constant for each assigned μ^k_r).
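A sketch of equation (6): the squared feature-space distance between a point and the weighted center of a cluster, computed purely through kernel evaluations. The kernel choice and helper names are our own; any positive semi-definite kernel inducing sim_pose could be plugged in.

import numpy as np

def cosine_kernel(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def kernel_dist2(a, cluster, weights, K=cosine_kernel):
    """Squared distance from `a` to the weighted center of `cluster`
    (a list of descriptors) in the feature space induced by K."""
    w = np.asarray(weights, float)
    s = w.sum()
    first = K(a, a)                                   # K(a, a)
    second = 2.0 * sum(wi * K(a, b)                   # the only term that
                       for wi, b in zip(w, cluster)) / s  # varies with a
    third = sum(wi * wj * K(b, d)                     # constant per cluster
                for wi, b in zip(w, cluster)
                for wj, d in zip(w, cluster)) / (s * s)
    return first - second + third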
2.4 Name-verb assignments

The name-verb pairs n^i for a document are observed in its caption C^i. We derive from them the set A^i = {a^i_1, ..., a^i_{L^i}} of all possible assignments of name-verb pairs to persons in the image. The number of possible assignments L^i depends both on the number of persons and on the number of name-verb pairs. As opposed to the standard matching problem, here the assignments have to take null into account. Moreover, we have the same constraints as in the name-face problem [6]: a person can be assigned to at most one name-verb pair, and vice-versa. Therefore, given a document with P^i persons and W^i name-verb pairs, the number of possible assignments is

  L^i = \sum_{j=0}^{\min(P^i, W^i)} \binom{P^i}{j} \binom{W^i}{j}\, j!

where j is the number of persons assigned to a name-verb pair instead of null. Even with the above constraints, this number grows rapidly with P^i and W^i. However, since different assignments share many common sub-assignments, the number of unique likelihood computations is much lower, namely P^i · (W^i + 1). Thus, we can evaluate all possible assignments for an image efficiently (a counting sketch is given at the end of this section). Although certain assignments are unlikely (e.g. all persons assigned to null), here we use a uniform prior over all assignments, i.e. P(a^i_l) = 1/L^i. Since the true assignment Y^i can only come from A^i, we define the conditional probability over the latent variables Y^i as:

  P(Y^i \mid A^i) = \begin{cases} 1/L^i & \text{if } Y^i \in A^i \\ 0 & \text{otherwise} \end{cases}    (7)

The latent assignments Y^i play the role of the annotations necessary for learning the appearance models.
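To make the counting above concrete, here is a sketch that enumerates and counts the valid assignments for one document, with persons matched injectively to name-verb pairs or to null. It is a direct transcription of the formula, not code from the paper.

from math import comb, factorial
from itertools import combinations, permutations

def count_assignments(P, W):
    """The L^i of section 2.4: ways to assign P persons to W name-verb
    pairs or null, each pair used at most once."""
    return sum(comb(P, j) * comb(W, j) * factorial(j)
               for j in range(min(P, W) + 1))

def enumerate_assignments(P, W):
    """Yield each assignment as a length-P tuple whose entries are a
    pair index in 0..W-1 or None (the null label)."""
    for j in range(min(P, W) + 1):
        for persons in combinations(range(P), j):
            for pairs in permutations(range(W), j):
                a = [None] * P
                for p, w in zip(persons, pairs):
                    a[p] = w
                yield tuple(a)

# e.g. count_assignments(2, 2) == 7, matching the enumeration:
# both null; either person assigned to either pair; both assigned.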
3 Learning the model

The task of learning is to find the model parameters Θ and the assignments Y which maximize the likelihood of the complete dataset {I, Y, A}. The joint probability of {I, Y, A} given Θ from equation (1) can be written as

  P(I, Y, A \mid \Theta) = \prod_{i=1}^{M} \Big( P(Y^i \mid A^i)\, P(A^i) \prod_{p=1}^{P^i} P(I^{i,p}_{face} \mid y^{i,p}_{face}, \theta_{name})\; P(I^{i,p}_{pose} \mid y^{i,p}_{pose}, \theta_{verb}) \Big)    (8)

Maximizing the log of this joint likelihood is equivalent to minimizing the following clustering objective function over the latent variables Y and parameters Θ:

  J = \sum_{i,p:\, y^{i,p}_{face} \ne null} d_{face}\big(I^{i,p}_{face}, \theta^{y^{i,p}_{face}}_{name}\big)
      + \sum_{i,p:\, y^{i,p}_{face} = null} \beta_{name}
      + \sum_{i,p:\, y^{i,p}_{pose} \ne null} d_{pose}\big(I^{i,p}_{pose}, \theta^{y^{i,p}_{pose}}_{verb}\big)
      + \sum_{i,p:\, y^{i,p}_{pose} = null} \beta_{verb}
      - \sum_{i} \big( \log P(Y^i \mid A^i) + \log P(A^i) \big)
      + \sum_{i,p} \big( \log Z_{\theta_{name}} + \log Z_{\theta_{verb}} \big)    (9)

To minimize J, each latent variable Y^i must belong to the set of possible assignments A^i. If Y were known, the cluster centers μ ∈ θ_name, μ ∈ θ_verb which minimize J could be determined uniquely (given also the number of class centers R). However, it is difficult to set R before seeing the data. In our implementation, we determine the centers approximately using the data points and their K nearest neighbors. Since estimating the normalization constants Z_{θ_name} and Z_{θ_verb} is computationally expensive, we approximate by considering them constant in the clustering process (i.e. we drop their terms from J). In our experiments, this did not significantly affect the results, as also noted in several other works (e.g. [4]).

Since the assignments Y are unknown, we use a generalized EM procedure [7, 22] to simultaneously learn the parameters Θ and solve the correspondence problem (i.e. find Y):

Input. Data D; hyper-parameters β_name, β_verb, K.

1. Initialization. We start by computing the distance matrices between faces/poses from images sharing some name/verb in the caption. Next we initialize Θ using all documents in D. For each different name/verb, we select all captions containing only this name/verb. If the corresponding images contain only one person, their faces/poses are used to initialize the center vectors θ^k_name/θ^k_verb. The center vectors are found approximately using each data point and its K nearest neighbors of the same name/verb class. If a name/verb only appears in captions with multiple names/verbs, or if the corresponding images always contain multiple persons (e.g. verbs like "shake hands"), we randomly assign the name/verb to one face/pose in each image. The center vectors are then initialized using these data points. The initial weights w for all data points are set to one (equation 6). This step yields an initial estimate of the model parameters Θ. We refine the parameters and assignments by repeating the following EM steps until convergence.

2. E-step. Compute the labels Y using the parameters Θ^old from the previous iteration:

  \arg\max_Y P(Y \mid I, A, \Theta^{old}) \;\propto\; \arg\max_Y P(I \mid Y, \Theta^{old})\, P(Y \mid A)    (10)

3. M-step. Given the labels Y, update Θ so as to minimize J (i.e. update the cluster centers μ). Our algorithm assigns each point to exactly one cluster. Each point I^{i,p} in a cluster is given a weight

  w^{i,p}_{Y^i} = \frac{P(Y^i \mid I^{i,p}, A^i, \Theta)}{\sum_{Y^j \in A^i} P(Y^j \mid I^{i,p}, A^i, \Theta)}    (11)

which represents the likelihood that I^{i,p}_face and I^{i,p}_pose belong to the name and verb defined by Y^i. Therefore, faces and poses from images with many detections have lower weights and contribute less to the cluster centers, reflecting the larger uncertainty in their assignments.
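The following is a deliberately simplified, self-contained sketch of this generalized EM loop. It deviates from the paper in labelled ways: each person carries a single descriptor and a single class label (standing in for the name-verb pair), distances are Euclidean to a single center per class (R_k = 1) rather than the kernelized exemplar distance of equation (6), and the per-document weight of equation (11) is taken as the probability of the selected assignment. All function and variable names are ours.

import numpy as np

def neg_dist(x, k, centers, beta):
    """Equation (5) up to the normalizer: minus the distance to the
    class center, or -beta for the null class (k is None)."""
    if k is None or k not in centers:
        return -beta
    return -float(np.linalg.norm(np.asarray(x, float) - centers[k]))

def e_step(documents, centers, beta):
    """Equation (10): per document, pick the best assignment; the soft
    probabilities stand in for the weights of equation (11)."""
    labels, weights = [], []
    for xs, assigns in documents:   # descriptors, candidate assignments
        scores = np.array([sum(neg_dist(x, k, centers, beta)
                               for x, k in zip(xs, a)) for a in assigns])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        best = int(np.argmax(probs))
        labels.append(assigns[best])
        weights.append(float(probs[best]))
    return labels, weights

def m_step(documents, labels, weights):
    """Weighted mean per class: a k-means-style stand-in for the
    weighted kernel k-means update built on equation (6)."""
    acc = {}
    for (xs, _), a, w in zip(documents, labels, weights):
        for x, k in zip(xs, a):
            if k is not None:
                s, n = acc.get(k, (0.0, 0.0))
                acc[k] = (s + w * np.asarray(x, float), n + w)
    return {k: s / n for k, (s, n) in acc.items()}

def learn(documents, centers, beta=1.0, n_iter=5):
    """Alternate E- and M-steps; the paper reports convergence in
    about five iterations on its dataset."""
    for _ in range(n_iter):
        labels, weights = e_step(documents, centers, beta)
        centers = m_step(documents, labels, weights)
    return centers, labels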
[Figure 4: Left: comparison of different models under different setups: using the manually annotated name-verb pairs (ground-truth); using the Named Entity detector and language parser (automated); and using the more difficult subset (multiple). The accuracies for name (Name Ass.) and verb (Verb Ass.) assignments are reported separately. GMM Face refers to the face-only model using GMM appearance models, as in [6]. Right: comparison of precision and recall for 10 individuals using the stripped-down face-only model and our face+pose model. The reported results are based on automatically parsed captions for learning.]

4 Experiments and conclusions

Datasets. There are existing datasets of news image-caption pairs, such as those in [6, 16]. Unfortunately, these datasets are not suitable in our scenario for two reasons. First, faces often occupy most of the image, so the body pose is not visible. Second, the captions frequently describe the event at an abstract level, rather than using a verb to describe the actions of the persons in the image (compare figure 1 to the figures in [6, 16]). Therefore, we collected a new dataset^2 by querying Google-images using a combination of names and verbs (from sports and social interactions) corresponding to distinct upper-body poses. An example query is "Barack Obama" + "shake hands". Our dataset contains 1610 images, each with at least one person whose face occupies less than 5% of the image, together with the accompanying snippet of text returned by Google-images. External annotators were asked to extend these snippets into realistic captions when necessary, with varied long sentences mentioning the action of the persons in the image as well as names/verbs not appearing in the image (as 'noise', figure 1). Moreover, they also annotated the ground-truth name-verb pairs mentioned in the captions, as well as the location of the target persons in the images, enabling quantitative evaluation. In total the ground-truth consists of 2627 name-verb pairs. In our experiments we only consider names occurring in at least 3 captions and verbs occurring in at least 20 captions. This leaves 69 names corresponding to 69 face classes and 20 verbs corresponding to 20 pose classes.

^2 We released this dataset online at http://www.vision.ee.ethz.ch/~ferrari

We used an open-source Named Entity recognizer [1] to detect names in the captions, and a language parser [8] to find name-verb pairs (or name-null if the language parser could not find a verb associated with a name). By using simple stemming rules, the same verb under different tenses and possessive adjectives is merged into one action verb. For instance, "shake their hands", "is shaking hands" and "shakes hands" all correspond to the action verb "shake hands".
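A toy sketch of this merging step; the lookup table and stopword list are illustrative stand-ins for the paper's actual stemming rules.

CANONICAL = {
    "shakes hands": "shake hands",
    "shaking hands": "shake hands",
    "waves": "wave",
    "waving": "wave",
}

STOPWORDS = {"is", "are", "was", "were", "his", "her", "their"}

def normalize_verb(phrase):
    """Merge tenses and possessive adjectives into one action verb."""
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    key = " ".join(words)
    return CANONICAL.get(key, key)

# normalize_verb("is shaking hands") == normalize_verb("shakes hands")
#                                    == normalize_verb("shake their hands")
#                                    == "shake hands"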
In total, the parsing achieves 85.5% precision and 68.8% recall on our dataset with respect to the ground-truth name-verb pairs. After discarding infrequent names and verbs as explained above, we retain 85 names and 20 verbs to be learned by our model (recall that some of these are false positives rather than actual person names and action verbs).

Results for learning. The learning algorithm takes about five iterations to converge. We compare experimentally our face and pose model to stripped-down versions using only face or only pose information. For comparison, we also implement the constrained Gaussian mixture model [6] described in section 2.3. Although [6] originally also incorporates a language model of the caption, we discard it here so that both methods use the same amount of information. We run the experiments in three setups: (a) using the ground-truth name-verb annotations from the captions; (b) using the name-verb pairs automatically extracted by the language parser; (c) as (b), but only on documents with multiple persons in the image or multiple name-verb pairs in the caption. These setups are progressively more difficult, as (b) has more noisy name-verb pairs, and (c) has no documents with a single name and person, where our initialization is very reliable.

Figure 4(left) compares the accuracy achieved by the different models on these setups. The accuracy is defined as the percentage of correct assignments over all detected persons, including assignments to null, as in [5, 16]. As the figure shows, our joint 'face and pose' model outperforms both models using face or pose alone in all setups. Both the annotation of faces and that of poses improve, demonstrating that they help each other when successfully integrated by our model. This is the main point of the paper. Figure 4(right) shows improvements in precision and recall over the models using faces or poses alone. As a second point, our model with face alone also outperforms the baseline approach using Gaussian mixture appearance models (e.g. used in [6]). Figure 5 shows a few examples of how including pose improves the learning results and solves some of the correspondence ambiguities. Improvements happen mainly in three situations: (a) when there are multiple names in a caption, as not all names in the captions are associated to action verbs (figure 1(a) and figure 5(top)); (b) when there are multiple persons in an image, because the pose disambiguates the assignment (figure 1(b) and figure 5(bottom)); and (c) when there are false detections, rare faces, or faces at viewpoints other than frontal (i.e. where face recognition works less well, e.g. figure 5(middle)).

[Figure 5: Examples of when modeling pose improves the results at learning time. Below the images we report the name-verb pairs (C) from the caption as returned by the automatic parser, and compare the associations recovered by a model using only faces (F) and by one using both faces and poses (FP). The assigned names (left to right) correspond to the detected face bounding-boxes (left to right).]

Results for recognition. Once the model is learned, we can use it to recognize "who's doing what" in novel images with or without captions. We collected a new set of 100 images and captions from Google-images using five keywords based on names and verbs from the training dataset (Federer backhand, Sharapova hold trophy, Nadal forehand, Obama wave, Hu Jintao shake hands). We evaluate the learned model in two scenarios: (a) the test data consists of images and captions.
Here we run inference on the model, recovering the best assignment Y from the set of possible assignments generated from the captions; (b) the same test images are used but the captions are not given, so the problem degenerates to a standard face and pose recognition task. Figure 6(left) reports face annotation accuracy for three methods using captions (scenario (a)): (⋄) a baseline which randomly assigns a name (or null) from the caption to each face in the image; (×) our face and pose model; (□) our model using only faces. The figure also shows results for scenario (b), where our full model tries to recognize faces (+) and poses (△) in the test images without captions. In scenario (a) all models outperform the baseline, and our joint face and pose model improves significantly over the face-only model for all keywords, especially when there are multiple persons in the image.

[Figure 6: Recognition results on images without text captions (using models learned from automatically parsed captions). Left: face annotation accuracy using different models and scenarios (see main text). Right: a few examples of the labels predicted by the joint face and pose model (without using captions).]

Conclusions. We present an approach for the joint modeling of faces and poses in images and their association to names and action verbs in accompanying text captions. Experimental results show that our joint model performs better than face-only models both in solving the image-caption correspondence problem on the training data and in annotating new images. Future work aims at incorporating an effective web crawler and html/language parsing tools to harvest image-caption pairs from the internet fully automatically. Other techniques such as learning distance functions [4, 15, 20] may also be incorporated during learning to improve recognition results.

Acknowledgments. We thank K. Deschacht and M.-F. Moens for providing the language parser. L. J. and B. Caputo were supported by EU project DIRAC IST-027787, and V. Ferrari by the Swiss National Science Foundation.

References

[1] http://opennlp.sourceforge.net/
[2] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. Blei, and M. Jordan. Matching words and pictures. JMLR, 3:1107-1135, 2003.
[3] K. Barnard and Q. Fan. Reducing correspondence ambiguity in loosely labeled training data. In Proc. CVPR'07.
[4] S. Basu, M. Bilenko, A. Banerjee, and R. J. Mooney. Probabilistic semi-supervised clustering with constraints. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning, pages 71-98. MIT Press, 2006.
[5] T. Berg, A. Berg, J. Edwards, and D. Forsyth. Names and faces in the news. In Proc. CVPR'04.
[6] T. Berg, A. Berg, J. Edwards, and D. Forsyth. Who's in the picture? In Proc. NIPS'04.
[7] A. P. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1-38, 1977.
[8] K. Deschacht and M.-F. Moens. Semi-supervised semantic role labeling using the latent words language model. In Proc. EMNLP'09.
[9] I. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proc. KDD'04.
[10] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proc. ECCV'02.
[11] M. Eichner and V. Ferrari. Better appearance models for pictorial structures. In Proc. BMVC'09.
[12] M. Everingham, J. Sivic, and A. Zisserman. Hello! My name is... Buffy - automatic naming of characters in TV video. In Proc. BMVC'06.
[13] V. Ferrari, M. Marin, and A. Zisserman. Pose search: retrieving people using their pose. In Proc. CVPR'09.
[14] V. Ferrari, M. Marin, and A. Zisserman. Progressive search space reduction for human pose estimation. In Proc. CVPR'08.
[15] A. Frome, Y. Singer, and J. Malik. Image retrieval and classification using local distance functions. In Proc. NIPS'06.
[16] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Automatic face naming with caption-based supervision. In Proc. CVPR'08.
[17] A. Gupta and L. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In Proc. ECCV'08.
[18] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[19] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[20] T. Malisiewicz and A. Efros. Recognition by association via learning per-exemplar distances. In Proc. CVPR'08.
[21] T. Mensink and J. Verbeek. Improving people search using query expansions: How friends help to find people. In Proc. ECCV'08.
[22] R. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355-368. Kluwer Academic Publishers, 1998.
[23] Y. Rodriguez. Face Detection and Verification using Local Binary Patterns. PhD thesis, École Polytechnique Fédérale de Lausanne, 2006.
[24] N. Shental, A. Bar-Hillel, T. Hertz, and D. Weinshall. Computing Gaussian mixture models with EM using equivalence constraints. In Proc. NIPS'03.
[25] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. In Proc. ICML'01.
", "award": [], "sourceid": 1159, "authors": [{"given_name": "Jie", "family_name": "Luo", "institution": null}, {"given_name": "Barbara", "family_name": "Caputo", "institution": null}, {"given_name": "Vittorio", "family_name": "Ferrari", "institution": null}]}