{"title": "Simultaneous Object Detection and Ranking with Weak Supervision", "book": "Advances in Neural Information Processing Systems", "page_first": 235, "page_last": 243, "abstract": "A standard approach to learning object category detectors is to provide strong supervision in the form of a region of interest (ROI) specifying each instance of the object in the training images. In this work are goal is to learn from heterogeneous labels, in which some images are only weakly supervised, specifying only the presence or absence of the object or a weak indication of object location, whilst others are fully annotated. To this end we develop a discriminative learning approach and make two contributions: (i) we propose a structured output formulation for weakly annotated images where full annotations are treated as latent variables; and (ii) we propose to optimize a ranking objective function, allowing our method to more effectively use negatively labeled images to improve detection average precision performance. The method is demonstrated on the benchmark INRIA pedestrian detection dataset of Dalal and Triggs and the PASCAL VOC dataset, and it is shown that for a significant proportion of weakly supervised images the performance achieved is very similar to the fully supervised (state of the art) results.", "full_text": "Simultaneous Object Detection and Ranking with\n\nWeak Supervision\n\nMatthew B. Blaschko\n\nAndrea Vedaldi\n\nAndrew Zisserman\n\nDepartment of Engineering Science\n\nUniversity of Oxford\n\nUnited Kingdom\n\nAbstract\n\nA standard approach to learning object category detectors is to provide strong su-\npervision in the form of a region of interest (ROI) specifying each instance of\nthe object in the training images [17]. 
In this work our goal is to learn from heterogeneous labels, in which some images are only weakly supervised, specifying only the presence or absence of the object or a weak indication of object location, whilst others are fully annotated.\nTo this end we develop a discriminative learning approach and make two contributions: (i) we propose a structured output formulation for weakly annotated images where full annotations are treated as latent variables; and (ii) we propose to optimize a ranking objective function, allowing our method to more effectively use negatively labeled images to improve detection average precision performance.\nThe method is demonstrated on the benchmark INRIA pedestrian detection dataset of Dalal and Triggs [14] and the PASCAL VOC dataset [17], and it is shown that for a significant proportion of weakly supervised images the performance achieved is very similar to the fully supervised (state of the art) results.\n\n1 Introduction\n\nLearning from weakly annotated data is a long standing goal for the practical application of machine learning techniques to real world data. Expensive manual labeling steps should be avoided if possible, while weakly labeled and unlabeled data sources should be exploited in order to improve performance with little to no additional cost. In this work, we propose a unified framework for learning to detect objects in images from data with heterogeneous labels. 
In particular, we consider\nthe case of image collections for which we would like to predict bounding box localizations, but that\n(for a signi\ufb01cant proportion of the training data) only image level binary annotations are provided\nindicating the presence or absence of an object, or that weak indications of object location are given\nwithout a precise bounding box annotation.\nWe approach this task from the perspective of structured output learning [3, 35, 36], building on the\napproach of Blaschko and Lampert [8], in which a structured output support vector machine formu-\nlation [36] is used to directly learn a regressor from images to object localizations parameterized\nby the coordinates of a bounding box. We extend this framework here to weakly annotated images\nby treating missing information in a latent variable fashion following [2, 40]. Available annotation,\nsuch as the presence or absence of an object in an image, constrains the set of values the latent vari-\nable can take. In the case that complete label information is provided [40] reduces to [36], giving\na uni\ufb01ed framework for data with heterogeneous levels of annotation. We empirically observe that\nthe localization approach of [8] fails in the case that there are many images with no object present,\nmotivating a slight modi\ufb01cation of the learning algorithm to optimize detection ranking analogous\n\n1\n\n\fto [11, 21, 41]. We extend these works to the case that the predictions to be ranked are structured\noutputs. When combined with discriminative latent variable learning, this results in an algorithm\nsimilar to multiple instance ranking [6], but we exploit the full generality of structured output learn-\ning.\nThe computer vision literature has approached learning from weakly annotated data in many differ-\nent ways. 
Search engine results [20] or associated text captions [5, 7, 13, 34] are attractive due to\nthe availability of millions of tagged or captioned images on the internet, providing a weak form of\nlabels beyond unsupervised learning [37]. This generally leads to ambiguity as captions tend to be\ncorrelated with image content, but may contain errors. Alternatively, one may approach the problem\nof object detection by considering generic properties of objects or their attributes in order to com-\nbine training data from multiple classes [1, 26, 18]. Deselaers et al. learn the common appearance\nof multiple object categories, which yields an estimate of where in an image an object is without\nspecifying the speci\ufb01c class to which it belongs [15]. This can then be utilized in a weak supervision\nsetting to learn a detector for a speci\ufb01c object category. Carbonetto et al. consider a Bayesian frame-\nwork for learning across incomplete, noisy, segmentation-level annotation [10]. Structured output\nlearning with latent variables has been proposed for inferring partial truncation of detections due to\nocclusion or image boundaries [38]. Image level binary labels have often been used, as this generally\ntakes less time for a human annotator to produce [4, 12, 23, 28, 30, 31, 33]. Here, we consider this\nlatter kind of weak annotation, and will also consider cases where the object center is constrained to\na region in the image, but that exact coordinates are not given [27]. Simultaneous localization and\nclassi\ufb01cation using a discriminative latent variable model has been recently explored in [29], but\nthat work has not considered mixed annotation, or a structured output loss.\nThe rest of this paper is structured as follows. In Section 2 we review a structured output learning\nformulation for object detection that will form the basis of our optimization. 
We then propose to improve that approach to better handle negative training instances by developing a ranking objective in Section 3. The resulting objective allows us to approach the problem of weakly annotated data in Section 4, and the methods are empirically validated in Section 5.\n\n2 Object Detection with Structured Output Learning\n\nStructured output learning generalizes traditional learning settings to the prediction of more complex output spaces, in which there may be non-trivial interdependencies between components of the output. In our case, we would like to learn a mapping f : X → Y where X is the space of images and Y is the space of bounding boxes or no bounding box: Y ≡ ∅ ∪ {(l, t, r, b)}, where (l, t, r, b) ∈ R^4 specifies the left, top, right, and bottom coordinates of a bounding box. This approach was first proposed by [8] using the Structured Output SVM formulation of [36]:\n\n\\min_{w,\\xi} \\frac{1}{2}\\|w\\|^2 + \\frac{C}{n}\\sum_i \\xi_i \\quad (1)\n\\text{s.t.}\\ \\langle w, \\phi(x_i, y_i)\\rangle - \\langle w, \\phi(x_i, y)\\rangle \\ge \\Delta(y_i, y) - \\xi_i \\quad \\forall i,\\ y \\in \\mathcal{Y} \\setminus \\{y_i\\} \\quad (2)\n\\xi_i \\ge 0 \\quad \\forall i \\quad (3)\n\nwhere Δ(y_i, y) is a loss for predicting y when the true output is y_i, and φ(x_i, y_i) is a joint kernel map that measures statistics of the image, x_i, local to the bounding box, y_i [8, 9].^1 Training is achieved using delayed constraint generation, and at test time, a prediction is made by computing f(x) = argmax_y ⟨w, φ(x, y)⟩.\nIt was proposed in [8] to treat images in which there is no instance of the object of interest as zero vectors in the Hilbert space induced by φ, i.e. φ(x, y⁻) = 0 ∀x, where y⁻ indicates the label that there is no object in the image (i.e. y⁻ ≡ ∅). During training, constraints are generated by finding ỹ*_i = argmax_{y ∈ Y∖{y_i}} ⟨w, φ(x_i, y)⟩ + Δ(y_i, y). 
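To make the constraint generation step concrete, here is a toy sketch (hypothetical linear features and a 1 − IoU loss; the paper's actual feature maps and candidate pools differ): the maximally violated constraint for an image is the candidate box maximizing ⟨w, φ(x, y)⟩ + Δ(y_i, y).

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (l, t, r, b)."""
    il, it = max(a[0], b[0]), max(a[1], b[1])
    ir, ib = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ir - il) * max(0, ib - it)
    area = lambda bb: (bb[2] - bb[0]) * (bb[3] - bb[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def delta(y_true, y):
    """Localization loss: 1 - IoU (a common choice, used here as a stand-in)."""
    return 1.0 - iou(y_true, y)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def most_violated_box(w, phi, x, y_true, candidates):
    """Constraint generation: argmax_y <w, phi(x, y)> + Delta(y_true, y)."""
    return max(candidates, key=lambda y: dot(w, phi(x, y)) + delta(y_true, y))
```

The loss-augmented argmax favors boxes that score highly yet overlap the ground truth poorly, which is exactly what makes the resulting cutting plane informative.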
For negative images, Δ(y⁻, y) = 1 if y indicates an object is present, so the maximization corresponds simply to finding the bounding box ỹ*_i with highest score. The resulting constraint corresponds to:\n\n\\xi_i \\ge 1 + \\langle w, \\phi(x_i, \\tilde{y}^*_i)\\rangle \\quad (4)\n\n^1 As in [8], we make use of the margin rescaling formulation of structured output learning. The slack rescaling variant is equally applicable [36].\n\nwhich tends to decrease the score associated with all bounding boxes in the image. The primary problem with this approach is that it optimizes a regularized risk functional for which negative images are treated equally with positive images. In the case of imbalances in the training data where a large majority of images do not contain the object of interest, the objective function may be dominated by the terms in \\sum_i \\xi_i for which there is no bounding box present. The learning procedure may focus on decreasing the score of candidate detections in negative images rather than on increasing the score of correct detections. We show empirically in Section 5 that this treatment of negative images is in fact detrimental to localization performance. The results presented in [8] were achieved by training only on images with an instance of the object present, ignoring large quantities of negative training data. Although one may attempt to address this problem by adjusting the loss function, Δ, to penalize negative images less than positive images, this approach is heuristic and requires searching over an additional parameter during training (the relative size of the loss for negative images). We address this imbalance more elegantly without introducing additional parameters in the following section.\n\n3 Learning to Rank\n\nWe propose to remedy the shortcomings outlined in the previous section by modifying the objective in Equation (1) to simultaneously localize and rank object detections. 
The following constraints applied to the test set ensure a perfect ranking, that is, that every true detection has a higher score than all false detections:\n\n\\langle w, \\phi(x_i, y_i)\\rangle > \\langle w, \\phi(x_j, \\tilde{y}_j)\\rangle \\quad \\forall i, j,\\ \\tilde{y}_j \\in \\mathcal{Y} \\setminus \\{y_j\\}. \\quad (5)\n\nWe modify these constraints, incorporating a structured output loss, in the following structured output ranking objective\n\n\\min_{w,\\xi} \\frac{1}{2}\\|w\\|^2 + \\frac{C}{n \\cdot n_+}\\sum_{i,j} \\xi_{ij} \\quad (6)\n\\text{s.t.}\\ \\langle w, \\phi(x_i, y_i)\\rangle - \\langle w, \\phi(x_j, \\tilde{y}_j)\\rangle \\ge \\Delta(y_j, \\tilde{y}_j) - \\xi_{ij} \\quad \\forall i, j,\\ \\tilde{y}_j \\in \\mathcal{Y} \\setminus \\{y_j\\} \\quad (7)\n\\xi_{ij} \\ge 0 \\quad \\forall i, j \\quad (8)\n\nwhere n_+ denotes the number of positive instances in the training set. As compared with Equations (1)-(3), we now compare each positive instance to all bounding boxes in all images in the training set instead of just the bounding boxes from the image it comes from. The constraints attempt to give all positive instances a score higher than all negative instances, where the size of the margin is scaled to be proportional to the loss achieved by the negative instance. We note that one can use this same approach to optimize related ranking objectives, such as precision at a given detection rate, by extending the formulations of [11, 41] to incorporate our structured output loss function, Δ.\nAs in [8, 36] we have an intractable number of constraints in Equation (7). We will address this problem using a constraint generation approach with a 1-slack formulation\n\n\\min_{w,\\xi} \\frac{1}{2}\\|w\\|^2 + C\\xi \\quad (9)\n\\text{s.t.}\\ \\sum_{ij} \\left[ \\langle w, \\phi(x_i, y_i)\\rangle - \\langle w, \\phi(x_j, \\tilde{y}_j)\\rangle \\right] \\ge \\sum_{ij} \\Delta(y_j, \\tilde{y}_j) - \\xi \\quad \\forall \\tilde{y} \\in \\prod_j \\mathcal{Y} \\setminus \\{y_j\\} \\quad (10)\n\\xi \\ge 0 \\quad (11)\n\nwhere ỹ is a vector with jth element ỹ_j. 
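Evaluated naively, the per-pair slacks of Equations (6)-(8) require a double loop over all positive/negative pairs. A plain-Python sketch (toy precomputed scores, not the paper's implementation):

```python
def ranking_slacks(pos_scores, neg_scores, neg_losses):
    """Sum of the pairwise slacks xi_ij of Equations (6)-(8):
    xi_ij = max(0, Delta_j - (s_plus_i - s_minus_j)) over all pairs, where
    pos_scores[i] = <w, phi(x_i, y_i)>, neg_scores[j] = <w, phi(x_j, y~_j)>,
    and neg_losses[j] = Delta(y_j, y~_j). Naive O(n_plus * n) double loop."""
    total = 0.0
    for sp in pos_scores:
        for sn, d in zip(neg_scores, neg_losses):
            total += max(0.0, d - (sp - sn))
    return total
```

Each positive detection is asked to outrank each violating region by a margin equal to that region's loss; the quadratic cost of this direct evaluation is what the paper's Algorithm 1 avoids.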
Although this results in a number of constraints exponential in the number of training examples, we can solve this efficiently using a cutting plane algorithm. The proof of equivalence between this optimization problem and that in Equations (6)-(8) is analogous to the proof in [22, Theorem 1]. We are only left to find the maximally violated constraints in Equation (10). Algorithm 1 gives an efficient procedure for doing so.\n\nAlgorithm 1 1-slack structured output ranking – maximally violated constraint.\nEnsure: Maximally violated constraint is δ − ⟨w, ψ⟩ ≤ ξ\nfor all i do\n  s⁺_i = ⟨w, φ(x_i, y_i)⟩\nend for\nfor all j do\n  ỹ*_j = argmax_y ⟨w, φ(x_j, y)⟩ + Δ(y_j, y)\n  s⁻_j = ⟨w, φ(x_j, ỹ*_j)⟩ + Δ(y_j, ỹ*_j)\nend for\n(s⁺, p⁺) = sort(s⁺) {p⁺ is a vector of indices specifying a given score's original index.}\n(s⁻, p⁻) = sort(s⁻)\nδ = 0, k = 1, ψ = φ⁺ = 0\nfor all j do\n  while s⁻_j > s⁺_k ∧ k ≤ n⁺ + 1 do\n    φ⁺ = φ⁺ + φ(x_{p⁺_k}, y_{p⁺_k})\n    k = k + 1\n  end while\n  ψ = ψ + φ⁺ − (k − 1) φ(x_{p⁻_j}, ỹ*_{p⁻_j})\n  δ = δ + (k − 1) Δ(y_j, ỹ*_j)\nend for\n\nAlgorithm 1 works by first scoring all positive regions, as well as finding and scoring the maximally violated regions from each image. We make use of the transitivity of ordering these two sets of scores to avoid comparing all pairs in a naïve fashion. If ⟨w, φ(x_j, ỹ*_j)⟩ ≥ ⟨w, φ(x_i, y_i)⟩ and ⟨w, φ(x_i, y_i)⟩ ≥ ⟨w, φ(x_p, y_p)⟩, we do not have to compare ⟨w, φ(x_j, ỹ*_j)⟩ and ⟨w, φ(x_p, y_p)⟩. Instead, we sort the instances of the class by their score, and sort the negative instances by their score as well. 
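A runnable sketch of this sorting trick, simplified to scalar quantities (the full Algorithm 1 also accumulates the feature vectors into ψ; this toy version only gathers the loss term δ and the violated-pair count):

```python
def one_slack_constraint(pos_scores, viol_scores, viol_losses):
    """Merge-style pass of a simplified, scalar Algorithm 1.
    viol_scores[j] = <w, phi(x_j, y*_j)> + Delta_j for the most violating
    region of image j; viol_losses[j] = Delta_j. After sorting both score
    lists, a single pass counts, for each violating region, how many
    positives it outranks, instead of comparing all pairs.
    Returns (delta, violated_pair_count)."""
    sp = sorted(pos_scores)
    order = sorted(range(len(viol_scores)), key=viol_scores.__getitem__)
    delta_total, pairs, k = 0.0, 0, 0
    for j in order:
        # advance k past every positive scoring below this violating region
        while k < len(sp) and viol_scores[j] > sp[k]:
            k += 1
        delta_total += k * viol_losses[j]
        pairs += k
    return delta_total, pairs
```

Because both lists are traversed in order, the pointer k never moves backwards, giving the near-linear pass the transitivity argument above makes possible.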
We keep an accumulator vector for positive images, \u03c6+, and a count of the number of\nviolated constraints (k \u2212 1). We iterate through each violated region, ordered by score, and sum the\nviolated constraints into \u03c8 and \u03b4, yielding the maximally violated 1-slack constraint.\n\n4 Weakly Supervised Data\n\nNow that we have developed a structured output learning framework that is capable of appropriately\nhandling images from the background class, we turn our attention to the problem of learning with\nweakly annotated data. We will consider the problem in full generality by assuming that we have\nbounding box level annotation for some training images, but only binary labels or weak location\ninformation for others. For negatively labeled images, we know that no bounding box in the entire\nimage contains an instance of the object class, while for positive images at least one bounding box\nbelongs to the class of interest. We approach this issue by considering the location of a bounding\nbox to be a latent variable to be inferred during training. The value that this variable can take is\nconstrained by the weak annotation. In the case that we have only a binary image-level label, we\nconstrain the latent variable to indicate that some region of the image corresponds to the object of\ninterest. In a more constrained case, such as annotation indicating the object center, we constrain\nthe latent variable to belong to the set of bounding boxes that have a center consistent with the anno-\ntation. There is an asymmetry in the image level labeling in that negative labels can be considered\nto be full annotation (i.e. 
all bounding boxes do not contain an instance of the object), while positive labels are incomplete.² We consider the index variable j to range over all completely labeled images, including negative images.\nWe consider a modification of the constrained objective developed in the previous section to include constraints of the form given in Equation (7), but also constraints for our weakly annotated positive images, which we index by m,\n\n\\left( \\max_{\\hat{y}_m \\in \\mathcal{Y}_m} \\langle w, \\phi(x_m, \\hat{y}_m)\\rangle \\right) - \\langle w, \\phi(x_j, \\tilde{y}_j)\\rangle \\ge \\Delta(y_j, \\tilde{y}_j) - \\xi_{mj} \\quad \\forall m, j,\\ \\tilde{y}_j \\in \\mathcal{Y} \\setminus \\{y_j\\}, \\quad (12)\n\n² Note that this is exactly the asymmetry discussed in [2] in the context of multiple instance learning. Our setting can be seen as a generalization to mixed annotations.\n\nwhere Y_m is the set of bounding boxes consistent with the weak annotation for image m. Due to the maximization over ŷ_m, the optimization is no longer convex, but we can find a local optimum using the CCCP algorithm [40]. This is effectively equivalent to the case of loss-rescaled multiple instance learning, and we note that the resulting objective has similarities to that of [2]. Viewed another way, we treat the location of the hypothesized bounding box as a latent variable. In order to use this in our discriminative optimization, we will try to put a large margin between the maximally scoring box and all bounding boxes with high loss. Though our algorithm does not have direct information about the true location of the object of interest, it tries to learn a discriminant function that can distinguish a region in the positively labeled images from all regions in the negatively labeled images.\n\n5 Results\n\nWe validate our model on the benchmark INRIA pedestrian detection dataset of Dalal and Triggs [14] using a histogram of oriented gradients (HOG) representation, and the PASCAL VOC dataset [16, 17]. 
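The CCCP-style alternation of Section 4 — impute the latent boxes for weakly labeled positives by maximizing the current score over their consistent sets Y_m, then re-solve the convex ranking problem with those boxes fixed — can be sketched with hypothetical helpers (`score`, `retrain` are placeholders, not the paper's code):

```python
def cccp_weak_training(weak_images, consistent_boxes, score, retrain, rounds=3):
    """Alternating scheme for latent-variable training (a sketch only).
    weak_images: ids of weakly annotated positive images
    consistent_boxes[m]: candidate boxes consistent with image m's weak label (Y_m)
    score(w, m, box): stands in for <w, phi(x_m, box)>
    retrain(latent): solves the convex ranking problem with latent boxes fixed
    """
    w = retrain({})  # e.g. initialize from the fully annotated images alone
    for _ in range(rounds):
        # concave step: impute each latent box under the current model
        latent = {m: max(consistent_boxes[m], key=lambda b: score(w, m, b))
                  for m in weak_images}
        # convex step: re-solve with the imputed boxes treated as ground truth
        w = retrain(latent)
    return w
```

Each round can only improve the (non-convex) objective, so the loop converges to a local optimum, as the CCCP guarantee in [40] describes.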
Following [9, 24, 25], we provide detailed results on the cat class as the high variation in pose is appropriate for testing a bag of words model, but also provide summary results for all classes in the form of improvement in mean average precision (mean AP). We first illustrate the performance of the ranking objective developed in Section 3 and subsequently show the performance of learning with weakly supervised data using the latent variable approach of Section 4.\n\n5.1 Experimental Setup\n\nWe have implemented variants of two popular object detection systems in order to show the generalization of the approaches developed in this work to different levels of supervision and feature descriptors. In the first variant, we have used a linear bag of words model similar to that developed in [8, 24, 25]. Inference of maximally violated constraints and object detection was performed using Efficient Subwindow Search (ESS) branch-and-bound inference [24, 25]. The joint kernel map, φ, was constructed using a concatenation of the bounding box visual words histogram (the restriction kernel) and a global image histogram, similar to the approach described in [9]. Results are presented on the VOC 2007 dataset [16, 17].\nThe second variant of the detector is based on the histogram of oriented gradients (HOG) representation [14]. HOG subdivides the image into cells, usually of size 8 × 8 pixels, and computes for each cell a weighted histogram of the gradient orientations. The experiments use the HOG variant of [19], which results in a 31-dimensional histogram for each cell. The HOG features are extracted at multiple scales, forming a pyramid. An object is described by a rectangular arrangement of HOG cells (the aspect ratio of the rectangular grouping is fixed). The joint feature map, φ, extracts from the HOG representation of the image the rectangular group of HOG cells at a given scale and location [38]. 
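That restriction can be sketched in a few lines (a hypothetical nested-list layout for one pyramid level; real implementations use contiguous arrays):

```python
def restrict_hog(hog_level, top, left, n_rows, n_cols):
    """phi(x, y) as a restriction: concatenate the per-cell histograms of the
    rectangular group of HOG cells covered by a detection window at one
    pyramid level. hog_level[r][c] is the (e.g. 31-dim) histogram of cell
    (r, c); (top, left) and (n_rows, n_cols) locate and size the window."""
    feats = []
    for r in range(top, top + n_rows):
        for c in range(left, left + n_cols):
            feats.extend(hog_level[r][c])
    return feats
```

Scanning (top, left) over every cell of every pyramid level enumerates the candidate windows that the detector scores against w.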
A constant bias term is appended to the resulting feature [38] for all but the ranking cost\nfunctional, as the bias term cancels out in that formulation. Note that the model is analogous to the\nHOG detector of [14], and in particular does not use \ufb02exible parts as in [19]. Results are presented\nfor the INRIA pedestrian data set [14].\n\n5.2 Learning to Rank\n\nIn order to evaluate the effects of optimizing the ranking objective developed in this work, we begin\nby comparing the performance of the objective in Equations (6)-(8) in a fully supervised setting with\nthat of the objective in Equations (1)-(3), which correspond to the optimization proposed in [8].\nIn Figure 1, we show the relative performance of the linear bag of visual words model applied to the\nPASCAL VOC 2007 data set [17]. We \ufb01rst show results for the cat class in which 10% of negative\nimages are included in the training set (Figure 1(a)), and subsequently results for which all negative\nimages are used for training (Figure 1(b)). While the ranking objective can appropriately handle\nvarying amounts of negative training data, the objective in Equation (1) fails, resulting in worse\nperformance as the amount of negative training data increases. These results empirically show the\nshortcomings of the treatment of negative images proposed in [8], but the ranking objective by\ncontrast is robust to large imbalances between positive and negative images. Mean AP increases by\n69% as a result of using the ranking objective when 10% of negative images are included during\ntraining, and mean AP improves by 71% when all negative images are used.\n\n5\n\n\f(a) cat class trained with 10% of available\nnegative images.\n\n(b) cat class trained with 100% of avail-\nable negative images.\n\nFigure 1: Precision-recall curves for the structured output ranking objective proposed in this paper\n(blue) vs. 
the structured output objective proposed in [8] (red) for varying amounts of negative training data. Results are shown on the cat class from the PASCAL VOC 2007 data set for 10% of negative images (1(a)) and for 100% of negatives (1(b)). In all cases a linear bag of visual words model was employed (see text for details). The structured output objective proposed in [8] performs worse with increasing amounts of negative training data, and the algorithm completely fails in 1(b). The ranking objective, on the other hand, does not suffer from this shortcoming (blue curves).\n\nFigure 2.(a) analyzes the performance of the HOG pedestrian detector on the INRIA data set. Three cost functionals are compared: a simple binary SVM, the structural SVM model of (1), and the ranking SVM model of (6). The INRIA dataset contains 1218 negative images (i.e. images not containing people). Each image is subdivided (in scale and space) into twenty sub-images and a maximally violating window (object location) is extracted from each of those. This results in 24360 negative windows. The dataset also contains 612 positive images, for a total of 1237 labeled pedestrians. Thus there are about twenty times more negative examples than positive ones. Reweighted versions of the binary and structural SVM models that balance the number of positive and negative examples are also tested. As the figure shows, balancing the data in the cost functional is important, especially for the binary SVM model; the ranking model is slightly superior to the other formulations, with average precision of 77%, and does not require an adjustment to the loss to account for a given level of data imbalance. By comparison, the state-of-the-art detector of [32] has average precision 78%. 
We conjecture that this small difference in performance is due to their use of color\ninformation.\n\n5.3 Learning with Weak Annotations\n\nTo evaluate the objective in the case of weak supervision, we have additionally performed experi-\nments in which we have varied the percentage of bounding box annotations provided to the learning\nalgorithm.\nFigure 3 contrasts the performance on the VOC dataset of our proposed discriminative latent vari-\nable algorithm with that of a fully supervised algorithm in which weakly annotated training data are\nignored. We have run the algorithm for 10% of images having full bounding box annotations (with\nthe other 90% weakly labeled) and for 50% of images having complete annotation. In the fully su-\npervised case, we ignore all images that do not have full bounding box annotation and train the fully\nsupervised ranking objective developed in Section 3. In all cases, the latent variable model performs\nconvincingly better than subsampling. For 10% of images fully annotated, mean AP increases by\n64%, and with 50% of images fully annotated, mean AP increases by 83%.\nFigure 2.(b) reports the performance of the latent variable ranking model (8) for the HOG-based de-\ntector on the INRIA pedestrian dataset. Only one positive image is fully labeled with the pedestrian\nbounding boxes while the remaining positive images are weakly labeled. Since most positive images\ncontain multiple pedestrians, the weak annotations carry a minimal amount of information that is\nstill suf\ufb01cient to distinguish the different pedestrian instances. Speci\ufb01cally, the bounding boxes are\ndiscarded and only their centers are kept. 
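The consistent set Y_m for such a center-only annotation can be sketched as a filter over candidate boxes (a hypothetical helper; per the paper, the center distance is bounded by 25% of the box diagonal):

```python
import math

def consistent_with_center(candidates, center, frac=0.25):
    """Y_m for a center-only annotation: keep boxes (l, t, r, b) whose center
    lies within frac * (box diagonal length) of the labeled center. The
    paper uses frac = 0.25, giving robustness to imprecise center labels."""
    keep = []
    for (l, t, r, b) in candidates:
        cx, cy = (l + r) / 2.0, (t + b) / 2.0
        diag = math.hypot(r - l, b - t)
        if math.hypot(cx - center[0], cy - center[1]) <= frac * diag:
            keep.append((l, t, r, b))
    return keep
```

During the latent-variable step, the maximization over ŷ_m would then range only over the boxes this filter retains.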
Estimating the latent variables consists of a search over all object locations and scales for which the corresponding bounding box center is within a given bound of the labeled center (the bound is set to 25% of the length of the box diagonal). In other words, a weak annotation contains only approximate location information. This gives robustness to inaccuracies in manually labeling the centers. The figure shows how the model performs when, in addition to the single fully annotated image, an increasing number of weakly annotated images are added. Starting from 32% AP, the method improves up to 75% AP, which is remarkably similar to the best result (77% AP) obtained with full supervision.\n\nFigure 2: (a) Precision-recall curves for different formulations: binary and structural SVMs, balanced binary and structural SVMs, ranking SVM [APs: 59.51% binary, 73.97% structural, 75.85% binary balanced, 76.23% structural balanced, 77.33% ranking]. The unbalanced SVMs, and in particular the binary one, do not work well due to the large number of negative examples compared to the positive ones. The ranking formulation is slightly better than the other balanced costs for this dataset. (b) Precision-recall curves for increasing amounts of weakly supervised data for the ranking formulation [APs: 31.89% no weak, 50.83% with 50 weak, 54.30% with 100 weak, 59.68% with 200 weak, 66.10% with 500 weak, 75.35% all weak]. For all curves, only one image is fully labeled with bounding boxes around pedestrians, while the other images are labeled only by the pedestrian centers. The first curve (AP 32%) corresponds to the case in which only the fully supervised image is used; the last curve (AP 75%) to the case in which all the other training images are added with weak annotations. The performance is almost as good as the fully supervised case (AP 77%) of (a).\n\nFigure 3: Precision-recall curves for the structured output ranking objective proposed in this paper trained with a linear bag of words image representation and weak supervision (blue) vs. only using fully labeled samples (red). (a) cat class trained with 10% of bounding boxes; (b) cat class trained with 50% of bounding boxes. Results are shown for 10% of bounding boxes (left) and for 50% of bounding boxes (right); the remainder of the images were provided with weak annotation indicating the presence or absence of an object in the image, but not the object location. In both cases, the latent variable model (blue) results in performance that is substantially better than discarding weakly annotated images and using a fully supervised setting (red).\n\n6 Discussion\n\nWe can draw several conclusions from the results in Section 5. First, using the learning formulation developed in [8], negative images are not handled properly, resulting in the undesired behavior that additional negative images in the training data decrease performance. The special case of the objective in Equations (1)-(3), for which no negative training data are incorporated, can be viewed roughly as an estimate of the log probability of an object being present at a location conditioned on an object being present in the image. While this results in reasonable performance in terms of recall (cf. 
[8]), it does not result in a good average precision (AP) score. In fact, the results presented in [8]\nwere computed by training the objective function only on positive images, and then using a separate\nnon-linear ranking function based on global image statistics. Using only positively labeled images\nin the objective presented in Section 2 only incorporates a subset of the constraints in Equation (7)\ncorresponding to i = j. Incorporating all these constraints directly optimizes ranking, enabling the\nuse of all available negative training data to improve localization performance.\nReweighting the loss corresponding to positive and negative examples resulted in similar perfor-\nmance to the ranking objective on the INRIA pedestrian data set, but requires a search across an\nadditional parameter. From the perspective of regularized risk, subsampling negative images can be\nviewed as a noisy version of this reweighting, and experiments on PASCAL VOC using the objec-\ntive in (1) showed poor performance over a wide range of sampling rates. The ranking objective\nby contrast weights loss from the negative examples appropriately (Algorithm 1) according to their\ncontribution to the loss for the precision-recall curve. This is a much more principled and robust\ncriterion for setting the loss function.\nBy using the ranking objective to treat negative images, learning with weak annotations was made\ndirectly applicable using a discriminative latent variable model. Results showed consistent improve-\nment across different proportions of weakly and fully supervised data. Our formulation handled\ndifferent ratios of weakly annotated and fully annotated training data without additional parameter\ntuning in the loss function. The discriminative latent variable approach has been able to achieve\nperformance within a few percent of that achieved by a fully supervised system using only one fully\nsupervised label. 
The weak labels used for the remaining data are significantly less expensive to supply [39]. That this is consistent across the data sets reported here indicates that discriminative latent variable models are a promising strategy for treating weak annotation in general.\n\nAcknowledgments\n\nThe first author is supported by the Royal Academy of Engineering through a Newton International Fellowship. The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement no. 228180, and from the PASCAL2 network of excellence.\n\nReferences\n\n[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2010.\n\n[2] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems, pages 561–568. MIT Press, 2003.\n\n[3] G. H. Bakır, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting Structured Data. MIT Press, 2007.\n\n[4] A. Bar Hillel, T. Hertz, and D. Weinshall. Efficient learning of relational object class models. In Proceedings of the International Conference on Computer Vision, pages 1762–1769, 2005.\n\n[5] T. Berg, A. Berg, J. Edwards, M. Mair, R. White, Y. Teh, E. Learned-Miller, and D. Forsyth. Names and Faces in the News. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, 2004.\n\n[6] C. Bergeron, J. Zaretzki, C. Breneman, and K. P. Bennett. Multiple instance ranking. In Proceedings of the International Conference on Machine Learning, pages 48–55, 2008.\n\n[7] M. B. Blaschko and C. H. Lampert. Correlational spectral clustering. 
In Proceedings of the IEEE Con-\n\nference on Computer Vision and Pattern Recognition, 2008.\n\n[8] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In\n\nProceedings of the European Conference on Computer Vision, 2008.\n\n[9] M. B. Blaschko and C. H. Lampert. Object localization with global and local context kernels. In Pro-\n\nceedings of the British Machine Vision Conference, 2009.\n\n[10] P. Carbonetto, G. Dork\u00b4o, C. Schmid, H. K\u00a8uck, and N. Freitas. Learning to recognize objects with little\n\nsupervision. International Journal of Computer Vision, 77(1\u20133):219\u2013237, 2008.\n\n8\n\n\f[11] O. Chapelle and S. S. Keerthi. Ef\ufb01cient algorithms for ranking with svms. Information Retrieval, 2009.\n[12] O. Chum and A. Zisserman. An exemplar model for learning object classes. In Proceedings of the IEEE\n\nConference on Computer Vision and Pattern Recognition, 2007.\n\n[13] T. Cour, B. Sapp, C. Jordan, and B. Taskar. Learning from ambiguously labeled images. In Proceedings\n\nof the IEEE Conference on Computer Vision and Pattern Recognition, 2009.\n\n[14] N. Dalal and B. Triggs. Histogram of Oriented Gradients for Human Detection. In Proceedings of the\n\nIEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 886\u2013893, 2005.\n\n[15] T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In Proceedings\n\nof the European Conference on Computer Vision, 2010.\n\n[16] M. Everingham, L. Van Gool, C. K.\n\nI. Williams,\n\nJ. Winn,\n\nPASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.\nnetwork.org/challenges/VOC/voc2007/workshop/index.html, 2007.\n\nand A. Zisserman.\n\nThe\nhttp://www.pascal-\n\n[17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object\n\nclasses (voc) challenge. International Journal of Computer Vision, 88(2):303\u2013338, June 2010.\n\n[18] A. Farhadi, I. Endres, D. 
Hoiem, and D. Forsyth. Describing objects by their attributes. Proceedings of\n\nthe IEEE Conference on Computer Vision and Pattern Recognition, pages 1778\u20131785, 2009.\n\n[19] P. Felzenszwalb, D. Mcallester, and D. Ramanan. A discriminatively trained, multiscale, deformable part\n\nmodel. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.\n\n[20] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google\u2019s image\n\nsearch. In Proceedings of the International Conference on Computer Vision, 2005.\n\n[21] T. Joachims. Optimizing search engines using clickthrough data. In KDD \u201902: Proceedings of the eighth\nACM SIGKDD international conference on Knowledge discovery and data mining, pages 133\u2013142, New\nYork, NY, USA, 2002. ACM.\n\n[22] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural svms. Machine Learning,\n\n77(1):27\u201359, 2009.\n\n[23] G. Kim and A. Torralba. Unsupervised detection of regions of interest using iterative link analysis. In\nY. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural\nInformation Processing Systems, pages 961\u2013969. 2009.\n\n[24] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by ef\ufb01-\ncient subwindow search. Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-\ntion, 2008.\n\n[25] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Ef\ufb01cient subwindow search: A branch and bound\nIEEE Transactions on Pattern Analysis and Machine Intelligence,\nframework for object localization.\n2009.\n[26] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class\nattribute transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,\npages 951\u2013958, 2009.\n\n[27] B. Leibe, A. Leonardis, and B. Schiele. 
Combined object categorization and segmentation with an implicit\n\nshape model. In Workshop on Statistical Learning in Computer Vision, ECCV, May 2004.\n\n[28] F. Moosmann, D. Larlus, and F. Jurie. Learning saliency maps for object categorization.\nInternational Workshop on The Representation and Use of Prior Knowledge in Vision, 2006.\n\n[29] M. H. Nguyen, L. Torresani, F. De la Torre Frade, and C. Rother. Weakly supervised discriminative\nlocalization and classi\ufb01cation: A joint learning process. In Proceedings of the International Conference\non Computer Vision, 2009.\n\n[30] A. Opelt, A. Fussenegger, A. Pinz, and P. Auer. Weak hypotheses and boosting for generic object detection\nand recognition. In Proceedings of the 8th European Conference on Computer Vision, Prague, Czech\nRepublic, volume 2, pages 71\u201384, 2004.\n\n[31] A. Opelt and A. Pinz. Object localization with boosting and weak supervision for generic object recogni-\n\nIn ECCV\n\ntion. In Scandinavian Conference on Image Analysis, pages 862\u2013871, 2005.\n\n[32] P. Ott and M. Everingham. Implicit color segmentation features for pedestrian and object detection. In\n\nProceedings of the International Conference on Computer Vision, 2009.\n\n[33] C. Pantofaru and M. Hebert. A framework for learning to recognize and segment object classes using\n\nweakly supervised training data. In Proceedings of the British Machine Vision Conference, 2007.\n\n[34] N. Rasiwasia and N. Vasconcelos. Scene classi\ufb01cation with low-dimensional semantic spaces and weak\nsupervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.\nIn S. Thrun, L. Saul, and\n\n[35] B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks.\n\nB. Sch\u00a8olkopf, editors, Advances in Neural Information Processing Systems. 2004.\n\n[36] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. 
Support vector machine learning for inter-\nIn Proceedings of the International Conference on Machine\n\ndependent and structured output spaces.\nLearning, 2004.\n\n[37] T. Tuytelaars, C. H. Lampert, M. B. Blaschko, and W. Buntine. Unsupervised object discovery: A com-\n\nparison. International Journal of Computer Vision, 88(2):61\u201385, 2010.\n\n[38] A. Vedaldi and A. Zisserman. Structured output regression for detection with partial truncation.\n\nIn\n\nAdvances in Neural Information Processing Systems, 2009.\n\n[39] S. Vijayanarasimhan and K. Grauman. Multi-level active prediction of useful image annotations for recog-\nnition. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information\nProcessing Systems, pages 1705\u20131712. 2009.\n\n[40] C.-N. J. Yu and T. Joachims. Learning structural svms with latent variables.\n\nIn Proceedings of the\n\nInternational Conference on Machine Learning, 2009.\n\n[41] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average preci-\n\nsion. In Special Interest Group on Information Retrieval, 2007.\n\n9\n\n\f", "award": [], "sourceid": 331, "authors": [{"given_name": "Matthew", "family_name": "Blaschko", "institution": null}, {"given_name": "Andrea", "family_name": "Vedaldi", "institution": null}, {"given_name": "Andrew", "family_name": "Zisserman", "institution": null}]}