{"title": "Learning From Weakly Supervised Data by The Expectation Loss SVM (e-SVM) algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1125, "page_last": 1133, "abstract": "In many situations we have some measurement of confidence on ``positiveness for a binary label. The ``positiveness\" is a continuous value whose range is a bounded interval. It quantifies the affiliation of each training data to the positive class. We propose a novel learning algorithm called \\emph{expectation loss SVM} (e-SVM) that is devoted to the problems where only the ``positiveness\" instead of a binary label of each training sample is available. Our e-SVM algorithm can also be readily extended to learn segment classifiers under weak supervision where the exact positiveness value of each training example is unobserved. In experiments, we show that the e-SVM algorithm can effectively address the segment proposal classification task under both strong supervision (e.g. the pixel-level annotations are available) and the weak supervision (e.g. only bounding-box annotations are available), and outperforms the alternative approaches. Besides, we further validate this method on two major tasks of computer vision: semantic segmentation and object detection. Our method achieves the state-of-the-art object detection performance on PASCAL VOC 2007 dataset.\"", "full_text": "Learning From Weakly Supervised Data by The\n\nExpectation Loss SVM (e-SVM) Algorithm\n\nJun Zhu, Junhua Mao, Alan Yuille\n\nDepartment of Statistics\n\nUniversity of California, Los Angeles\n\n{jzh@,mjhustc@,yuille@stat.}ucla.edu\n\nAbstract\n\nIn many situations we have some measurement of con\ufb01dence on \u201cpositiveness\u201d\nfor a binary label. The \u201cpositiveness\u201d is a continuous value whose range is a\nbounded interval. It quanti\ufb01es the af\ufb01liation of each training data to the positive\nclass. 
We propose a novel learning algorithm called expectation loss SVM (e-SVM) that is devoted to problems where only the "positiveness", instead of a binary label, of each training sample is available. Our e-SVM algorithm can also be readily extended to learn segment classifiers under weak supervision, where the exact positiveness value of each training example is unobserved. In experiments, we show that the e-SVM algorithm can effectively address the segment proposal classification task under both strong supervision (e.g., pixel-level annotations are available) and weak supervision (e.g., only bounding-box annotations are available), and outperforms the alternative approaches. Besides, we further validate this method on two major tasks of computer vision: semantic segmentation and object detection. Our method achieves state-of-the-art object detection performance on the PASCAL VOC 2007 dataset.

1 Introduction

Recent work in computer vision relies heavily on manually labeled datasets to achieve satisfactory performance. However, detailed hand-labelling of datasets is expensive and impractical for large datasets such as ImageNet [6]. It is better to have learning algorithms that can work with data that has only been weakly labelled, for example by putting a bounding box around an object instead of segmenting it or parsing it into parts.
In this paper we present a learning algorithm called expectation loss SVM (e-SVM). It requires a method that can generate a set of proposals for the true class label (e.g., the exact silhouette of the object). But this set of proposals may be very large, each proposal tends to be only partially correct (the correctness can be quantified by a continuous value between 0 and 1 called "positiveness"), and several proposals may be required to obtain the correct label.
In the training stage, our algorithm can deal with the strongly supervised case, where the positiveness of the proposals is observed, and extends easily to the weakly supervised case by treating the positiveness as a latent variable. In the testing stage, it predicts the class label for each proposal and provides a confidence score.
There are some alternative approaches to this problem, such as support vector classification (SVC), support vector regression (SVR), and logistic regression (LR). For the SVC algorithm, because this is not a standard binary classification problem, one needs to binarize the positiveness using ad-hoc heuristics to determine a threshold, which may degrade performance [19]. To address this problem, previous works usually use SVR [4, 19] to train the class confidence scoring model in the semantic segmentation task. We compare our e-SVM to these three related methods on the segment proposal class confidence prediction problem. The positiveness of each segment proposal is set as the intersection over union (IoU) overlap rate between the proposal and the pixel-level instance ground truth.

Figure 1: Illustration of the framework for class confidence prediction of segment proposals. In training, our e-SVM algorithm can handle two different annotation types: pixel-level (strong supervision) and bounding-box (weak supervision) annotations. For pixel-level annotation, we set the positiveness of segment proposals as the IoU overlap rate w.r.t. the ground truth and train scoring models with the basic e-SVM. For bounding-box annotation, we treat the positiveness as a latent variable and use the latent e-SVM version for training scoring models. In the testing stage, the learned scoring model predicts confidence scores of segment proposals for each object class. (Best viewed in color)
We test our algorithm under two different data annotation scenarios: pixel-level annotation (positiveness is observed) and bounding-box annotation (positiveness is unobserved). The experimental results show that our e-SVM outperforms SVC, SVR, and LR in both scenarios. Figure 1 illustrates the framework for class confidence prediction of segment proposals.
We further validate our approach on two fundamental computer vision tasks: (i) semantic segmentation, and (ii) object detection. Firstly, we consider semantic segmentation. There has recently been impressive progress on this task using rich appearance cues. Segments are extracted from images [1, 3, 4, 12], appearance features are computed for each segment [5, 22, 26], and classifiers are trained using groundtruth pixel labeling [19]. Methods of this type are almost always among the winners of the PASCAL VOC segmentation challenge [5]. But all these methods rely on datasets which have been hand-labelled at the pixel level. For this application we generate the segment proposals using CPMC segments [4]. The positiveness of each proposal is set as the IoU overlap rate. The class confidence scoring models learnt by our e-SVM, using either the pixel-level or bounding-box annotation, obtain semantic segmentation accuracy comparable to the state-of-the-art learning algorithm used in the semantic segmentation literature.
Secondly, we address object detection by exploiting the effectiveness of segments' appearance cues and coupling them to existing object detection systems. For this application, the data is only weakly labeled, because the ground-truth annotation for object detection is typically specified by bounding boxes (e.g., PASCAL VOC [8, 9] and ImageNet [6]), which means that the pixel-level ground truth is not available. We also use the CPMC method to produce segment proposals. The IoU w.r.t.
object instance bounding boxes is used to represent the positiveness of the proposals. We test our approach on the PASCAL VOC dataset using, as the base detector, the regions with CNN features (RCNN) method [14] (currently the state of the art on PASCAL VOC, outperforming previous works by a large margin). This method first uses the selective search method [25] to extract candidate bounding boxes. For each candidate bounding box, it extracts features with a deep network [17] learned on the ImageNet dataset and fine-tuned on PASCAL. We couple the segment-based appearance cues to this system by simply concatenating a new segment confidence map feature, based on the learned e-SVM models, with the deep learning feature, and then training a linear SVM. We show that this simple approach yields a gain of 1.5 percent in per-class mean average precision (mAP) over the state-of-the-art RCNN feature on the PASCAL VOC 2007 dataset.
Note that this approach is general. It can use any segment proposal detector, any image features, and any classifier.
When applied to object detection it could use any base detector, and we could couple the appearance cues with the base detector in many different ways (we choose the simplest). In addition, it can handle other problems where only "positiveness" values, instead of binary labels, are available in training.

2 Related work on weakly supervised learning and weighted SVMs

We have introduced some of the most relevant works on semantic segmentation and object detection above. In this section, we briefly review related work on weakly supervised learning methods for segment classification, and discuss the connection to instance-weighted SVM approaches in the literature.
The problem settings of most previous works generally assume that only a set of words accompanying an image, or a set of image-level labels, is available, which is different from the problem setting in this paper. Multiple instance learning (MIL) algorithms [7, 2] were adopted to solve these problems [21, 23]. MIL handles situations where at least one positive instance is present in each positive bag and only the bag labels of training examples are available. Vezhnevets et al. [27] proposed a multi-image model (MIM) to solve this weakly-supervised learning problem. Recently, Liu et al. [20] presented a weakly-supervised dual clustering approach to handle this task.
Our weakly supervised problem setting lies between these settings and the strongly supervised case (i.e., full pixel-level annotations are available). It is also very important and useful because bounding-box annotations of large-scale image datasets are already available (e.g., ImageNet [6]) while pixel-level annotations of large datasets are still hard to obtain.
This weakly supervised problem cannot be solved by MIL: we cannot assume that at least one "completely" positive instance (i.e., a CPMC segment proposal) is present in a positive bag (i.e., a groundtruth object instance), since most of the proposals contain both foreground and background pixels. We show how our e-SVM and its latent extension address this problem in the next sections.
In the machine learning literature, the weighted SVM (WSVM) approaches [24, 28, 18] also place an instance-dependent weight on the cost of each example; they can improve the robustness of model estimation [24], alleviate the effect of outliers [28], leverage privileged information [18], or deal with unbalanced classification problems. The main difference between our e-SVM and WSVMs is that e-SVM weights class labels instead of data points, so that each example contributes to the costs of both the positive and the negative label. Although the loss function of our e-SVM algorithm differs from those of WSVMs, it can be solved effortlessly by any standard SVM solver (e.g., LibLinear [10]), as used for WSVMs. This is an advantage because implementing our e-SVM does not require a specialized solver.

3 The expectation loss SVM algorithms

In this section, we first describe the basic formulation of our expectation loss SVM algorithm (Section 3.1) for the case where the positiveness of each segment proposal is observed. Then, in Section 3.2, a latent e-SVM is introduced to handle the weak supervision situation where the positiveness of each segment proposal is not observed.

3.1 The basic e-SVM algorithm

We are given a set of training images D. Using some segmentation method (we adopt CPMC [4] in this work), we generate a set of foreground segment proposals {S1, S2, . . . , SN} from these images.
For each segment Si, we extract a feature vector xi ∈ R^d.
Suppose pixelwise annotations are available for all the groundtruth instances in D. For each object class, we can calculate the IoU ratio ui (ui ∈ [0, 1]) between each segment Si and the groundtruth instance labeling, and set the positiveness of Si to ui (the positiveness could be some function of the IoU ratio, but for simplicity we set it to the IoU itself and use ui to denote the positiveness in the following paragraphs). Because many foreground segments overlap only partially with the groundtruth instances (i.e., 0 < ui < 1), this is not a standard binary classification problem for training. Of course, we could define a threshold τb and treat all segments with ui ≥ τb as positive examples and those with ui < τb as negative examples. In this way, the problem is transformed into a Support Vector Classification (SVC) problem. But this needs heuristics to determine τb, and its performance is only partially satisfactory [19].
To address this issue, we propose our expectation loss SVM model as an extension of classical SVC models. In this model, we treat the label Yi ∈ {−1, +1} of each segment as an unobserved random variable. Given xi, we assume that Yi follows a Bernoulli distribution whose success probability Pr(Yi = 1 | xi) is denoted µi. We assume that µi is a function of the positiveness ui, i.e., µi = g(ui); in the experiments, we simply set µi = ui.
Similar to the traditional linear SVC problem, we adopt a linear prediction function F(xi) = wᵀxi + b. For simplicity, we write [w b] as w and [xi 1] as xi, so that F(xi) = wᵀxi in the remainder of the paper.
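The positiveness ui defined above is simply the IoU between two binary masks. As a concrete illustration (a minimal NumPy sketch, not the authors' code; masks are assumed to be boolean arrays of equal shape):

```python
import numpy as np

def positiveness(segment_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between a segment proposal and a ground-truth instance mask.

    Both arguments are boolean arrays of the same shape; the returned
    value u_i lies in [0, 1] and is used directly as mu_i = g(u_i) = u_i.
    """
    intersection = np.logical_and(segment_mask, gt_mask).sum()
    union = np.logical_or(segment_mask, gt_mask).sum()
    return float(intersection) / union if union > 0 else 0.0
```

A segment covering the top half of a 4x4 image against a ground truth covering the middle two rows shares 4 of 12 pixels, giving u = 1/3.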
The loss function of our e-SVM is the expectation of the hinge loss over the random variables Yi:

L(w) = λw · (1/2)·wᵀw + (1/N) Σ_{i=1..N} E_{Yi}[max(0, 1 − Yi·wᵀxi)]
     = λw · (1/2)·wᵀw + (1/N) Σ_{i=1..N} [ li⁺ · Pr(Yi = +1 | xi) + li⁻ · Pr(Yi = −1 | xi) ]
     = λw · (1/2)·wᵀw + (1/N) Σ_{i=1..N} { li⁺ · g(ui) + li⁻ · [1 − g(ui)] }        (1)

where li⁺ = max(0, 1 − wᵀxi) and li⁻ = max(0, 1 + wᵀxi).
Given the pixelwise groundtruth annotations, g(ui) is known. From Equation 1, we can see that the model is equivalent to "weighting" each sample by a function of its positiveness. A standard linear SVM solver can be used to minimize the loss L(w). In the experiments, we show that the performance of our e-SVM is much better than SVC and slightly better than Support Vector Regression (SVR) on the segment classification task.

3.2 The latent e-SVM algorithm

One of the advantages of our e-SVM model is that it extends easily to the situation where only bounding-box annotations are available (the type of labeling of most interest in this paper). Under this weakly supervised setting, we cannot obtain the exact value of the positiveness (i.e., the IoU) ui for each segment. Instead, ui is treated as a latent variable, determined by minimizing the following loss function:

L(w, u) = λw · (1/2)·wᵀw + (1/N) Σ_{i=1..N} { li⁺ · g(ui) + li⁻ · [1 − g(ui)] } + λR · R(u)        (2)

where u denotes {ui}_{i=1,...,N} and R(u) is a regularization term for u. The loss function in Equation 1 is the special case of Equation 2 obtained by holding u constant and setting λR to 0.
When u is fixed, L(w, u) is a standard linear SVM loss, which is convex with respect to w.
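For fixed u, this weighted hinge objective (Equation 1, which is also the w-step of the latent extension) can be minimized with any standard solver; a self-contained subgradient-descent sketch is shown below (illustrative only, not the paper's LibLinear-based implementation; g(u) = u and the hyperparameters are arbitrary):

```python
import numpy as np

def esvm_loss(w, X, u, lam):
    """e-SVM objective: lam/2 * ||w||^2 + mean of u*hinge(+1) + (1-u)*hinge(-1)."""
    scores = X @ w
    l_pos = np.maximum(0.0, 1.0 - scores)   # loss if the latent label is +1
    l_neg = np.maximum(0.0, 1.0 + scores)   # loss if the latent label is -1
    return 0.5 * lam * (w @ w) + np.mean(u * l_pos + (1.0 - u) * l_neg)

def train_esvm(X, u, lam=0.01, lr=0.05, iters=1000):
    """Minimize the e-SVM loss by subgradient descent (bias folded into X)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        scores = X @ w
        grad = lam * w
        # subgradient of u_i * max(0, 1 - w.x_i): active when w.x_i < 1
        active_pos = (1.0 - scores) > 0
        # subgradient of (1-u_i) * max(0, 1 + w.x_i): active when w.x_i > -1
        active_neg = (1.0 + scores) > 0
        coeff = -u * active_pos + (1.0 - u) * active_neg
        grad = grad + (X * coeff[:, None]).mean(axis=0)
        w = w - lr * grad
    return w
```

Samples with high positiveness are pushed toward positive scores and those with low positiveness toward negative scores, without ever committing to a hard binary label.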
When w is fixed, L(w, u) is also a convex function, provided that R(u) is convex with respect to u. The IoU between a segment Si and the groundtruth bounding boxes, denoted ui^bb, serves as an initialization for ui. We can iteratively fix u and solve for w, then fix w and solve for u, alternating between the two convex optimization problems until convergence. The pseudo-code for the optimization algorithm is shown in Algorithm 1.

Algorithm 1 The optimization algorithm for training latent e-SVM
Initialization:
1: u(cur) ← ubb
Process:
2: repeat
3:   w(new) ← arg min_w L(w, u(cur))
4:   u(new) ← arg min_u L(w(new), u)
5:   u(cur) ← u(new)
6: until converged

If we do not add any regularization term on u (i.e., set λR = 0), each ui is driven to 0 or 1 in the optimization step in line 4 of Algorithm 1, because the loss function is linear with respect to u when w is fixed. The method then becomes similar to a latent SVM and can get stuck in poor local minima, as shown in the experiments. The regularization term prevents this situation, under the assumption that the true value of u should stay close to ubb.
There are many possible designs for the regularization term R(u). In practice, we use the following one, based on the cross entropy between two Bernoulli distributions with success probabilities ui^bb and ui respectively:

R(u) = −(1/N) Σ_{i=1..N} [ ui^bb · log(ui) + (1 − ui^bb) · log(1 − ui) ]
     = (1/N) Σ_{i=1..N} D_KL[ Bern(ui^bb) ‖ Bern(ui) ] + C        (3)

where C is a constant with respect to u and D_KL(·‖·) denotes the KL divergence between two Bernoulli distributions. This regularization term is convex w.r.t. u and achieves its minimum when u = ubb.
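With g(u) = u, the u-step in line 4 of Algorithm 1 decouples across segments: each ui minimizes ui·(li⁺ − li⁻) − λR·[ui^bb·log(ui) + (1 − ui^bb)·log(1 − ui)] over (0, 1), up to the shared 1/N factor. An illustrative sketch that solves each one-dimensional problem on a grid rather than in closed form (the grid size and λR values are arbitrary):

```python
import numpy as np

def update_u(l_pos, l_neg, u_bb, lam_r, grid_size=1001):
    """Per-segment minimization over u in (0, 1) for fixed w.

    l_pos, l_neg : arrays of hinge losses l_i^+ and l_i^- at the current w.
    u_bb         : IoU w.r.t. the ground-truth bounding boxes (initial u).
    lam_r        : weight of the cross-entropy regularizer R(u).
    """
    # avoid log(0) at the grid endpoints
    u = np.linspace(1e-6, 1.0 - 1e-6, grid_size)
    objective = (
        np.outer(l_pos - l_neg, u)                 # hinge part, linear in u
        - lam_r * (np.outer(u_bb, np.log(u))
                   + np.outer(1.0 - u_bb, np.log(1.0 - u)))
    )
    return u[np.argmin(objective, axis=1)]
```

With lam_r = 0 the minimizer snaps to 0 or 1 (the latent-SVM behaviour the text warns about); with lam_r > 0 it stays anchored near u_bb while still moving in the direction the current classifier suggests.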
It is a strong regularization term, since its value increases very quickly when u ≠ ubb.

4 Visual Tasks

4.1 Semantic segmentation

We can easily apply our e-SVM to the semantic segmentation task within the framework proposed by Carreira et al. [5]. Firstly, CPMC segment proposals [4] are generated and second-order pooling features [5] are extracted from each segment. Then we train the class confidence scoring models using either e-SVM or latent e-SVM, according to whether pixel-level annotation is available. In the testing stage, the CPMC segments are sorted by their confidence scores, and the top ones are selected to produce the semantic label map.

4.2 Object detection

For the task of object detection, we can only acquire bounding-box annotations instead of pixel-level labeling. Therefore, it is natural to apply our latent e-SVM in this task to provide segment-based appearance cues for object detection.
In the state-of-the-art object detection systems [11, 13, 25, 14], window candidates for foreground objects are extracted from images and confidence scores are predicted for them. Window candidates are extracted either by sliding-window approaches (used in, e.g., the deformable part-based model [11, 13]) or, most recently, by the selective search approach [25] (used in, e.g., the RCNN framework [14]), which reduces the number of window candidates compared to traditional sliding-window approaches.
It is not easy to directly incorporate confidence scores of the segments into these object detection systems based on window candidates. The difficulty lies in two aspects. First, only some of the segments are totally inside or totally outside a window candidate; it is hard to calculate the contribution of the confidence score of a segment that only partially overlaps with a window candidate.
Second, the window candidates (even the groundtruth bounding boxes) may contain some background regions. Some regions (e.g., regions near the boundary of a window candidate) have a higher probability of being background than regions in the center, and treating them all equally may harm the accuracy of the whole detection system.
In order to solve these issues, we propose a new segment confidence map feature (SCMF) for each candidate window. Given an image and a set of window candidates, we first calculate the

Figure 2: Illustration of generating the segment confidence map feature for window candidates based on learned e-SVM models. The confidence scores of the segments are mapped to pixels to generate a pixel-level confidence map. We divide a window candidate into m × m spatial bins and pool the confidence scores of the pixels in each bin. This leads to an m × m dimensional vector for our SCMF.
After that, we encode it by additive kernels approximation mapping [26] and obtain\nthe \ufb01nal feature representation of candidate windows. The feature generating process is illustrated\nin Figure 2. In the testing stage, we can concatenate this SCMF with the features from other object\ndetection systems.\n\n5 Experiments\n\nIn this section, we \ufb01rst evaluate the performance of e-SVM on the segment proposal\u2019s class con-\n\ufb01dence prediction problem, by using two new evaluation criterions for this task. After that, we\napply our method to two essential tasks in computer vision: semantic segmentation and object de-\ntection. For semantic segmentation task, we test the proposed e-SVM and latent e-SVM on two\ndifferent data annotation scenarios (i.e., with pixel-level groundtruth label annotation and with only\nbounding-box object annotation) respectively. For object detection task, we combine our SCMF\nwith the state-of-the-art object detection system, and show it can obtain non-trivial improvement on\ndetection performance.\n\n5.1 Performance evaluation\n\nWe use PASCAL VOC 2011 [9] segmentation dataset in this experiment. It is a subset of the whole\nPASCAL 2011 dataset with 1112 images in the training set and 1111 images in the validation set,\nand has 20 foreground object classes in total. We use the of\ufb01cial training set and validation set for\ntraining and testing respectively. Similar to [5], we extract 150 CPMC [4] segment proposals for\neach image and compute the second-order pooling features on each segment.\n\n5.1.1 Evaluation criteria\n\nIn literature [5], the supervised learning framework of segment-based prediction model either re-\ngressed the overlapping value or converted it to a binary classi\ufb01cation problem via a threshold value,\nand evaluate the performance by certain task-speci\ufb01c criterion (i.e., the pixel-wise accuracy used for\nsemantic segmentation). 
In this paper, we adopt direct performance evaluation criteria for class confidence prediction of segment proposals, which are consistent with the learning problem itself and not biased toward particular tasks. Unfortunately, we have not found prior work on this sort of direct performance evaluation, and thus introduce two new evaluation criteria for this purpose. We briefly describe them as follows.
Mean precision-recall volume
Although the ground-truth target value (i.e., the overlap rate of segment and bounding box) is a real value in the range [0, 1], we can transform the original class confidence prediction problem into a series of binary classification problems, each of which corresponds to a threshold value for binarizing the groundtruth overlap rate of segments. We then calculate the Precision-Recall (PR) curve for each of these binary classification problems, which together form a PR surface over the different threshold values. We compute the volume under this PR surface as in [15], and use the mean PR volume (mPRV) over all classes as the performance metric for the segment-based class confidence prediction problem.
Normalized discounted cumulative gain [16]
Considering that a higher confidence value is expected for a segment with a higher overlap rate, this prediction problem can be treated as a ranking problem. We therefore use the normalized discounted cumulative gain (NDCG) [16], a common performance measure for ranking problems, as another evaluation criterion in this paper.

Figure 3: Comparison of class confidence prediction results of e-SVM, SVR, LR and SVCs (using pixel-level groundtruth annotation). (a) mPRV, (b) NDCG.
Best viewed in color.

            e-SVM   SVR    LR     SVM (0.0)  SVM (0.2)  SVM (0.4)  SVM (0.6)  SVM (0.8)
mPRV (%)    36.8    35.1   33.4   25.8       34.3       35.9       33.4       27.7
NDCG (%)    87.6    86.5   86.6   83.4       86.9       86.8       85.4       83.1

Table 1: Results on class confidence prediction of segment proposals (using pixel-level groundtruth annotation). The number in brackets is the threshold value of the overlap rate used for training the SVC.

            Le-SVM  SVR    LR     SVM (0.0)  SVM (0.2)  SVM (0.4)  SVM (0.6)  SVM (0.8)
mPRV (%)    31.9    29.9   29.0   24.4       30.7       30.4       24.8       14.7
NDCG (%)    85.8    84.7   84.8   82.6       85.5       84.9       82.2       76.6

Table 2: Results on class confidence prediction of segment proposals (using object bounding-box annotation). The number in brackets is the threshold value of the overlap rate used for training the SVC. "Le-SVM" refers to the latent e-SVM algorithm.

5.1.2 Experimental results and comparison to other methods

Based on the mPRV and NDCG criteria introduced above, we evaluate the performance of our e-SVM algorithm on the PASCAL VOC 2011 segmentation dataset, and compare it with three classic methods from the literature (i.e., SVC, SVR and LR). Note that we test the SVCs' performance on a variety of binary classification problems, each trained using a different threshold value of the overlap rate (0.0, 0.2, 0.4, 0.6 and 0.8, as shown in Figure 3) to obtain positive and negative examples. In Figure 3 (a) and (b), we show the mPRV and NDCG results for e-SVM, SVR, LR and SVCs respectively, evaluated over different values of λw. In addition, we compare their results¹ trained with pixel-wise ground truth and with weakly-labelled bounding-box annotation (the latent e-SVM is used in the latter case) in Tables 1 and 2 respectively.
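The two criteria above can be sketched concretely as follows (an illustrative implementation of the definitions, not the authors' evaluation code; average precision stands in for the area under each PR curve, the threshold grid is arbitrary, and a linear gain is assumed for NDCG):

```python
import numpy as np

def average_precision(scores, labels):
    """Area under the PR curve (AP) for binary labels, by the rank-based formula."""
    order = np.argsort(-scores)
    labels = labels[order]
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1)
    return (precision_at_k * labels).sum() / max(labels.sum(), 1)

def pr_volume(scores, u, taus=np.arange(0.05, 1.0, 0.05)):
    """Volume under the PR surface: binarize u at each threshold, average the APs."""
    return float(np.mean([average_precision(scores, (u >= t).astype(float))
                          for t in taus]))

def ndcg(scores, u):
    """NDCG with graded relevance u: higher-overlap segments should rank first."""
    order = np.argsort(-scores)
    discounts = 1.0 / np.log2(np.arange(len(u)) + 2)
    dcg = (u[order] * discounts).sum()
    idcg = (np.sort(u)[::-1] * discounts).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0
```

Averaging pr_volume over the 20 object classes gives the mPRV metric used in the tables.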
Our e-SVM obtains consistently better mPRV and NDCG than the other methods under both annotation types.

5.2 Semantic segmentation results

For the semantic segmentation task, we test our method on the PASCAL VOC 2011 dataset, using the training set for training and the validation set for testing. Following the framework proposed by [5], we use the sequential pasting inference approach in the testing stage. The per-class accuracies w.r.t. the groundtruth pixel-level semantic label map and the object bounding-box annotation are 36.8% and 27.7% respectively, which are comparable to those of the state-of-the-art class confidence scoring model learning algorithm (i.e., SVR) [5] used in the semantic segmentation literature.

¹We report the best performance w.r.t. different λw values for each method in Tables 1 and 2.

            plane  bike   bird   boat   bottle  bus    car    cat    chair  cow
RCNN        64.1   69.2   50.4   41.2   33.2    62.8   70.5   61.8   32.4   58.4
Ours        63.7   70.2   51.9   42.5   33.4    63.2   71.3   62.0   34.7   58.7
Gain        -0.4   1.0    1.5    1.3    0.2     0.4    0.8    0.2    2.3    0.2
RCNN (bb)   68.1   72.8   56.8   43.0   36.8    66.3   74.2   67.6   34.4   63.5
Ours (bb)   70.4   74.2   59.1   44.7   38.0    67.2   74.6   69.0   36.7   64.3
Gain (bb)   2.3    1.4    2.3    1.6    1.2     1.0    0.3    1.3    2.3    0.8

            table  dog    horse  motor. person plant  sheep  sofa   train  tv     Average
RCNN        45.8   55.8   61.0   66.8   53.9   30.9   53.3   49.2   56.9   64.1   54.1
Ours        47.8   57.9   61.2   67.5   54.9   34.5   55.8   51.0   58.4   65.0   55.3
Gain        2.0    2.1    0.3    0.8    1.0    3.7    2.5    1.8    1.6    0.9    1.2
RCNN (bb)   54.5   61.2   69.1   68.6   58.7   33.4   62.9   51.1   62.5   64.8   58.5
Ours (bb)   56.4   62.9   69.3   69.9   59.6   35.6   64.6   53.2   64.3   65.5   60.0
Gain (bb)   1.9    1.8    0.2    1.4    0.9    2.2    1.7    2.1    1.8    0.7    1.5

Table 3: Object detection results on the PASCAL VOC 2007 dataset.
"bb" denotes results after applying bounding-box regression. Gain is the improvement in AP of our system over RCNN under the same setting (with or without bounding-box regression). The better result in each comparison is shown in bold.

5.3 Object detection results

As mentioned in Section 4.2, another application of our e-SVM is the object detection task. Most recently, Girshick et al. [14] presented the Regions with CNN features (RCNN) method, using a Convolutional Neural Network pre-trained on the ImageNet dataset [6] and fine-tuned on the PASCAL VOC datasets. They achieved a significant improvement over the previous state-of-the-art algorithms (e.g., the Deformable Part-based Model (DPM) [11]) and pushed detection performance to a very high level (the mAP is 58.5 with bounding-box regression on PASCAL VOC 2007).
One question arises: can we further improve their performance? The answer is yes. In our method, we first learn the latent e-SVM models based on the object bounding-box annotation and calculate the spatial confidence map features as in Section 4.2. Then we simply concatenate them with the RCNN features to train object classifiers on candidate windows. We use the PASCAL VOC 2007 dataset in this experiment. As shown in Table 3, our method improves mAP by 1.2 before applying bounding-box regression. For some categories on which the original RCNN does not perform well, such as potted plant, the gain in AP is up to 3.7. After applying bounding-box regression for both RCNN and our algorithm, the average gain is 1.5.
In the experiment, we set m = 5 and adopt average pooling of the pixel-level confidence scores within each spatial bin. We also modified the bounding-box regression method used in [14] by augmenting the fifth-layer features with the additive kernels approximation method [26], which leads to slightly improved performance.
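The additive kernels approximation [26] used above can be illustrated with the homogeneous kernel map for the χ² kernel k(x, y) = 2xy/(x + y) (a sketch of the Vedaldi-Zisserman construction; the sampling step L and order n below are arbitrary illustrative choices, not the paper's settings):

```python
import numpy as np

def chi2_feature_map(x, n=3, L=0.5):
    """Approximate explicit feature map for the chi-squared kernel
    k(x, y) = 2xy / (x + y), via the homogeneous kernel map [26]:
    the kernel signature has spectrum sech(pi * omega), sampled at
    frequencies j * L for j = 0..n.

    x : positive scalar (one histogram bin); returns a (2n+1,)-dim
    vector psi such that psi(x) . psi(y) ~= k(x, y).
    """
    sech = lambda t: 1.0 / np.cosh(t)
    psi = [np.sqrt(x * L * sech(0.0))]
    for j in range(1, n + 1):
        scale = np.sqrt(2.0 * x * L * sech(np.pi * j * L))
        psi.append(scale * np.cos(j * L * np.log(x)))
        psi.append(scale * np.sin(j * L * np.log(x)))
    return np.array(psi)
```

Applying the map coordinate-wise to a feature vector turns an additive-kernel SVM into a linear one, which is why it can be concatenated directly with linear RCNN features.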
In summary, we achieve an average AP of 60.0, which is 1.5 higher than the best previously reported result on this dataset (the original RCNN with bounding-box regression).

6 Conclusion

We present a novel learning algorithm called e-SVM that handles the situation in which the labels of training data are continuous values whose range is a bounded interval. It can be applied to the segment proposal class confidence prediction problem and can easily be extended to learn class confidence scoring models under weak supervision (e.g., when only bounding-box annotations are available). We apply this method to two major tasks of computer vision (i.e., semantic segmentation and object detection), and obtain state-of-the-art object detection performance on the PASCAL VOC 2007 dataset. We believe that, with the ever-growing size of datasets, learning with weak supervision is increasingly important for reducing the amount of labeling overhead required.
Acknowledgements. We gratefully acknowledge funding support from the National Science Foundation (NSF) with award CCF-1317376, and from the National Institutes of Health (NIH) grant 5R01EY022247-03. We also thank NVIDIA Corporation for providing GPUs for our experiments.

References
[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. TPAMI, 34(11):2274–2282, 2012.
[2] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems 15, pages 561–568. MIT Press, 2003.
[3] P. Arbelaez, B. Hariharan, C. Gu, S. Gupta, and J. Malik. Semantic segmentation using regions and parts. In CVPR, 2012.
[4] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. TPAMI, 34(7):1312–1328, 2012.
[5] J. Carreira, R. Caseiro, J. Batista, and C.
Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, pages 430–443, 2012.\n[6] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012). http://www.image-net.org/challenges/LSVRC/2012/index.\n[7] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell., 89(1-2):31–71, Jan. 1997.\n[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.\n[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.\n[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.\n[11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627–1645, 2010.\n[12] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167–181, Sept. 2004.\n[13] S. Fidler, R. Mottaghi, A. L. Yuille, and R. Urtasun. Bottom-up segmentation for top-down detection. In CVPR, pages 3294–3301, 2013.\n[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.\n[15] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.\n[16] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. 
TOIS, 20(4):422–446, 2002.\n[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.\n[18] M. Lapin, M. Hein, and B. Schiele. Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53:95–108, 2014.\n[19] F. Li, J. Carreira, and C. Sminchisescu. Object recognition as ranking holistic figure-ground hypotheses. In CVPR, pages 1712–1719, 2010.\n[20] Y. Liu, J. Liu, Z. Li, J. Tang, and H. Lu. Weakly-supervised dual clustering for image semantic segmentation. In CVPR, pages 2075–2082, 2013.\n[21] A. Müller and S. Behnke. Multi-instance methods for partially supervised image segmentation. In PSL, pages 110–119, 2012.\n[22] X. Ren, L. Bo, and D. Fox. RGB-(D) scene labeling: Features and algorithms. In CVPR, June 2012.\n[23] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In CVPR, pages 1–8, 2008.\n[24] J. Suykens, J. De Brabanter, L. Lukas, and J. Vandewalle. Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing, 48:85–105, 2002.\n[25] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 104(2):154–171, 2013.\n[26] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. TPAMI, 34(3):480–492, 2012.\n[27] A. Vezhnevets, V. Ferrari, and J. Buhmann. Weakly supervised semantic segmentation with a multi-image model. In ICCV, 2011.\n[28] X. Yang, Q. Song, and A. Cao. Weighted support vector machine for data classification. 
In IJCNN, 2005.", "award": [], "sourceid": 661, "authors": [{"given_name": "Jun", "family_name": "Zhu", "institution": "University of California, Los Angeles"}, {"given_name": "Junhua", "family_name": "Mao", "institution": "University of California, Los Angeles"}, {"given_name": "Alan", "family_name": "Yuille", "institution": "UCLA"}]}