{"title": "Object Localization based on Structural SVM using Privileged Information", "book": "Advances in Neural Information Processing Systems", "page_first": 208, "page_last": 216, "abstract": "We propose a structured prediction algorithm for object localization based on Support Vector Machines (SVMs) using privileged information. Privileged information provides useful high-level knowledge for image understanding and facilitates learning a reliable model even with a small number of training examples. In our setting, we assume that such information is available only at training time, since it may be difficult to obtain from visual data accurately without human supervision. Our goal is to improve performance by incorporating privileged information into an ordinary learning framework and adjusting model parameters for better generalization. We tackle the object localization problem based on a novel structural SVM using privileged information, where an alternating loss-augmented inference procedure is employed to handle the term in the objective function corresponding to privileged information. We apply the proposed algorithm to the Caltech-UCSD Birds 200-2011 dataset, and obtain encouraging results suggesting further investigation into the benefit of privileged information in structured prediction.", "full_text": "Object Localization based on Structural SVM using Privileged Information\n\nJan Feyereisl, Suha Kwak∗, Jeany Son, Bohyung Han\n\nDept. of Computer Science and Engineering, POSTECH, Pohang, Korea\n\nthefillm@gmail.com, {mercury3,jeany,bhhan}@postech.ac.kr\n\nAbstract\n\nWe propose a structured prediction algorithm for object localization based on Support Vector Machines (SVMs) using privileged information. Privileged information provides useful high-level knowledge for image understanding and facilitates learning a reliable model even with a small number of training examples. 
In our setting, we assume that such information is available only at training time, since it may be difficult to obtain from visual data accurately without human supervision. Our goal is to improve performance by incorporating privileged information into an ordinary learning framework and adjusting model parameters for better generalization. We tackle the object localization problem based on a novel structural SVM using privileged information, where an alternating loss-augmented inference procedure is employed to handle the term in the objective function corresponding to privileged information. We apply the proposed algorithm to the Caltech-UCSD Birds 200-2011 dataset, and obtain encouraging results suggesting further investigation into the benefit of privileged information in structured prediction.\n\n1 Introduction\n\nObject localization is often formulated as a binary classification problem, where a learned classifier determines the existence or absence of a target object within a candidate window at every location, size, and aspect ratio. Recently, a structured prediction technique using Support Vector Machines (SVMs) has been applied to this problem [1], where the optimal bounding box containing the target object is obtained by a trained classifier. This approach provides a unified framework for detection and post-processing (non-maximum suppression), and naturally handles objects with variable aspect ratios. However, object localization is an inherently difficult task due to the large amount of variation in objects and scenes, e.g., shape deformations, color variations, pose changes, occlusion, viewpoint changes, background clutter, etc. This issue is aggravated when the size of the training dataset is small.\n\nA more reliable model can be learned even with fewer training examples if additional high-level knowledge about an object of interest is available during training. 
Such high-level knowledge is called privileged information, which typically describes useful semantic properties of an object, such as parts, attributes, and segmentations. This idea corresponds to the Learning Using Privileged Information (LUPI) paradigm [3], which exploits the additional information to improve predictive models in training but does not require the information for prediction. The LUPI framework has been incorporated into the SVM in the form of the SVM+ algorithm [4]. However, the applications of SVM+ are often limited to binary classification problems [3, 4].\n\nWe propose a novel Structural SVM using privileged information (SSVM+) framework, shown in Figure 1, and apply the algorithm to the problem of object localization. In this formulation, privileged information, e.g., parts, attributes, and segmentations, is incorporated to learn a structured prediction function for object localization.\n\n∗Current affiliation: INRIA–WILLOW Project, Paris, France; e-mail: suha.kwak@inria.fr\n\nFigure 1: Overview of our object localization framework using privileged information. Unlike visual observations, privileged information is available only during training. We use attributes and segmentation masks of an object as privileged information to improve generalization of the trained model. To incorporate privileged information during training, we propose an extension of SSVM, called SSVM+, whose loss-augmented inference is performed by alternating Efficient Subwindow Search (ESS) [2].\n\nNote that high-level information is available only for training, not for testing, in this framework. Our algorithm employs an efficient branch-and-bound loss-augmented subwindow search procedure to perform the inference by a joint optimization in the original and privileged spaces during training. Since the additional information is not used in testing, the inference in the testing phase is the same as in the standard Structural SVM (SSVM) case. 
We evaluate our method by learning to localize birds in the Caltech-UCSD Birds 200-2011 (CUB-2011) dataset [5], exploiting attributes and segmentation masks as privileged information in addition to standard visual features. The main contributions of our work are as follows:\n\n• We introduce a novel framework for object localization exploiting privileged information that is neither required nor needs to be inferred at test time.\n\n• We formulate an SSVM+ framework, where an alternating loss-augmented inference procedure for efficient subwindow search is incorporated to handle the privileged information together with the conventional visual features.\n\n• Performance gains in localization and classification are achieved, especially with small training datasets.\n\nMethods that exploit additional information to improve models for image classification or search have been discussed in the context of transfer learning [6, 7], learning with side information [8, 9, 10], and domain adaptation [11], where the underlying techniques rely on pair-wise constraints [8], multiple kernels [9], or metric learning [9]. Zero-shot learning is an extreme framework, where models for unseen classes are constructed even without training data [12, 13]. Recent works often rely on natural language processing techniques to handle pure textual descriptions [14, 15].\n\nStandard learning algorithms require a large amount of data to construct a robust model, while zero-shot learning does not need any training examples. The LUPI framework lies between traditional data-driven learning and zero-shot learning, since it aims to learn a good model with a small number of training examples by taking advantage of privileged information available at training time. Privileged information has been considered in face recognition [16], facial feature detection [17], and event recognition [18], but such works are still uncommon. 
Our work applies the LUPI framework to an object localization problem based on SSVM. The use of SSVMs for object localization was originally investigated by [1]. More recently, [19, 20] employed SSVM as part of their localization procedures; however, none of them incorporates privileged information or a similar idea. Recently, [21] presented the potential benefit of SVM+ in an object recognition task.\n\nThe rest of this paper is organized as follows. We first review the LUPI framework and SSVM in Section 2, and our SSVM+ formulation for object localization is presented in Section 3. The performance of our object localization algorithm is evaluated in Section 4.\n\n2 Background\n\n2.1 Learning Using Privileged Information\n\nThe LUPI paradigm [3, 4, 22, 23] is a framework for incorporating additional information during training that is not available at test time. The inclusion of such information is exploited to find a better model, which yields lower generalization error. Contrary to classical supervised learning, where pairs of data (x_1, y_1), . . . , (x_n, y_n), x_i ∈ X, y_i ∈ {−1, 1}, are provided, in the LUPI paradigm additional information x^* ∈ X^* is provided with each training example as well, i.e., (x_1, x_1^*, y_1), . . . , (x_n, x_n^*, y_n), x_i ∈ X, x_i^* ∈ X^*, y_i ∈ {−1, 1}. This information is, however, not required during testing. In both learning paradigms, the task is then to find, among a collection of functions, the one that best approximates the underlying decision function from the given data.\n\nSpecifically, we formulate object localization within the LUPI framework as learning a pair of functions h : X ↦ Y and φ : X^* ↦ Y jointly, where only h is used for prediction. These functions, for example, map the space of images and attributes to the space of bounding box coordinates Y. The decision function h and the correcting function φ depend on each other through the following relation:\n\nℓ_X(h(x_i), y_i) ≤ ℓ_X^*(φ(x_i^*), y_i), ∀ 1 ≤ i ≤ n, (1)\n\nwhere ℓ_X and ℓ_X^* denote the empirical loss functions on the visual space (X) and the privileged space (X^*), respectively. This inequality is inspired by the LUPI paradigm [3, 4, 22, 23], where for all training examples the model h is always corrected to have a smaller loss on data than the model φ has on privileged information. The constraint in Eq. (1) is meaningful when we assume that, for the same number of training examples, the combination of visual and privileged information provides a space in which a better model can be learned than with visual information alone.\n\nTo translate this general learning idea into practice, the SVM+ algorithm for binary classification has been developed [3, 4, 22]. 
The SVM+ algorithm replaces the slack variables ξ_i in the standard SVM formulation by a correcting function ξ_i = ⟨w^*, x_i^*⟩ + b^*, which estimates their values from the privileged information. This results in the following formulation:\n\nmin_{w, w^*, b, b^*} (1/2)‖w‖² + (γ/2)‖w^*‖² + (C/n) Σ_{i=1}^{n} (⟨w^*, x_i^*⟩ + b^*), s.t. y_i(⟨w, x_i⟩ + b) ≥ 1 − (⟨w^*, x_i^*⟩ + b^*), ⟨w^*, x_i^*⟩ + b^* ≥ 0, ∀ 1 ≤ i ≤ n, (2)\n\nwhere each slack is ξ_i = ⟨w^*, x_i^*⟩ + b^*, and the terms w^*, x^*, and b^* play the same roles as w, x, and b in the classical SVM, but within the new correcting space X^*. Furthermore, γ denotes a regularization parameter for w^*. It is important to observe that the weight vector w depends not only on x but also on x^*. For this reason, the function that replaces the slack ξ is called the correcting function. As privileged information is only used to estimate the values of the slacks, it is required only during training, not during testing. Theoretical analysis [4] shows that the bound on the convergence rate of the above SVM+ algorithm can substantially improve upon that of the standard SVM if suitable privileged information is used.\n\n2.2 Structural SVM (SSVM)\n\nSSVMs discriminatively learn a weight vector w for a scoring function f : X × Y ↦ R over the set of training input/output pairs. 
Once learned, the prediction function h is obtained by maximizing f over all possible y ∈ Y as follows:\n\nŷ = h(x) = argmax_{y ∈ Y} f(x, y) = argmax_{y ∈ Y} ⟨w, Ψ(x, y)⟩, (3)\n\nwhere Ψ : X × Y → R^d is the joint feature map that models the relationship between input x and structured output y. To learn the weight vector w, the following optimization problem (margin-rescaling) then needs to be solved:\n\nmin_{w, ξ} (1/2)‖w‖² + (C/n) Σ_{i=1}^{n} ξ_i, s.t. ⟨w, δΨ_i(y)⟩ ≥ Δ(y_i, y) − ξ_i, ∀ 1 ≤ i ≤ n, ∀ y ∈ Y, (4)\n\nwhere δΨ_i(y) ≡ Ψ(x_i, y_i) − Ψ(x_i, y), and Δ(y_i, y) is a task-specific loss that measures the quality of the prediction y with respect to the ground-truth y_i. To obtain a prediction, we need to maximize Eq. (3) over the response variable y for a given input x. SSVMs are a general method for solving a variety of prediction tasks. For each application, the joint feature map Ψ, the loss function Δ, and an efficient loss-augmented inference technique need to be customized.\n\n3 Object Localization with Privileged Information\n\nWe deal with object localization with privileged information: given a set of training images of objects, their locations, and their attribute and segmentation information, we want to learn a function to localize objects of interest in yet unseen images. Unlike existing methods, our learned function does not need explicit or even inferred attribute and segmentation information during prediction.\n\n3.1 Structural SVM with Privileged Information (SSVM+)\n\nWe extend the above structured prediction problem to exploit privileged information. Recollecting Eq. 
(1), to learn the pair of interdependent functions h and φ, we learn to predict a structure y based on a training set of triplets (x_1, x_1^*, y_1), . . . , (x_n, x_n^*, y_n), x_i ∈ X, x_i^* ∈ X^*, y_i ∈ Y, where X corresponds to various visual features, X^* to attributes or segmentations, and Y is the space of all possible bounding boxes. Once learned, only the function h is used for prediction. It is obtained by maximizing the learned function over all possible joint features based on input x ∈ X and output y ∈ Y as in Eq. (3), identically to standard SSVMs.\n\nOn the other hand, to jointly learn h and φ subject to the constraint in Eq. (1), we need to extend the SSVM framework substantially. The functions h and φ are characterized by the parameter vectors w and w^*, respectively, as\n\nh(x) = argmax_{y ∈ Y} ⟨w, Ψ(x, y)⟩ and φ(x^*) = argmax_{y^* ∈ Y} ⟨w^*, Ψ^*(x^*, y^*)⟩. (5)\n\nTo learn the weight vectors w and w^* simultaneously, we propose a novel max-margin structured prediction framework called SSVM+ that incorporates the constraint in Eq. (1) and hence learns the two models jointly as follows:\n\nmin_{w, w^*, ξ} (1/2)‖w‖² + (γ/2)‖w^*‖² + (C/n) Σ_{i=1}^{n} ξ_i, s.t. ⟨w, δΨ_i(y)⟩ + ⟨w^*, δΨ_i^*(y^*)⟩ ≥ Δ̄(y_i, y, y^*) − ξ_i, ∀ 1 ≤ i ≤ n, ∀ y, y^* ∈ Y, (6)\n\nwhere δΨ_i^*(y^*) ≡ Ψ^*(x_i^*, y_i) − Ψ^*(x_i^*, y^*), and the inequality in Eq. (1) is introduced via a surrogate task-specific loss Δ̄ derived from [23]. This surrogate loss is defined as\n\nΔ̄(y_i, y, y^*) = (1/ρ) Δ^*(y_i, y^*) + [Δ(y_i, y) − Δ^*(y_i, y^*)]_+, (7)\n\nwhere [t]_+ = max(t, 0), ρ > 0 is a penalization parameter corresponding to the constraint in Eq. (1), and the task-specific loss functions Δ and Δ^* are defined in Section 3.3. Through this surrogate loss, we can apply the inequality in Eq. (1) within the ordinary max-margin optimization framework. Our framework enforces that the model learned on attributes and segmentations (w^*) always corrects the model trained on visual features (w). This results in a model with better generalization on visual features alone. Similar to SSVMs, we can tractably deal with the exponential number of possible constraints present in our problem via loss-augmented inference and optimization methods such as the cutting plane algorithm [24] or the more recent block-coordinate Frank-Wolfe method [25]. Pseudocode for solving Eq. (6) using the cutting plane method is presented in Algorithm 1.\n\nOur formulation has a general form that follows the SSVM framework. This means that Eq. (6) is independent of the definitions of the joint feature map, the task-specific loss, and the loss-augmented inference. We can therefore apply our method to a variety of other problems in addition to object localization. All that is required is the definition of the three problem-specific components, which are also required in standard SSVMs. As will be shown later, only the loss-augmented inference step becomes harder compared to SSVMs, due to the inclusion of privileged information.\n\nAlgorithm 1 Cutting plane method for solving Eq. (6)\n1: Input: (x_1, x_1^*, y_1), . . . , (x_n, x_n^*, y_n), C, ρ, γ, ε\n2: S_i ← ∅ for all i = 1, . . . , n\n3: repeat\n4: for i = 1, . . . , n do\n5: Set up the surrogate task-specific loss (Eq. (7)): Δ̄(y_i, y, y^*) = (1/ρ) Δ^*(y_i, y^*) + [Δ(y_i, y) − Δ^*(y_i, y^*)]_+\n6: Set up the cost function (Eq. (12)): H(y, y^*) = Δ̄(y_i, y, y^*) − ⟨w, δΨ_i(y)⟩ − ⟨w^*, δΨ_i^*(y^*)⟩\n7: Find the cutting plane: (ŷ, ŷ^*) = argmax_{y, y^* ∈ Y} H(y, y^*)\n8: Find the value of the current slack: ξ_i = max{0, max_{(y, y^*) ∈ S_i} H(y, y^*)}\n9: if H(ŷ, ŷ^*) > ξ_i + ε then\n10: Add the constraint to the working set: S_i ← S_i ∪ {(ŷ, ŷ^*)}\n11: (w, w^*) ← optimize Eq. (6) over ∪_i S_i\n12: end if\n13: end for\n14: until no S_i has changed during the iteration\n\n3.2 Joint Feature Map\n\nOur extended structured output regressor, SSVM+, estimates bounding box coordinates within target images by considering all possible bounding boxes. The structured output space is defined as Y ≡ {(θ, t, l, b, r) | θ ∈ {+1, −1}, (t, l, b, r) ∈ R⁴}, where θ denotes the presence/absence of an object and (t, l, b, r) correspond to the coordinates of the top, left, bottom, and right corners of a bounding box, respectively. To model the relationship between input and output variables, we define a joint feature map that restricts the features in x to the bounding box defined by y. This is modeled as\n\nΨ(x_i, y) = x_i|_y, (8)\n\nwhere x|_y denotes the region of an image inside a bounding box with coordinates y. Identically, for the privileged space, we define another joint feature map, which operates on the space of attributes aided by segmentation information instead of on visual features:\n\nΨ^*(x_i^*, y^*) = x_i^*|_{y^*}. (9)\n\nThe definition of the joint feature map is problem specific, and we follow the method in [1] proposed for object localization. 
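To make the restriction operation x|_y concrete, the following is a minimal sketch of evaluating such a joint feature map for a bag-of-visual-words image representation. It assumes keypoints with precomputed visual-word assignments; the function name and array layout are illustrative, not taken from the paper.

```python
import numpy as np

def joint_feature_map(keypoints, words, box, vocab_size):
    """Bag-of-words histogram restricted to a bounding box y = (t, l, b, r).

    keypoints  : (n, 2) array of (row, col) keypoint locations
    words      : (n,) array of visual-word indices, one per keypoint
    box        : (top, left, bottom, right) coordinates of the box
    vocab_size : size of the visual vocabulary
    """
    t, l, b, r = box
    inside = ((keypoints[:, 0] >= t) & (keypoints[:, 0] <= b) &
              (keypoints[:, 1] >= l) & (keypoints[:, 1] <= r))
    # Histogram over only the words whose keypoints fall inside the box
    return np.bincount(words[inside], minlength=vocab_size)
```

Because the score ⟨w, Ψ(x, y)⟩ then decomposes into a sum of per-keypoint weights inside the box, upper bounds over sets of rectangles, of the kind used by branch-and-bound subwindow search, are easy to compute.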
Implementation details about both joint feature maps are described in Section 4.2.\n\n3.3 Task-Specific Loss\n\nTo measure the level of discrepancy between the predicted output y and the true structured label y_i, we need to define a loss function that accurately measures such a level of disagreement. In our object localization problem, the following task-specific loss, based on the PASCAL VOC overlap ratio [1], is employed in both spaces:\n\nΔ(y_i, y) = 1 − area(y_i ∩ y) / area(y_i ∪ y) if y_{i,θ} = y_θ = 1; otherwise Δ(y_i, y) = 1 − (1/2)(y_{i,θ} y_θ + 1), (10)\n\nwhere y_{i,θ} ∈ {+1, −1} denotes the presence (+1) or absence (−1) of an object in the i-th image. In the case y_{i,θ} = −1, Ψ(x|_y) = 0, where 0 is an all-zero vector. The loss is 0 when the bounding boxes defined by y_i and y are identical, and equal to 1 when they are disjoint or y_{i,θ} ≠ y_θ.\n\n3.4 Loss-Augmented Inference\n\nDue to the exponential number of constraints that arise while learning Eq. (6) and the possibly very large search space Y dealt with during prediction, we require an efficient inference technique, which may differ between training and testing in the SSVM+ framework.\n\n3.4.1 Prediction\n\nThe goal is to find the best bounding box given the learned weight vector w and the visual feature x. Privileged information is not available at testing time, and inference is performed on visual features only. Therefore, the same maximization problem as in standard SSVMs needs to be solved during prediction, which is given by\n\nh(x) = argmax_{y ∈ Y} ⟨w, Ψ(x, y)⟩. (11)\n\nThis maximization problem is over the space of bounding box coordinates. However, it involves a very large search space and therefore cannot be solved exhaustively. 
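For intuition only, the maximization in Eq. (11) over an explicit pool of candidate boxes can be written as a brute-force scan. This hypothetical helper is not the method used in the paper, which avoids the enumeration altogether:

```python
def predict_box(score, candidates):
    """Naive argmax_y score(y) over an explicit list of boxes (t, l, b, r).

    score      : callable returning <w, Psi(x, y)> for a single box
    candidates : iterable of candidate boxes; enumerating every box in an
                 n x n image takes O(n^4) evaluations, which is why a
                 branch-and-bound search is preferred in practice
    """
    return max(candidates, key=score)
```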
In the object localization task, the Efficient Subwindow Search (ESS) algorithm [2] is employed to solve this optimization problem efficiently.\n\n3.4.2 Learning\n\nCompared to the inference problem required during the prediction step shown in Eq. (11), the optimization of our main objective during training involves a more complex inference procedure. We need to perform the following maximization, with the surrogate loss and an additional term corresponding to the privileged space, during an iterative procedure:\n\n(ŷ, ŷ^*) = argmax_{y, y^* ∈ Y} Δ̄(y_i, y, y^*) − ⟨w, δΨ_i(y)⟩ − ⟨w^*, δΨ_i^*(y^*)⟩ = argmax_{y, y^* ∈ Y} Δ̄(y_i, y, y^*) + ⟨w, Ψ(x_i, y)⟩ + ⟨w^*, Ψ^*(x_i^*, y^*)⟩. (12)\n\nNote that ⟨w, Ψ(x_i, y_i)⟩ and ⟨w^*, Ψ^*(x_i^*, y_i)⟩ are constants in Eq. (12) and do not affect the optimization. The problem in Eq. (12), called loss-augmented inference, is required during each iteration of the cutting plane method, which is used for learning the functions h and φ and hence the weight vectors w and w^*.\n\nWe adopt an alternating approach for the inference, where we first solve for y^* in the privileged space given the fixed solution y_c in the original space,\n\nargmax_{y^* ∈ Y} Δ̄(y_i, y_c, y^*) + ⟨w^*, Ψ^*(x_i^*, y^*)⟩, (13)\n\nand subsequently perform optimization in the original space while fixing y_c^*,\n\nargmax_{y ∈ Y} Δ̄(y_i, y, y_c^*) + ⟨w, Ψ(x_i, y)⟩. (14)\n\nThese two sub-procedures in Eq. (13) and (14) are repeated until convergence, and we obtain the final solutions w and w^*. 
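The alternating scheme of Eq. (13) and (14) can be sketched as follows, assuming black-box maximizers for the two sub-problems (in the paper each is solved by ESS; the helper names are illustrative, not from the paper):

```python
def surrogate_loss(delta, delta_star, rho):
    """Surrogate task-specific loss of Eq. (7):
    (1/rho) * Delta* + [Delta - Delta*]_+, where [t]_+ = max(t, 0)."""
    return delta_star / rho + max(delta - delta_star, 0.0)

def alternating_inference(argmax_priv, argmax_orig, y_init, max_iter=50):
    """Alternate Eq. (13) and Eq. (14) until the pair (y, y*) stabilizes.

    argmax_priv : callable y  -> best y* with y fixed   (Eq. (13))
    argmax_orig : callable y* -> best y  with y* fixed  (Eq. (14))
    y_init      : initial solution in the original space
    """
    y = y_init
    y_star = argmax_priv(y)
    for _ in range(max_iter):
        y_next = argmax_orig(y_star)
        y_star_next = argmax_priv(y_next)
        if y_next == y and y_star_next == y_star:
            break  # converged: neither space changes its solution
        y, y_star = y_next, y_star_next
    return y, y_star
```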
In the object localization task, both problems are solved by ESS [2], a branch-and-bound optimization technique, for which it is essential to derive upper bounds of the above objective functions over a set of rectangles from Y. Here we derive the upper bounds of only the surrogate loss terms in Eq. (7); the derivation for the other terms can be found in [2].\n\nWhen the solution in the privileged space is fixed, we need to consider the upper bound of only [Δ − Δ^*]_+ to obtain the upper bound of the surrogate loss. Since [Δ − Δ^*]_+ is a monotonically increasing function of Δ, its upper bound is derived directly from the upper bound of Δ. Specifically, the upper bound of Δ is given by\n\nΔ = 1 − area(y_i ∩ y) / area(y_i ∪ y) ≤ 1 − min_{y ∈ Y} area(y_i ∩ y) / max_{y ∈ Y} area(y_i ∪ y), (15)\n\nand the upper bound of the surrogate loss with a fixed Δ^* is given by\n\n[Δ − Δ^*]_+ ≤ [1 − min_{y ∈ Y} area(y_i ∩ y) / max_{y ∈ Y} area(y_i ∪ y) − Δ^*]_+. (16)\n\nWhen the original space is fixed, the problem is not straightforward, since the surrogate loss becomes a V-shaped function of Δ^* when ρ > 1. In this case, we need to check the outputs of the function at both the upper and lower bounds of Δ^*. The upper bound of Δ^* is derived identically to that of Δ, and the lower bound of Δ^* is given by\n\nΔ^* = 1 − area(y_i ∩ y^*) / area(y_i ∪ y^*) ≥ 1 − max_{y^* ∈ Y} area(y_i ∩ y^*) / min_{y^* ∈ Y} area(y_i ∪ y^*). (17)\n\nLet Δ_u^* and Δ_l^* be the upper and lower bounds of Δ^*, respectively. Then the upper bound of the surrogate loss with a fixed Δ is given by\n\n(1/ρ) Δ^* + [Δ − Δ^*]_+ ≤ max( (1/ρ) Δ_u^* + [Δ − Δ_u^*]_+ , (1/ρ) Δ_l^* + [Δ − Δ_l^*]_+ ). (18)\n\nBy identifying the bounds of the surrogate loss as in Eq. (16) and (18), we can optimize the objective function in Eq. (12) through the alternating procedure based on the standard ESS algorithm.\n\n4 Experiments\n\n4.1 Dataset\n\nEmpirical evaluation of our method is performed on the Caltech-UCSD Birds 2011 (CUB-2011) [5] fine-grained categorization dataset. It contains 200 categories of different species of birds. The location of each bird is specified using a bounding box. In addition, a large collection of privileged information is provided in the form of 15 different part annotations, 312 attributes, and segmentation masks, manually labeled in each image by human annotators. Each category contains 30 training images and around 30 testing images.\n\n4.2 Visual and Privileged Feature Extraction\n\nOur feature descriptor in the visual space adopts the bag-of-visual-words model based on Speeded Up Robust Features (SURF) [26], which is almost identical to [2]. The dimensionality of the visual feature descriptors is 3,000. We additionally employ attributes and segmentation masks as privileged information. 
The information about attributes is described by a 312-dimensional vector, where each element corresponds to one attribute and takes a binary value depending on its visibility and relevance. We use the segmentation information to inpaint segmentation masks into each image, which results in an image containing the original background pixels with uniform foreground pixels. Subsequently, we extract a 3,000-dimensional feature descriptor based on the same bag-of-visual-words model as in the visual space. The intuition behind this approach is to generate a set of features that provide a guaranteed strong response in the foreground region. This response is stronger than in the original space, hence allowing for easier localization in the privileged space. For each sub-window, we create a histogram based on the presence of attributes and the frequency of the privileged codewords corresponding to the augmented visual space.\n\n4.3 Evaluation\n\nTo evaluate our SSVM+ algorithm, we compare it against the original SSVM localization method by Blaschko and Lampert [1] in several training scenarios. In all experiments, we tune the hyperparameters C, γ, and ρ on a 4 × 4 × 4 grid spanning the values [2^{-8}, ..., 2^{5}]. For SSVM, only the one dimension of the search space corresponding to the parameter C is searched.\n\nWe first investigate the influence of small training sample sizes on localization performance. For this setting, we loosely adopt the experimental setup of [27]. For training, we focus on 14 bird categories corresponding to 2 major bird groups. We train four different models, each trained on a distinctive number of training images, namely n_c = {1, 5, 10, 20} images per class, resulting in n = {14, 70, 140, 280} training images, respectively. Additionally, we train a model on n = 1000 images, corresponding to 100 bird classes, each with 10 training images. 
As a validation set, we use 500 training images chosen at random from categories other than those used for training. For testing, we use all testing images of the entire CUB-2011 dataset. Table 1 presents the results of this experiment. In all cases, our method outperforms the SSVM method in both average overlap and average detection (PASCAL VOC overlap ratio > 50%). This implies that, for the same number of training examples, our method consistently converges to a model with better generalization performance than SSVM. A previously observed trend [4, 23] of a decreasing benefit of privileged information with increasing training set sizes is also apparent here.\n\nTable 1: Comparison between our SSVM+ and the standard SSVM [1], varying the number of classes and training images.\n\n(A) Overlap (%). # training images: 14 / 70 / 140 / 280 / 1000; SSVM [1]: 38.2 / 43.8 / 42.3 / 44.9 / 48.1; SSVM+: 41.3 / 45.7 / 45.8 / 46.9 / 49.0; Diff.: +3.1 / +1.9 / +3.5 / +2.0 / +0.9\n\n(B) Detection (%). # training images: 14 / 70 / 140 / 280 / 1000; SSVM [1]: 25.9 / 37.3 / 34.3 / 39.8 / 46.2; SSVM+: 32.6 / 42.4 / 41.5 / 43.3 / 48.1; Diff.: +6.7 / +5.1 / +7.2 / +3.5 / +1.9\n\nFigure 2: Comparison of average overlap (A) and detection (B) results between our structured learning with privileged information (SSVM+) and the standard structured learning (SSVM) on 100 classes of the CUB-2011 dataset. The bird classes along the x-axis are sorted by the differences between the two methods, shown in the black area, in non-increasing order.\n\nTo evaluate the benefit of SSVM+ in more depth, we illustrate the average overlap and detection performance on all 100 classes in Figure 2, where 10 images per class are used for training with 14 classes (n = 140). In most bird classes, SSVM+ shows relatively better performance in both overlap ratio and detection rate. Note that each class typically has 30 testing images, but some classes have as few as 18 images. 
The average overlap ratio is 45.8%, and the average number of detections per class is 12.1 (41.5%).

5 Discussion

We presented a structured prediction algorithm for object localization based on SSVM with privileged information. Our algorithm is the first method to incorporate privileged information within a structured prediction framework, and it allows various types of additional information to be used during training to improve generalization performance at testing time. We applied our method to an object localization problem, which is solved by a novel structural SVM formulation using privileged information, where an alternating loss-augmented inference procedure handles the term in the objective function corresponding to privileged information. On the Caltech-UCSD Birds 200-2011 dataset, the proposed algorithm obtained encouraging results, suggesting the potential benefit of exploiting additional information that is available during training only.
Unfortunately, the benefit of privileged information tends to diminish as the number of training examples increases; our SSVM+ framework would therefore be particularly useful when only a few training examples are available or when annotation cost is very high.

Acknowledgement

This work was supported partly by the ICT R&D program of MSIP/IITP [14-824-09-006; 14-824-09-014] and the IT R&D Program of MKE/KEIT (10040246).

References

[1] Matthew B. Blaschko and Christoph H. Lampert. Learning to localize objects with structured output regression. In ECCV, pages 2-15, 2008.
[2] Christoph H. Lampert, Matthew B. Blaschko, and Thomas Hofmann. Efficient subwindow search: A branch and bound framework for object localization. TPAMI, 31(12):2129-2142, 2009.
[3] Vladimir Vapnik, Akshay Vashist, and Natalya Pavlovitch. Learning using hidden information: Masterclass learning. In NATO Workshop on Mining Massive Data Sets for Security, pages 3-14, 2008.
[4] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5-6):544-557, 2009.
[5] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, California Institute of Technology, 2011.
[6] Lixin Duan, Dong Xu, Ivor W. Tsang, and Jiebo Luo. Visual event recognition in videos by learning from web data. TPAMI, 34(9):1667-1680, 2012.
[7] Lixin Duan, Ivor W. Tsang, and Dong Xu. Domain transfer multiple kernel learning. TPAMI, 34(3):465-479, 2012.
[8] Qiang Chen, Zheng Song, Yang Hua, Zhongyang Huang, and Shuicheng Yan. Hierarchical matching with side information for image classification. In CVPR, pages 3426-3433, 2012.
[9] Hao Xia, Steven C.H. Hoi, Rong Jin, and Peilin Zhao. Online multiple kernel similarity learning for visual search. TPAMI, 36(3):536-549, 2013.
[10] Gang Wang, David Forsyth, and Derek Hoiem. Improved object categorization and detection using comparative object similarity. TPAMI, 35(10):2442-2453, 2013.
[11] Wen Li, Lixin Duan, Dong Xu, and Ivor W. Tsang. Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. TPAMI, 36(6):1134-1148, 2013.
[12] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[13] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In CVPR, 2009.
[14] Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV, 2013.
[15] Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Y. Ng. Zero-shot learning through cross-modal transfer. In NIPS, pages 935-943, 2013.
[16] Lior Wolf and Noga Levy. The SVM-minus similarity score for video face recognition. In CVPR, 2013.
[17] Heng Yang and Ioannis Patras. Privileged information-based conditional regression forest for facial feature detection. In IEEE FG, pages 1-6, 2013.
[18] Xiaoyang Wang and Qiang Ji. A novel probabilistic approach utilizing clip attribute as hidden knowledge for event recognition. In ICPR, pages 3382-3385, 2012.
[19] Cezar Ionescu, Liefeng Bo, and Cristian Sminchisescu. Structural SVM for visual localization and continuous state estimation. In ICCV, pages 1157-1164, 2009.
[20] Qieyun Dai and Derek Hoiem. Learning to localize detected objects. In CVPR, pages 3322-3329, 2012.
[21] Viktoriia Sharmanska, Novi Quadrianto, and Christoph H. Lampert. Learning to rank using privileged information. In ICCV, pages 825-832, 2013.
[22] Vladimir Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 2006.
[23] Dmitry Pechyony and Vladimir Vapnik. On the theory of learning with privileged information. In NIPS, pages 1894-1902, 2010.
[24] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453-1484, 2005.
[25] Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, 2013.
[26] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (SURF). CVIU, 110(3):346-359, 2008.
[27] Ryan Farrell, Om Oza, Ning Zhang, Vlad I. Morariu, Trevor Darrell, and Larry S. Davis. Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. In ICCV, pages 161-168, 2011.