{"title": "Structured output regression for detection with partial truncation", "book": "Advances in Neural Information Processing Systems", "page_first": 1928, "page_last": 1936, "abstract": "We develop a structured output model for object category detection that explicitly accounts for alignment, multiple aspects and partial truncation in both training and inference. The model is formulated as large margin learning with latent variables and slack rescaling, and both training and inference are computationally efficient. We make the following contributions: (i) we note that extending the Structured Output Regression formulation of Blaschko and Lampert (ECCV 2008) to include a bias term significantly improves performance; (ii) that alignment (to account for small rotations and anisotropic scalings) can be included as a latent variable and efficiently determined and implemented; (iii) that the latent variable extends to multiple aspects (e.g. left facing, right facing, front) with the same formulation; and (iv), most significantly for performance, that truncated instances can be included in both training and inference with an explicit truncation mask. We demonstrate the method by training and testing on the PASCAL VOC 2007 data set -- training includes the truncated examples, and in testing object instances are detected at multiple scales, alignments, and with significant truncations.", "full_text": "Structured output regression for detection with\n\npartial truncation\n\nAndrea Vedaldi\n\nAndrew Zisserman\n\nDepartment of Engineering\n\nUniversity of Oxford\n\n{vedaldi,az}@robots.ox.ac.uk\n\nOxford, UK\n\nAbstract\n\nWe develop a structured output model for object category detection that explicitly\naccounts for alignment, multiple aspects and partial truncation in both training and\ninference. 
The model is formulated as large margin learning with latent variables\nand slack rescaling, and both training and inference are computationally ef\ufb01cient.\nWe make the following contributions: (i) we note that extending the Structured\nOutput Regression formulation of Blaschko and Lampert [1] to include a bias term\nsigni\ufb01cantly improves performance; (ii) that alignment (to account for small rota-\ntions and anisotropic scalings) can be included as a latent variable and ef\ufb01ciently\ndetermined and implemented; (iii) that the latent variable extends to multiple as-\npects (e.g. left facing, right facing, front) with the same formulation; and (iv),\nmost signi\ufb01cantly for performance, that truncated instances can be\nincluded in both training and inference with an explicit truncation mask.\nWe demonstrate the method by training and testing on the PASCAL VOC 2007\ndata set \u2013 training includes the truncated examples, and in testing object instances\nare detected at multiple scales, alignments, and with signi\ufb01cant truncations.\n\n1 Introduction\n\nThere has been a steady increase in the performance of object category detection as measured by the\nannual PASCAL VOC challenges [3]. The training data provided for these challenges speci\ufb01es if an\nobject is truncated \u2013 when the provided axis aligned bounding box does not cover the full extent of\nthe object. The principal cause of truncation is that the object partially lies outside the image area.\nMost participants simply disregard the truncated training instances and learn from the non-truncated\nones. This is a waste of training material, but more seriously many truncated instances are missed\nin testing, signi\ufb01cantly reducing the recall and hence decreasing overall recognition performance.\nIn this paper we develop a model (Fig. 
1) which explicitly accounts for truncation in both train-\ning and testing, and demonstrate that this leads to a signi\ufb01cant performance boost. The model is\nspeci\ufb01ed as a joint kernel and learnt using an extension of the structural SVM with latent variables\nframework of [13]. We use this approach as it allows us to address a second de\ufb01ciency of the pro-\nvided supervision \u2013 that the annotation is limited to axis aligned bounding boxes, even though the\nobjects may be rotated in-plane so that the box is a loose bound. The latent variables allow us to\nspecify a pose transformation for each instance so that we achieve a spatial correspondence be-\ntween all instances with the same aspect. We show the bene\ufb01ts of this for both the learnt model and\ntesting performance.\nOur model is complementary to that of Felzenszwalb et al. [4] who propose a latent SVM frame-\nwork, where the latent variables specify sub-part locations. The parts give their model some toler-\nance to in-plane rotation and foreshortening (though an axis aligned rectangle is still used for a \ufb01rst\n\n1\n\n\f(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 1: Model overview. Detection examples on the VOC images for\nthe bicycle class demonstrate that the model can handle severe trunca-\ntions (a-b), multiple objects (c), multiple aspects (d), and pose variations\n(small in-plane rotations) (e). Truncations caused by the image bound-\nary, (a) & (b), are a signi\ufb01cant problem for template based detectors,\nsince the template can then only partially align with the image. Small\nin-plane rotations and anisotropic rescalings of the template are approxi-\nmated ef\ufb01ciently by rearranging sub-blocks of the HOG template (white\nboxes in (e)).\n\n(e)\n\nstage as a \u201croot \ufb01lter\u201d) but they do not address the problem of truncation. 
Like them we base our\nimplementation on the ef\ufb01cient and successful HOG descriptor of Dalal and Triggs [2].\nPrevious authors have also considered occlusion (of which truncation is a special case). Williams et\nal. [11] used pixel wise binary latent variables to specify the occlusion and an Ising prior for spatial\ncoherence. Inference involved marginalizing out the latent variables using a mean \ufb01eld approxima-\ntion. There was no learning of the model from occluded data. For faces with partial occlusion, the\nresulting classi\ufb01er showed an improvement over a classi\ufb01er which did not model occlusion. Others\nhave explicitly included occlusion at the model learning stage, such as the Constellation model of\nFergus et al. [5] and the Layout Consistent Random Field model of Winn et al. [12]. There are nu-\nmerous papers on detecting faces with various degrees of partial occlusion from glasses, or synthetic\ntruncations [6, 7].\nOur contribution is to de\ufb01ne an appropriate joint kernel and loss function to be used in the context\nof structured output prediction. We then learn a structured regressor, mapping an image to a list\nof objects with their pose (or bounding box), while at the same time handling explicitly truncation\nand multiple aspects. Our choice of kernel is inspired by the restriction kernel of [1]; however, our\nkernel performs both restriction and alignment to a template, supports multiple templates to handle\ndifferent aspects and truncations, and adds a bias term which signi\ufb01cantly improves performance.\nWe re\ufb01ne pose beyond translation and scaling with an additional transformation selected from a\n\ufb01nite set of possible perturbations covering aspect ratio change and small in plane rotations. 
Instead\nof explicitly transforming the image with each element of this set (which would be prohibitively ex-\npensive) we introduce a novel approximation based on decomposing the HOG descriptor into small\nblocks and quickly rearranging them. To handle occlusions we selectively switch between an object\ndescriptor and an occlusion descriptor. To identify which portions of the template are occluded we\nuse a \ufb01eld of binary variables. These could be treated as latent variables; however, since we consider\nhere only occlusions caused by the image boundaries, we can infer them deterministically from the\nposition of the object relative to the image boundaries. Fig. 1 illustrates various detection examples\nincluding truncation, multiple instances and aspects, and in-plane rotation.\nIn training we improve the ground-truth pose parameters, since the bounding boxes and aspect asso-\nciations provided in PASCAL VOC are quite coarse indicators of the object pose. For each instance\nwe add a latent variable which encodes a pose adjustment. Such variables are then treated as part of\nlearning using the technique presented in [13]. However, while there the authors use the CCCP algo-\nrithm to treat the case of margin rescaling, here we show that a similar algorithm applies to the case\nof slack rescaling. The resulting optimization alternates between optimizing the model parameters\ngiven the latent variables (a convex problem solved by a cutting plane algorithm) and optimizing the\nlatent variables given the model (akin to inference).\n\n2\n\n\fThe overall method is computationally ef\ufb01cient both in training and testing, achieves very good\nperformance, and is able to learn and recognise truncated objects.\n\n2 Model\nOur purpose is to learn a function y = f(x), x \u2208 X , y \u2208 Y which, given an image x, returns the\nposes y of the objects portrayed in the image. 
We use the structured prediction learning framework\nof [9, 13], which considers, along with the input and output variables x and y, an auxiliary latent\nvariable h \u2208 H as well (we use h to specify a re\ufb01nement to the ground-truth pose parameters). The\nfunction f is then de\ufb01ned as f(x; w) = ŷ_x(w) where\n\n(ŷ_x(w), ĥ_x(w)) = argmax_{(y,h) \u2208 Y\u00d7H} F(x, y, h; w),   F(x, y, h; w) = ⟨w, \u03a8(x, y, h)⟩,   (1)\n\nand \u03a8(x, y, h) \u2208 R^d is a joint feature map. This maximization estimates both y and h from the\ndata x and corresponds to performing inference. Given training data (x1, y1), . . . , (xN , yN ), the\nparameters w are learned by minimizing the regularized empirical risk\n\nR(w) = (1/2) ‖w‖² + (C/N) Σ_{i=1}^{N} \u2206(yi, ŷi(w), ĥi(w)),   where ŷi(w) = ŷ_{xi}(w), ĥi(w) = ĥ_{xi}(w).   (2)\n\nHere \u2206(yi, y, h) \u2265 0 is an appropriate loss function that encodes the cost of an incorrect prediction.\nIn this section we develop the model \u03a8(x, y, h), or equivalently the joint kernel function\nK(x, y, h, x′, y′, h′) = ⟨\u03a8(x, y, h), \u03a8(x′, y′, h′)⟩, in a number of stages. We \ufb01rst de\ufb01ne the kernel\nfor the case of a single unoccluded instance in an image including scale and perturbing transforma-\ntions, then generalise this to include truncations and aspects; and \ufb01nally to multiple instances. An\nappropriate loss function \u2206(yi, y, h) is subsequently de\ufb01ned which takes into account the pose of\nthe object speci\ufb01ed by the latent variables.\n\n2.1 Restriction and alignment kernel\nDenote by R a rectangular region of the image x, and by x|R the image cropped to that rectangle.\nA restriction kernel [1] is the kernel K((x, R), (x′, R′)) = Kimage(x|R, x′|R′) where Kimage is an\nappropriate kernel between images. 
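As an illustrative aside (not part of the original formulation of [1]), the restriction kernel can be sketched in a few lines; the helper names `crop` and `restriction_kernel`, the box layout `(top, left, height, width)`, and the toy linear image kernel are our own assumptions:

```python
import numpy as np

def crop(x, R):
    """Restrict image x (an H x W array) to the rectangle R = (top, left, height, width)."""
    t, l, h, w = R
    return x[t:t + h, l:l + w]

def restriction_kernel(x, R, xp, Rp, k_image):
    """K((x, R), (x', R')) = K_image(x|_R, x'|_R')."""
    return k_image(crop(x, R), crop(xp, Rp))

# a toy image kernel: inner product of two equally sized crops
linear = lambda a, b: float(np.sum(a * b))

x = np.arange(36.0).reshape(6, 6)
xp = np.ones((6, 6))
K = restriction_kernel(x, (0, 0, 2, 2), xp, (4, 4, 2, 2), linear)  # compares two 2x2 crops
```

The kernel value depends only on the cropped regions, which is exactly the "restriction" idea: the rest of each image is ignored.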
The goal is that the joint kernel should be large when the two\nregions have similar appearance.\nOur kernel is similar, but captures both the idea of restriction and alignment. Let R0 be a reference\nrectangle and denote by R(p) = gpR0 the rectangle obtained from R0 by a geometric transformation\ngp : R² \u2192 R². We call p the pose of the rectangle R(p). Let \u00afx be an image de\ufb01ned on the reference\nrectangle R0 and let H(\u00afx) \u2208 R^d be a descriptor (e.g. SIFT, HOG, GIST [2]) computed from the\nimage appearance. Then a natural de\ufb01nition of the restriction and alignment kernel is\n\nK((x, p), (x′, p′)) = Kdescr(H(x; p), H(x′; p′))   (3)\n\nwhere Kdescr is an appropriate kernel for descriptors, and H(x; p) is the descriptor computed on the\ntransformed patch x as H(x; p) = H(g_p⁻¹ x).\nImplementation with HOG descriptors. Our choice of the HOG descriptor puts some limits on\nthe space of poses p that can be ef\ufb01ciently explored. To see this, it is necessary to describe how\nHOG descriptors are computed.\nThe computation starts by decomposing the image x into cells of d \u00d7 d pixels (here d = 8). The\ndescriptor of a cell is the nine dimensional histogram of the orientation of the image gradient inside\nthe cell (appropriately weighted and normalized as in [2]). We obtain the HOG descriptor of a\nrectangle of w \u00d7 h cells by stacking the enclosed cell descriptors (this is a 9 \u00d7 w \u00d7 h vector). Thus,\ngiven the cell histograms, we can immediately obtain the HOG descriptors H(x, y) for all the cell-\naligned translations (x, y) of the dw \u00d7 dh pixels rectangle. 
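The cell-stacking construction above can be sketched as follows; this is a minimal illustration assuming precomputed per-cell 9-bin histograms in an (H, W, 9) array, with hypothetical helper names (`hog_at`, `scores_all_translations`) and a naive double loop in place of the convolution the paper uses:

```python
import numpy as np

def hog_at(cells, x, y, w, h):
    """HOG descriptor of a w x h cell window anchored at cell (x, y):
    the stacked 9-d cell histograms, a 9*w*h vector."""
    return cells[y:y + h, x:x + w, :].reshape(-1)

def scores_all_translations(cells, template):
    """Score <template, H(x, y)> for every cell-aligned placement of a
    template given as an (h, w, 9) array; in practice this reduces to a
    'valid' correlation, done here naively for clarity."""
    H, W, _ = cells.shape
    h, w, _ = template.shape
    S = np.empty((H - h + 1, W - w + 1))
    for y in range(S.shape[0]):
        for x in range(S.shape[1]):
            S[y, x] = np.dot(hog_at(cells, x, y, w, h), template.reshape(-1))
    return S

cells = np.zeros((5, 5, 9)); cells[2, 3, :] = 1.0   # one "active" cell
tpl = np.zeros((2, 2, 9));   tpl[0, 0, :] = 1.0     # template responds to its top-left cell
S = scores_all_translations(cells, tpl)
```

Scales are handled outside this sketch, by recomputing `cells` on rescaled images as described next.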
To span rectangles of different scales\nthis construction is simply repeated on the rescaled image g_s⁻¹ x, where g_s(z) = γ^s z is a rescaling,\nγ > 0, and s is a discrete scale parameter.\n\n3\n\n\fTo further re\ufb01ne pose beyond scale and translation, here we consider an additional perturbation gt,\nindexed by a pose parameter t, selected from a set of transformations g1, . . . , gT (in the experiments\nwe use 16 transformations, obtained from all combinations of rotations of \u00b15 and \u00b110 degrees and\nscaling along x of 95%, 90%, 80% and 70%). Those could be achieved in the same manner as\nscaling by transforming the image g_t⁻¹ x for each one, but this would be very expensive (we would\nneed to recompute the cell descriptors every time). Instead, we approximate such transformations\nby rearranging the cells of the template (Fig. 1 and 2). First, we partition the w \u00d7 h cells of the\ntemplate into blocks of m \u00d7 m cells (e.g. m = 4). Then we transform the center of each block\naccording to gt and we translate the block to the new center (approximated to units of cells). Notice\nthat we could pick m = 1 (i.e. move each cell of the template independently), but we prefer to use\nblocks as this accelerates inference (see Sect. 4).\nHence, pose is for us a tuple (x, y, s, t) representing translation, scale, and additional perturbation.\nSince HOG descriptors are designed to be compared with a linear kernel, we de\ufb01ne\n\nK((x, p), (x′, p′)) = ⟨\u03a8(x, p), \u03a8(x′, p′)⟩,   \u03a8(x, p) = H(x; p).   (4)\n\n2.2 Modeling truncations\n\nIf part of the object is occluded (either by clutter or by the image boundaries), some of the descriptor\ncells will be either unpredictable or unde\ufb01ned. We explicitly deal with occlusion at the granularity\nof the HOG cells by adding a \ufb01eld of w \u00d7 h binary indicator variables v \u2208 {0, 1}^{wh}. 
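For the image-boundary case used in the experiments, the visibility field and the simplified truncation feature (cf. Eq. (6) below) can be sketched as follows; this is a toy illustration with hypothetical names (`visibility`, `feature_map`) and a row-major cell ordering assumed for the stacked HOG vector:

```python
import numpy as np

def visibility(px, py, w, h, W, H):
    """v(p): 1 for template cells that fall inside the W x H cell grid,
    0 for cells truncated by the image boundary. The placement (px, py)
    may be partially (or wholly) outside the image."""
    ys, xs = np.mgrid[py:py + h, px:px + w]
    return ((xs >= 0) & (xs < W) & (ys >= 0) & (ys < H)).astype(float).reshape(-1)

def feature_map(hog, v):
    """Simplified truncation feature: [ (v kron 1_9) * H ; wh - |v| ].
    Visible cells keep their histograms; truncated cells are nulled and
    counted by a single trailing scalar."""
    masked = np.repeat(v, 9) * hog          # null the truncated cells
    return np.concatenate([masked, [len(v) - v.sum()]])

v = visibility(px=-1, py=0, w=3, h=2, W=10, H=10)  # left column truncated
hog = np.ones(3 * 2 * 9)
psi = feature_map(hog, v)
```

Because v here is a deterministic function of the placement, no latent occlusion variables are needed for boundary truncation.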
Here vj = 1\nmeans that the j-th cell of the HOG descriptor H(x, p) should be considered to be visible, and\nvj = 0 that it is occluded. We thus de\ufb01ne a variant of (4) by considering the feature map\n\n\u03a8(x, p, v) = [ (v \u2297 1_9) ⊙ H(x, p) ; ((1_wh \u2212 v) \u2297 1_9) ⊙ H(x, p) ]   (5)\n\nwhere 1_d is a d-dimensional vector of all ones, \u2297 denotes the Kronecker product, and ⊙ the Hadamard\n(component wise) product. To understand this expression, recall that H is the stacking of w \u00d7 h 9-\ndimensional histograms, so for instance (v \u2297 1_9) ⊙ H(x, p) preserves the visible cells and nulls the\nothers. Eq. (5) is effectively de\ufb01ning a template for the object and one for the occlusions.\nNotice that v are in general latent variables and should be estimated as such. However here we\nnote that the most severe and frequent occlusions are caused by the image boundaries (\ufb01nite \ufb01eld of\nview). In this case, which we explore in the experiments, we can write v = v(p) as a function of\nthe pose p, and remove the explicit dependence on v in \u03a8. Moreover the truncated HOG cells are\nunde\ufb01ned and can be assigned a nominal common value. So here we work with a simpli\ufb01ed kernel,\nin which occlusions are represented by a scalar proportional to the number of truncated cells:\n\n\u03a8(x, p) = [ (v(p) \u2297 1_9) ⊙ H(x, p) ; wh \u2212 |v(p)| ]   (6)\n\n2.3 Modeling aspects\n\nA template model works well as long as pose captures accurately enough the transformations result-\ning from changes in the viewing conditions. In our model, the pose p, combined with the robustness\nof the HOG descriptor, can absorb a fair amount of viewpoint induced deformation. It cannot, how-\never, capture the 3D structure of a physical object. Therefore, extreme changes of viewpoint require\nswitching between different templates. 
To this end, we augment pose with an aspect indicator a (so\nthat pose is the tuple p = (x, y, s, t, a)), which indicates which template to use.\nNote that now the concept of pose has been generalized to include a parameter, a, which, differently\nfrom the others, does not specify a geometric transformation. Nevertheless, pose speci\ufb01es how the\nmodel should be aligned to the image, i.e. by (i) choosing the template that corresponds to the\naspect a, (ii) translating and scaling such template according to (x, y, s), and (iii) applying to it\nthe additional perturbation gt. In testing, all such parameters are estimated as part of inference.\nIn training, they are initialized from the ground truth data annotations (bounding boxes and aspect\nlabels), and are then re\ufb01ned by estimating the latent variables (Sect. 2.4).\n\n4\n\n\fWe assign each aspect to a different \u201cslot\u201d of the feature vector \u03a8(x, p). Then we null all but one\nof the slots, as indicated by a:\n\n\u03a8(x, p) = [ \u03b4_{a=1} \u03a8¹(x; p) ; · · · ; \u03b4_{a=A} \u03a8^A(x; p) ]   (7)\n\nwhere \u03a8^a(x; p) is a feature vector in the form of (6). In this way, we compare different templates\nfor different aspects, as indicated by a.\nThe model can be extended to capture symmetries of the aspects (resulting from symmetries of the\nobjects). For instance, a left view of a bicycle can be obtained by mirroring a right view, so that the\nsame template can be used for both aspects by de\ufb01ning\n\n\u03a8(x; p) = \u03b4_{a=left} \u03a8_left(x; p) + \u03b4_{a=right} \ufb02ip \u03a8_right(x; p),   (8)\n\nwhere \ufb02ip is the operator that \u201c\ufb02ips\u201d the descriptor (this can be de\ufb01ned in general by computing the\ndescriptor of the mirrored image, but for HOG it reduces to rearranging the descriptor components).\nThe problem remains of assigning aspects to the training data. 
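The aspect-slot construction of Eq. (7) can be sketched as follows. This is a toy illustration with hypothetical names; in particular `flip_hog` is a simplification that only reverses cell columns and bin order, whereas the true HOG flip is a specific permutation of the descriptor components:

```python
import numpy as np

def aspect_feature(psi_a, a, A):
    """Eq. (7)-style slotting: place the aspect-a feature vector in slot a
    of A slots and zero the others, so each aspect is scored against its
    own template by a single linear model."""
    d = psi_a.size
    out = np.zeros(A * d)
    out[a * d:(a + 1) * d] = psi_a
    return out

def flip_hog(hog, w, h):
    """Mirror a stacked (h*w*9) descriptor left-right (simplified: reverse
    cell columns and bin order; real HOG uses a fixed bin permutation)."""
    G = hog.reshape(h, w, 9)
    return G[:, ::-1, ::-1].reshape(-1)

psi = np.arange(4.0)
phi = aspect_feature(psi, a=1, A=3)  # only slot 1 is populated
```

Slotting keeps the aspects independent; the mirrored-aspect trick of Eq. (8) instead shares one template between left and right views via the flip.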
In the PASCAL VOC data, objects are\nlabeled with one of \ufb01ve aspects: front, left, right, back, unde\ufb01ned. However, such assignments may\nnot be optimal for use in a particular algorithm. Fortunately, our method is able to automatically\nreassign aspects as part of the estimation of the hidden variables (Sect. 2.4 and Fig. 2).\n\n2.4 Latent variables\n\nThe PASCAL VOC bounding boxes yield only a coarse estimate of the ground truth pose parameters\n(e.g. they do not contain any information on the object rotation) and the aspect assignments may\nalso be suboptimal (see previous section). Therefore, we introduce latent variables h = (\u03b4p) that\nencode an adjustment to the ground-truth pose parameters y = (p). In practice, the adjustment \u03b4p\nis a small variation of translation x, y, scale s, and perturbation t, and can switch the aspect a\naltogether.\nWe modify the feature maps to account for the adjustment in the obvious way. For instance (6)\nbecomes\n\n\u03a8(x, p, \u03b4p) = [ (v(p + \u03b4p) \u2297 1_9) ⊙ H(x, p + \u03b4p) ; wh \u2212 |v(p + \u03b4p)| ]   (9)\n\n2.5 Variable number of objects: loss function, bias, training\n\nSo far, we have de\ufb01ned the feature map \u03a8(x, y) = \u03a8(x; p) for the case in which the label y = (p)\ncontains exactly one object, but an image may contain no or multiple object instances (denoted\nrespectively y = ε and y = (p1, . . . , pn)). We de\ufb01ne the loss function between a ground truth label\nyi and the estimated output y as\n\n\u2206(yi, y) = { 0   if yi = y = ε,\n             1 \u2212 overl(B(p), B(p′))   if yi = (p) and y = (p′),\n             1   if yi ≠ ε and y = ε, or yi = ε and y ≠ ε,   (10)\n\nwhere B is the ground truth bounding box, and B′ is the prediction (the smallest axis aligned bound-\ning box that contains the warped template gpR0). 
The overlap score between B and B′ is given by\noverl(B, B′) = |B \u2229 B′|/|B \u222a B′|. Note that the ground truth poses are de\ufb01ned so that B(pl)\nmatches the PASCAL provided bounding boxes [1] (or the manually extended ones for the trun-\ncated ones).\nThe hypothesis y = ε (no object) receives score F(x, ε; w) = 0 by de\ufb01ning \u03a8(x, ε) = 0 as in [1].\nIn this way, the hypothesis y = (p) is preferred to y = ε only if F(x, p; w) > F(x, ε; w) = 0,\nwhich implicitly sets the detection threshold to zero. However, there is no reason to assume that this\nthreshold should be appropriate (in Fig. 2 we show that it is not). To learn an arbitrary threshold,\nit suf\ufb01ces to append to the feature vector \u03a8(x, p) a large constant κ_bias, so that the score of the\nhypothesis y = (p) becomes F(x, (p); w) = ⟨w, \u03a8(x, p)⟩ + κ_bias w_bias. Note that, since the constant\nis large, the weight w_bias remains small and has negligible effect on the SVM regularization term.\n\n5\n\n\fFinally, an image may contain more than one instance of the object. The model can be extended\nto this case by setting F(x, y; w) = Σ_{l=1}^{L} F(x, p_l; w) + R(y), where R(y) encodes a \u201crepulsive\u201d\nforce that prevents multiple overlapping detections of the same object. Performing infer-\nence with such a model becomes however combinatorial and in general very dif\ufb01cult. Fortu-\nnately, in training the problem can be avoided entirely. If an image contains multiple instances,\nthe image is added to the training set multiple times, each time \u201cactivating\u201d one of the instances,\nand \u201cdeactivating\u201d the others. Here \u201cdeactivating\u201d an instance simply means removing it from\nthe detector search space. Formally, let p0 be the pose of the active instance and p1, . . . , pN\nthe poses of the inactive ones. 
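As an aside, the overlap score and the deactivation rule of this section can be sketched in a few lines; the box layout (x1, y1, x2, y2) and the helper names (`overl`, `deactivated`) are our own illustrative choices:

```python
def overl(B, Bp):
    """Intersection-over-union of axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(B[2], Bp[2]) - max(B[0], Bp[0]))
    iy = max(0.0, min(B[3], Bp[3]) - max(B[1], Bp[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(B) + area(Bp) - inter
    return inter / union if union > 0 else 0.0

def deactivated(B, B0, inactive):
    """Remove pose p from the search space iff its box B overlaps some
    inactive instance at least as much as the active one B0, and by at
    least 0.2 (the rule stated in Sect. 2.5)."""
    return max(overl(B, Bi) for Bi in inactive) >= max(overl(B, B0), 0.2)
```

Both the loss of Eq. (10) and the search-space pruning below reduce to this single overlap function.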
A pose p is removed from the search space if, and only if,\nmax_i overl(B(p), B(pi)) \u2265 max{overl(B(p), B(p0)), 0.2}.\n\n3 Optimisation\n\nMinimising the regularised risk R(w) as de\ufb01ned by Eq. (2) is dif\ufb01cult as the loss depends on w\nthrough ŷi(w) and ĥi(w) (see Eq. (1)). It is however possible to optimise an upper bound (derived\nbelow) given by\n\n(1/2) ‖w‖² + (C/N) Σ_{i=1}^{N} max_{(y,h) \u2208 Y\u00d7H} \u2206(yi, y, h) [1 + ⟨w, \u03a8(xi, y, h)⟩ \u2212 ⟨w, \u03a8(xi, yi, h*_i(w))⟩].   (11)\n\nHere h*_i(w) = argmax_{h \u2208 H} ⟨w, \u03a8(xi, yi, h)⟩ completes the label (yi, h*_i(w)) of the sample xi (of\nwhich only the observed part yi is known from the ground truth).\n\nAlternation optimization. Eq. (11) is not a convex energy function due to the dependency of h*_i(w)\non w. Similarly to [13], however, it is possible to \ufb01nd a local minimum by alternately optimizing w\nand estimating h*_i. To do this, the CCCP algorithm proposed in [13] for the case of margin rescaling\nmust be extended to the slack rescaling formulation used here.\nStarting from an estimate wt of the solution, de\ufb01ne h*_it = h*_i(wt), so that, for any w,\n\n⟨w, \u03a8(xi, yi, h*_i(w))⟩ = max_{h′} ⟨w, \u03a8(xi, yi, h′)⟩ \u2265 ⟨w, \u03a8(xi, yi, h*_it)⟩\n\nand the equality holds for w = wt. Hence the energy (11) is bounded by\n\n(1/2) ‖w‖² + (C/N) Σ_{i=1}^{N} max_{(y,h) \u2208 Y\u00d7H} \u2206(yi, y, h) [1 + ⟨w, \u03a8(xi, y, h)⟩ \u2212 ⟨w, \u03a8(xi, yi, h*_it)⟩]   (12)\n\nand the bound is tight for w = wt. Optimising (12) will therefore result in an improvement of the\nenergy (11) as well. The optimisation of (12) can be carried out with the cutting-plane technique of [9].\n\nDerivation of the bound (11). 
The derivation involves a sequence of bounds, starting from\n\n\u2206(yi, ŷi(w), ĥi(w)) \u2264 \u2206(yi, ŷi(w), ĥi(w)) [1 + ⟨w, \u03a8(xi, ŷi(w), ĥi(w))⟩ \u2212 ⟨w, \u03a8(xi, yi, h*_i(w))⟩]   (13)\n\nThis bound holds because, by construction, the quantity in the square brackets is not smaller than\none, as can be veri\ufb01ed by substituting the de\ufb01nitions of ŷi(w), ĥi(w) and h*_i(w). We further upper\nbound the loss by\n\n\u2206(yi, ŷi(w), ĥi(w)) \u2264 \u2206(yi, y, h) [1 + ⟨w, \u03a8(xi, y, h)⟩ \u2212 ⟨w, \u03a8(xi, yi, h*_i(w))⟩] |_{y=ŷi(w), h=ĥi(w)}\n\u2264 max_{(y,h) \u2208 Y\u00d7H} \u2206(yi, y, h) [1 + ⟨w, \u03a8(xi, y, h)⟩ \u2212 ⟨w, \u03a8(xi, yi, h*_i(w))⟩]   (14)\n\nSubstituting this bound into (2) yields (11). Note that ŷi(w) and ĥi(w) are de\ufb01ned as the max-\nimiser of ⟨w, \u03a8(xi, y, h)⟩ alone (see Eq. 1), while the energy maximised in (14) depends on the loss\n\u2206(yi, y, h) as well.\n\n6\n\n\f(b)\n\n(a)\n\nFigure 2: Effect of different model components. The left panel evaluates the effect of differ-\nent components of the model on the task of learning a detector for the left-right facing PASCAL\nVOC 2007 bicycles. In order of increasing AP (see legend): baseline model (see text); bias term\n(Sect. 2.5); detecting truncated instances, training on truncated instances, and counting the trun-\ncated cells as a feature (Sect. 2.2); with searching over small translation, scaling, rotation, skew\n(Sect. 2.1). Right panel: (a) Original VOC speci\ufb01ed bounding box and aspect; (b) alignment and as-\npect after pose inference \u2013 in addition to translation and scale, our templates are searched over a set\nof small perturbations. 
This is implemented ef\ufb01ciently by breaking the template into blocks (dashed\nboxes) and rearranging them. Note that blocks can partially overlap to capture foreshortening. The\nground truth pose parameters are approximate because they are obtained from bounding boxes (a).\nThe algorithm improves their estimate as part of inference of the latent variables h. Notice that not\nonly translation, scale, and small jitters are re-estimated, but also the aspect subclass can be updated.\nIn the example, an instance originally labeled as misc (a) is reassigned to the left aspect (b).\n\n4 Experiments\n\nData. As training data we use the PASCAL VOC annotations. Each object instance is labeled\nwith a bounding box and a categorical aspect variable (left, right, front, back, unde\ufb01ned). From\nthe bounding box we estimate translation and scale of the object, and we use aspect to select one\nof multiple HOG templates. Symmetric aspects (e.g. left and right) are mapped to the same HOG\ntemplate as suggested in Sect. 2.3.\nWhile our model is capable of correctly handling truncations, truncated bounding boxes provide a\npoor estimate of the object pose, which prevents using such objects for training. While we\ncould simply avoid training with truncated boxes (or generate arti\ufb01cially truncated examples whose\npose would be known), we prefer exploiting all the available training data. To do this, we manually\naugment all truncated PASCAL VOC annotations with an additional \u201cphysical\u201d bounding box. The\npurpose is to provide a better initial guess for the object pose, which is then re\ufb01ned by optimizing\nover the latent variables.\nTraining and testing speed. Performing inference with the model requires evaluating ⟨w, \u03a8(x, p)⟩\nfor all possible poses p. This means matching a HOG template O(WHTA) times, where W \u00d7\nH is the dimension of the image in cells, T the number of perturbations (Sect. 
2.1), and A the\nnumber of aspects (Sect. 2.3).1 For a given scale and aspect, matching the template for all locations\nreduces to convolution. Moreover, by breaking the template into blocks (Fig. 2) and pre-computing\nthe convolution with each of those, we can quickly compute perturbations of the template. All in\nall, detection requires roughly 30 seconds per image with the full model and four aspects. The\ncutting plane algorithm used to minimize (12) requires at each iteration solving problems similar\nto inference. This can be easily parallelized, greatly improving training speed. To detect additional\nobjects at test time we run inference multiple times, but excluding all detections that overlap by\nmore than 20% with any previously detected object.\n\n1Note that we do not multiply by the number S of scales as at each successive scale W and H are reduced\n\ngeometrically.\n\n7\n\n[Figure 2, left panel: precision\u2013recall curves for VOC 2007 left\u2013right bicycles. AP: baseline 22.9; + bias 33.7; + test w/ trunc. 55.7; + train w/ trunc. 58.6; + empty cells count 60.0; + transformations 63.0.]\n\n\fFigure 3: Top row. Examples of detected bicycles. The dashed boxes are bicycles that were detected\nwith or without truncation support, while the solid ones were detectable only when truncations were\nconsidered explicitly. Bottom row. Some cases of correct detections despite extreme truncation for\nthe horse class.\n\nBene\ufb01t of various model components. Fig. 2 shows how the model improves by the successive\nintroduction of the various features of the model. The example is carried out on the VOC left-right\nfacing bicycle, but similar effects were observed for other categories. The baseline model uses\nonly the HOG template without bias, truncation handling, or pose re\ufb01nement (Sect. 2.1). The two most\nsigni\ufb01cant improvements are (a) the ability to detect truncated instances (+22% AP, Fig. 3) and\n(b) the addition of the bias (+11% AP). 
Training with the truncated instances, adding the number\nof occluded HOG cells as a feature component, and adding jitters beyond translation and scaling all\nyield an improvement of about +2\u20133% AP.\nFull model. The model was trained to detect the class bicycle in the PASCAL VOC 2007 data, using\n\ufb01ve templates, initialized from the PASCAL labeling left, right, front/rear, other. Initially, the pose\nre\ufb01nement h is null and the alternation optimization algorithm is iterated \ufb01ve times to estimate\nthe model w and re\ufb01nement h. The detector is then tested on all the test data, enabling multiple\ndetections per image, and computing average-precision as speci\ufb01ed by [3]. The AP score was 47%.\nBy comparison, the state of the art for this category [8] achieves 56%. The experiment was repeated\nfor the class horse, obtaining a score of 40%. By comparison, the state of the art on this category,\nour MKL sliding window classi\ufb01er [10], achieves 51%. Note that the proposed method uses only\nHOG, while the others use a combination of at least two features. However [4], using only HOG but\na \ufb02exible part model, also achieves superior results. Further experiments are needed to evaluate the\ncombined bene\ufb01ts of truncation/occlusion handling (proposed here) with multiple features [10] and\n\ufb02exible parts [4].\n\nConclusions\n\nWe have shown how structured output regression with latent variables provides an integrated and ef-\nfective solution to many problems in object detection: truncations, pose variability, multiple objects,\nand multiple aspects can all be dealt with in a consistent framework.\nWhile we have shown that truncated examples can be used for training, we had to manually extend\nthe PASCAL VOC annotations for these cases to include rough \u201cphysical\u201d bounding boxes (as a hint\nfor the initial pose parameters). 
We plan to further extend the approach to infer pose for truncated\nexamples in a fully automatic fashion (weak supervision).\n\nAcknowledgments. We are grateful for discussions with Matthew Blaschko. Funding was provided\nby the EU under ERC grant VisRec no. 228180; the RAEng, Microsoft, and ONR MURI N00014-\n07-1-0182.\n\n8\n\n\fReferences\n[1] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regres-\n\nsion. In Proc. ECCV, 2008.\n\n[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. CVPR,\n\n2005.\n\n[3] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL\nVisual Object Classes Challenge 2008 (VOC2008) Results. http://www.pascal-network.org/\nchallenges/VOC/voc2008/workshop/index.html, 2008.\n\n[4] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with\n\ndiscriminatively trained part based models. PAMI, 2009.\n\n[5] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-\ninvariant learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, volume 2, pages 264\u2013271, June 2003.\n\n[6] K. Hotta. Robust face detection under partial occlusion. In Proceedings of the IEEE Interna-\n\ntional Conference on Image Processing, 2004.\n\n[7] Y. Y. Lin, T. L. Liu, and C. S. Fuh. Fast object detection with occlusions. In Proceedings of\n\nthe European Conference on Computer Vision, pages 402\u2013413. Springer-Verlag, May 2004.\n\n[8] P. Schnitzspan, M. Fritz, S. Roth, and B. Schiele. Discriminative structure learning of hierar-\n\nchical representations for object detection. In Proc. CVPR, 2009.\n\n[9] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for\n\ninterdependent and structured output spaces. In Proc. ICML, 2004.\n\n[10] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. 
Multiple kernels for object detection.\n\nIn Proc. ICCV, 2009.\n\n[11] O. Williams, A. Blake, and R. Cipolla. The variational ising classi\ufb01er (VIC) algorithm for\n\ncoherently contaminated data. In Proc. NIPS, 2005.\n\n[12] J. Winn and J. Shotton. The Layout Consistent Random Field for Recognizing and Segmenting\n\nPartially Occluded Objects. In Proc. CVPR, 2006.\n\n[13] C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In Proc. ICML,\n\n2009.\n\n9\n\n\f", "award": [], "sourceid": 88, "authors": [{"given_name": "Andrea", "family_name": "Vedaldi", "institution": null}, {"given_name": "Andrew", "family_name": "Zisserman", "institution": null}]}