{"title": "Region-based Segmentation and Object Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 655, "page_last": 663, "abstract": "Object detection and multi-class image segmentation are two closely related tasks that can be greatly improved when solved jointly by feeding information from one task to the other. However, current state-of-the-art models use a separate representation for each task making joint inference clumsy and leaving classification of many parts of the scene ambiguous. In this work, we propose a hierarchical region-based approach to joint object detection and image segmentation. Our approach reasons about pixels, regions and objects in a coherent probabilistic model. Importantly, our model gives a single unified description of the scene. We explain every pixel in the image and enforce global consistency between all variables in our model. We run experiments on challenging vision datasets and show significant improvement over state-of-the-art object detection accuracy.", "full_text": "Region-based Segmentation and Object Detection\n\nDaphne Koller2\nStephen Gould1\n1 Department of Electrical Engineering, Stanford University\n\nTianshi Gao1\n\n2 Department of Computer Science, Stanford University\n\n{sgould,tianshig,koller}@cs.stanford.edu\n\nAbstract\n\nObject detection and multi-class image segmentation are two closely related tasks\nthat can be greatly improved when solved jointly by feeding information from\none task to the other [10, 11]. However, current state-of-the-art models use a\nseparate representation for each task making joint inference clumsy and leaving\nthe classi\ufb01cation of many parts of the scene ambiguous.\nIn this work, we propose a hierarchical region-based approach to joint object\ndetection and image segmentation. Our approach simultaneously reasons about\npixels, regions and objects in a coherent probabilistic model. 
Pixel appearance\nfeatures allow us to perform well on classifying amorphous background classes,\nwhile the explicit representation of regions facilitates the computation of more sophisticated\nfeatures necessary for object detection. Importantly, our model gives\na single unified description of the scene: we explain every pixel in the image and\nenforce global consistency between all random variables in our model.\nWe run experiments on the challenging Street Scene dataset [2] and show significant\nimprovement over state-of-the-art results for object detection accuracy.\n\n1 Introduction\n\nObject detection is one of the great challenges of computer vision, having received continuous\nattention since the birth of the field. The most common modern approaches scan the image for\ncandidate objects and score each one. This is typified by the sliding-window object detection approach\n[22, 20, 4], but is also true of most other detection schemes (such as centroid-based methods\n[13] or boundary edge methods [5]). The most successful approaches combine cues from\ninside the object boundary (local features) with cues from outside the object (contextual cues),\ne.g., [9, 20, 6]. Recent works are adopting a more holistic approach by combining the output of multiple\nvision tasks [10, 11] and are reminiscent of some of the earliest work in computer vision [1].\nHowever, these recent works use a different representation for each subtask, forcing information\nsharing to be done through awkward feature mappings. Another difficulty with these approaches\nis that the subtask representations can be inconsistent. For example, a bounding-box based object\ndetector includes many pixels within each candidate detection window that are not part of the object\nitself. Furthermore, multiple overlapping candidate detections contain many pixels in common.\nHow these pixels should be treated is ambiguous in such approaches. 
A model that uniquely identifies\neach pixel is not only more elegant, but is also more likely to produce reliable results since it\nencodes a bias of the true world (i.e., a visible pixel belongs to only one object).\n\nIn this work, we propose a more integrated region-based approach that combines multi-class image\nsegmentation with object detection. Specifically, we propose a hierarchical model that reasons\nsimultaneously about pixels, regions and objects in the image, rather than scanning arbitrary windows.\nAt the region level we label pixels as belonging to one of a number of background classes\n(currently sky, tree, road, grass, water, building, mountain) or a single foreground class. The foreground\nclass is then further classified, at the object level, into one of our known object classes\n(currently car and pedestrian) or unknown.\n\nOur model builds on the scene decomposition model of Gould et al. [7], which aims to decompose\nan image into coherent regions by dynamically moving pixels between regions and evaluating these\nmoves relative to a global energy objective. These bottom-up pixel moves result in regions with coherent\nappearance. Unfortunately, complex objects such as people or cars are composed of several\ndissimilar regions which will not be combined by this bottom-up approach. Our new hierarchical\napproach facilitates both bottom-up and top-down reasoning about the scene. For example, we\ncan propose an entire object composed of multiple regions and evaluate this joint move against our\nglobal objective. Thus, our hierarchical model enjoys the best of two worlds: Like multi-class image\nsegmentation, our model uniquely explains every pixel in the image and groups these into semantically\ncoherent regions. Like object detection, our model uses sophisticated shape and appearance\nfeatures computed over candidate object locations with precise boundaries. 
Furthermore, our joint\nmodel over regions and objects allows context to be encoded through direct semantic relationships\n(e.g., \u201ccar\u201d is usually found on \u201croad\u201d).\n\n2 Background and Related Work\n\nOur method inherits features from the sliding-window object detector works, such as Torralba et al.\n[19] and Dalal and Triggs [4], and the multi-class image segmentation work of Shotton et al. [16].\nWe further incorporate into our model many novel ideas for improving object detection via scene\ncontext. The innovative works that inspire ours include predicting camera viewpoint for estimating\nthe real world size of object candidates [12], relating \u201cthings\u201d (objects) to nearby \u201cstuff\u201d (regions)\n[9], co-occurrence of object classes [15], and general scene \u201cgist\u201d [18].\n\nRecent works go beyond simple appearance-based context and show that holistic scene understanding\n(both geometric [11] and more general [10]) can significantly improve performance by\ncombining related tasks. These works use the output of one task (e.g., object detection) to provide\nfeatures for other related tasks (e.g., depth perception). While they are appealing in their simplicity,\ncurrent models are not tightly coupled and may result in incoherent outputs (e.g., the pixels in\na bounding box identified as \u201ccar\u201d by the object detector may be labeled as \u201csky\u201d by an image\nsegmentation task). In our method, all tasks use the same region-based representation, which forces\nconsistency between variables. Intuitively this leads to more robust predictions.\n\nThe decomposition of a scene into regions to provide the basis for vision tasks exists in some\nscene parsing works. Notably, Tu et al. [21] describe an approach for identifying regions in the\nscene. Their approach has only been shown to be effective on text and faces, leaving much of the\nimage unexplained. Sudderth et al. 
[17] relate scenes, objects and parts in a single hierarchical\nframework, but do not provide an exact segmentation of the image. Gould et al. [7] provide a complete\ndescription of the scene using dynamically evolving decompositions that explain every pixel\n(both semantically and geometrically). However, the method cannot distinguish between\nforeground objects and often leaves them segmented into multiple dissimilar pieces. Our work builds\non this approach with the aim of classifying objects.\n\nOther works attempt to integrate tasks such as object detection and multi-class image segmentation\ninto a single CRF model. However, these models either use a different representation for object\nand non-object regions [23] or rely on a pixel-level representation [16]. The former does not enforce\nlabel consistency between object bounding boxes and the underlying pixels while the latter does not\ndistinguish between adjacent objects of the same class.\n\nRecent work by Gu et al. [8] also uses regions for object detection instead of the traditional sliding-window\napproach. However, unlike our method, they use a single over-segmentation of the image\nand make the strong assumption that each segment represents a (probabilistically) recognizable object\npart. Our method, on the other hand, assembles objects (and background regions) using segments\nfrom multiple different over-segmentations. Using multiple over-segmentations avoids errors\nmade by any one segmentation. Furthermore, we incorporate background regions, which allows us to\neliminate large portions of the image, thereby reducing the number of component regions that need\nto be considered for each object.\n\nLiu et al. [14] use a non-parametric approach to image labeling by warping a given image onto a\nlarge set of labeled images and then combining the results. This is a very effective approach since it\nscales easily to a large number of classes. 
However, the method does not attempt to understand the\nscene semantics. In particular, their method is unable to break the scene into separate objects (e.g., a\nrow of cars will be parsed as a single region) and cannot capture combinations of classes not present\nin the training set. As a result, the approach performs poorly on most foreground object classes.\n\n3 Region-based Model for Object Detection\n\nWe now present an overview of our joint object detection and scene segmentation model. This model\ncombines scene structure and semantics in a coherent energy function.\n\n3.1 Energy Function\n\nOur model builds on the work of Gould et al. [7], which aims to decompose a scene into a number ($K$)\nof semantically consistent regions. In that work, each pixel $p$ in the image $I$ belongs to exactly one\nregion, identified by its region-correspondence variable $R_p \\in \\{1, \\ldots, K\\}$. The $r$-th region is then\nsimply the set of pixels $P_r$ whose region-correspondence variable equals $r$, i.e., $P_r = \\{p : R_p = r\\}$.\nIn our notation we will always use $p$ and $q$ to denote pixels, $r$ and $s$ to denote regions, and $o$ to denote\nobjects. Double indices indicate pairwise terms between adjacent entities (e.g., $pq$ or $rs$).\n\nRegions, while visually coherent, may not encompass entire objects. Indeed, in the work of Gould\net al. [7] foreground objects tended to be over-segmented into multiple regions. We address this deficiency\nby allowing an object to be composed of many regions (rather than trying to force dissimilar\nregions to merge). The object to which a region belongs is denoted by its object-correspondence\nvariable $O_r \\in \\{\\emptyset, 1, \\ldots, N\\}$. Some regions, such as background, do not belong to any object,\nwhich we denote by $O_r = \\emptyset$. Like regions, the set of pixels that comprise the $o$-th object is denoted\nby $P_o = \\bigcup_{r : O_r = o} P_r$. 
Currently, we do not allow a single region or object to be composed of\nmultiple disconnected components.\n\nRandom variables are associated with the various entities (pixels, regions and objects) in our\nmodel. Each pixel has a local appearance feature vector $\\alpha_p \\in \\mathbb{R}^n$ (see [7]). Each region has an\nappearance variable $A_r$ that summarizes the appearance of the region as a whole, a semantic class\nlabel $S_r$ (such as \u201croad\u201d or \u201cforeground object\u201d), and an object-correspondence variable $O_r$. Each\nobject, in turn, has an associated object class label $C_o$ (such as \u201ccar\u201d or \u201cpedestrian\u201d). The final\ncomponent in our model is the horizon, which captures global geometry information. We assume\nthat the image was taken by a camera with horizontal axis parallel to the ground and model the\nhorizon $v^{\\mathrm{hz}} \\in [0, 1]$ as the normalized row in the image corresponding to its location. We quantize\n$v^{\\mathrm{hz}}$ into the same number of rows as the image.\n\nWe combine the variables in our model into a single coherent energy function that captures the\nstructure and semantics of the scene. The energy function includes terms for modeling the location\nof the horizon, region label preferences, region boundary quality, object labels, and contextual relationships\nbetween objects and regions. These terms are described in detail below. The combined\nenergy function $E(R, S, O, C, v^{\\mathrm{hz}} \\mid I, \\theta)$ has the form:\n\n$$E = \\psi^{\\mathrm{hz}}(v^{\\mathrm{hz}}) + \\sum_r \\psi^{\\mathrm{reg}}_r(S_r, v^{\\mathrm{hz}}) + \\sum_{r,s} \\psi^{\\mathrm{bdry}}_{rs} + \\sum_o \\psi^{\\mathrm{obj}}_o(C_o, v^{\\mathrm{hz}}) + \\sum_{o,r} \\psi^{\\mathrm{ctxt}}_{or}(C_o, S_r) \\qquad (1)$$\n\nwhere for notational clarity the subscripts on the factors indicate that they are functions of the pixels\n(appearance and shape) belonging to the regions, i.e., $\\psi^{\\mathrm{reg}}_r$ is also a function of $P_r$, etc. It is assumed\nthat all terms are conditioned on the observed image $I$ and model parameters $\\theta$. 
The summation\nover context terms includes all ordered pairs of adjacent objects and regions, while the summation\nover boundary terms is over unordered pairs of regions. An illustration of the variables in the energy\nfunction is shown in Figure 1.\n\nThe first three energy terms are adapted from the model of [7]. We briefly review them here:\n\nHorizon term. The $\\psi^{\\mathrm{hz}}$ term captures the a priori location of the horizon in the scene and, in our\nmodel, is implemented as a log-Gaussian $\\psi^{\\mathrm{hz}}(v^{\\mathrm{hz}}) = -\\log \\mathcal{N}(v^{\\mathrm{hz}}; \\mu, \\sigma^2)$ with parameters $\\mu$ and $\\sigma$\nlearned from labeled training images.\n\nKnowing the location of the horizon allows us to compute the world height of an object in the\nscene. Using the derivation from Hoiem et al. [12], it can be shown that the height $y_k$ of an object\n(or region) in the scene can be approximated as $y_k \\approx h \\frac{v_t - v_b}{v^{\\mathrm{hz}} - v_b}$, where $h$ is the height of the camera\norigin above the ground, and $v_t$ and $v_b$ are the rows of the top-most and bottom-most pixels in the\nobject/region, respectively. In our current work, we assume that all images were taken from the\nsame height above the ground, allowing us to use $\\frac{v_t - v_b}{v^{\\mathrm{hz}} - v_b}$ as a feature in our region and object terms.\n\nRegion term. The region term $\\psi^{\\mathrm{reg}}$ in our energy function captures the preference for a region\nto be assigned different semantic labels (currently sky, tree, road, grass, water, building, mountain,\nforeground). For convenience we include the $v^{\\mathrm{hz}}$ variable in this term to provide rough geometry\ninformation. 
If a region is associated with an object, then we constrain the assignment of its class\nlabel to foreground (e.g., a \u201csky\u201d region cannot be part of a \u201ccar\u201d object).\n\nProcedure SceneInference\n    Generate over-segmentation dictionary \u2126\n    Initialize R_p using any of the over-segmentations\n    Repeat until convergence\n        Phase 1:\n            Propose a pixel move {R_p : p \u2208 \u03c9} \u2190 r\n            Update region and boundary features\n            Run inference over regions S and v^hz\n        Phase 2:\n            Propose a pixel {R_p} \u2190 r or region move {O_r} \u2190 o\n            Update region, boundary and object features\n            Run inference over regions and objects (S, C) and v^hz\n        Compute total energy E\n        If (E < E_min) then\n            Accept move and set E_min = E\n        Else reject move\n\nFigure 1: Illustration of the entities in our model (left) and inference algorithm (right). See text for details.\n\nMore formally, let $N_r$ be the number of pixels in region $r$, i.e., $N_r = \\sum_p \\mathbf{1}\\{R_p = r\\}$, and let\n$\\phi_r : (P_r, v^{\\mathrm{hz}}, I) \\mapsto \\mathbb{R}^n$ denote the features for the $r$-th region. The region term is then\n\n$$\\psi^{\\mathrm{reg}}_r(S_r, v^{\\mathrm{hz}}) = \\begin{cases} \\infty & \\text{if } O_r \\neq \\emptyset \\text{ and } S_r \\neq \\text{foreground} \\\\\\\\ -\\eta^{\\mathrm{reg}} N_r \\log \\sigma(S_r \\mid \\phi_r; \\theta^{\\mathrm{reg}}) & \\text{otherwise} \\end{cases} \\qquad (2)$$\n\nwhere $\\sigma(\\cdot)$ is the multi-class logit $\\sigma(y \\mid x; \\theta) = \\frac{\\exp\\{\\theta_y^T x\\}}{\\sum_{y'} \\exp\\{\\theta_{y'}^T x\\}}$ and $\\eta^{\\mathrm{reg}}$ is the relative weight of the\nregion term versus the other terms in the model.\n\nBoundary term. The term $\\psi^{\\mathrm{bdry}}$ penalizes two adjacent regions with similar appearance or lack\nof boundary contrast. This helps to merge coherent pixels into a single region. We combine two\nmetrics in this term: the first captures region similarity as a whole, the second captures contrast along\nthe common boundary between the regions. 
Specifically, let $d(x, y; S) = \\sqrt{(x - y)^T S^{-1} (x - y)}$\ndenote the Mahalanobis distance between vectors $x$ and $y$, and $E_{rs}$ be the set of pixel pairs along the\nboundary. Then the boundary term is\n\n$$\\psi^{\\mathrm{bdry}}_{rs} = \\eta^{\\mathrm{bdry}}_A \\cdot |E_{rs}| \\cdot e^{-\\frac{1}{2} d(A_r, A_s; \\Sigma_A)^2} + \\eta^{\\mathrm{bdry}}_\\alpha \\sum_{(p,q) \\in E_{rs}} e^{-\\frac{1}{2} d(\\alpha_p, \\alpha_q; \\Sigma_\\alpha)^2} \\qquad (3)$$\n\nwhere $\\Sigma_A$ and $\\Sigma_\\alpha$ are the image-specific pixel appearance covariance matrices computed over all\npixels and neighboring pixels, respectively. In our experiments we restrict $\\Sigma_A$ to be diagonal and set\n$\\Sigma_\\alpha = \\beta I$ with $\\beta = \\mathbb{E}\\left[\\|\\alpha_p - \\alpha_q\\|^2\\right]$ as in Shotton et al. [16]. The parameters $\\eta^{\\mathrm{bdry}}_A$ and $\\eta^{\\mathrm{bdry}}_\\alpha$ encode\nthe trade-off between the region similarity and boundary contrast terms and weight them against the\nother terms in the energy function (Equation 1).\n\nNote that the boundary term does not include semantic class or object information. The term\npurely captures segmentation coherence in terms of appearance.\n\nObject term. Going beyond the model in [7], we include object terms $\\psi^{\\mathrm{obj}}$ in our energy function\nthat score the likelihood of a group of regions being assigned a given object label. We currently\nclassify objects as either car, pedestrian or unknown. The unknown class includes objects like trash\ncans, street signs, telegraph poles, traffic cones, bicycles, etc. Like the region term, the object term\nis defined by a logistic function that maps object features $\\phi_o : (P_o, v^{\\mathrm{hz}}, I) \\mapsto \\mathbb{R}^n$ to the probability of\neach object class. However, since our region layer already identifies foreground regions, we would\nlike our energy to improve only when we recognize known object classes. 
We therefore bias the\nobject term to give zero contribution to the energy for the class unknown.$^1$ Formally we have\n\n$$\\psi^{\\mathrm{obj}}_o(C_o, v^{\\mathrm{hz}}) = -\\eta^{\\mathrm{obj}} N_o \\left( \\log \\sigma(C_o \\mid \\phi_o; \\theta^{\\mathrm{obj}}) - \\log \\sigma(\\text{unknown} \\mid \\phi_o; \\theta^{\\mathrm{obj}}) \\right) \\qquad (4)$$\n\nwhere $N_o$ is the number of pixels belonging to the object.\n\n$^1$This results in the technical condition of allowing $O_r$ to take the value $\\emptyset$ for unknown foreground regions\nwithout affecting the energy.\n\nContext term. Intuitively, contextual information which relates objects to their local background\ncan improve object detection. For example, Heitz and Koller [9] showed that detection rates improve\nby relating \u201cthings\u201d (objects) to \u201cstuff\u201d (background). Our model has a very natural way of\nencoding such relationships through pairwise energy terms between objects $C_o$ and regions $S_r$. We\ndo not encode contextual relationships between region classes (i.e., $S_r$ and $S_s$) since these rarely\nhelp.$^2$ Contextual relationships between foreground objects (i.e., $C_o$ and $C_m$) may be beneficial\n(e.g., people found on bicycles), but are not considered in this work. Formally, the context term is\n\n$$\\psi^{\\mathrm{ctxt}}_{or}(C_o, S_r) = -\\eta^{\\mathrm{ctxt}} \\log \\sigma(C_o \\times S_r \\mid \\phi_{or}; \\theta^{\\mathrm{ctxt}}) \\qquad (5)$$\n\nwhere $\\phi_{or} : (P_o, P_r, I) \\mapsto \\mathbb{R}^n$ is a pairwise feature vector for object $o$ and region $r$, $\\sigma(\\cdot)$ is the\nmulti-class logit, and $\\eta^{\\mathrm{ctxt}}$ weights the strength of the context term relative to other terms in the\nenergy function. Since the pairwise context term is between objects and (background) regions, it\ngrows linearly with the number of object classes. This has a distinct advantage over approaches\nwhich include a pairwise term between all classes, resulting in quadratic growth.\n\n3.2 Object Detectors\n\nPerforming well at object detection requires more than simple region appearance features. 
Indeed,\nthe power of state-of-the-art object detectors is their ability to model localized appearance and gen-\neral shape characteristics of an object class. Thus, in addition to raw appearance features, we append\nto our object feature vector \u03c6o features derived from such object detection models. We discuss two\nmethods for adapting state-of-the-art object detector technologies for this purpose.\n\nIn the \ufb01rst approach, we treat the object detector as a black-box that returns a score per (rectan-\ngular) candidate window. However, recall that an object in our model is de\ufb01ned by a contiguous\nset of pixels Po, not a rectangular window. In the black-box approach, we naively place a bounding\nbox (at the correct aspect ratio) around these pixels and classify the entire contents of the box. To\nmake classi\ufb01cation more robust we search candidate windows in a small neighborhood (de\ufb01ned over\nscale and position) around this bounding box, and take as our feature the output of highest scoring\nwindow. In our experiments we test this approach using the HOG detector of Dalal and Triggs [4]\nwhich learns a linear SVM classi\ufb01er over feature vectors constructed by computing histograms of\ngradient orientations in \ufb01xed-size overlapping cells within the candidate window.\n\nNote that in the above black-box approach many of the pixels within the bounding box are not\nactually part of the object (consider, for example, an L-shaped region). A better approach is to mask\nout all pixels not belonging to the object. In our implementation, we use a soft mask that attenuates\nthe intensity of pixels outside the object based on their distance to the object boundary (see Figure 2).\nThis has the dual advantage of preventing hard edge artifacts and being less sensitive to segmentation\nerrors. The masked window is used at both training and test time. 
In our experiments we test this\nmore integrated approach using the patch-based features of Torralba et al. [19, 20]. Here features\nare extracted by matching small rectangular patches at various locations within the masked window\nand combining these weak responses using boosting. Object appearance and shape are captured by\noperating on both the original (intensity) image and the edge-filtered image.\n\nFor both approaches, we append the score (for each object) from the object detection classifiers\n(linear SVM or boosted decision trees) to the object feature vector $\\phi_o$.\n\nFigure 2: Illustration of soft mask for proposed object regions. Panels: (a) full window, (b) hard region mask,\n(c) hard window, (d) soft region mask, (e) soft window.\n\nAn important parameter for sliding-window detectors is the base scale at which features are extracted.\nScale-invariance is achieved by successively down-sampling the image. Below the base\nscale, feature matching becomes inaccurate, so most detectors will only find objects above some\nminimum size. Clearly there exists a trade-off between the desire to detect small objects, feature\nquality, and computational cost. To reduce the computational burden of running our model on\nhigh-resolution images while still being able to identify small objects, we employ a multi-scale approach.\nHere we run our scene decomposition algorithm on a low-resolution (320 \u00d7 240) version\nof the scene, but extract features from the original high-resolution version. That is, when we extract\nobject-detector features we map the object pixels $P_o$ onto the original image and extract our features\nat the higher resolution.\n\n$^2$The most informative region-to-region relationship is that sky tends to be above ground (road, grass, or\nwater). 
This information is already captured by including the horizon in our region term.\n\n4 Inference and Learning\n\nWe now describe how we perform inference and learn the parameters of our energy function.\n\n4.1 Inference\n\nWe use a modified version of the hill-climbing inference algorithm described in Gould et al. [7],\nwhich uses multiple over-segmentations to propose large moves in the energy space. An overview\nof this procedure is shown on the right of Figure 1. We initialize the scene by segmenting the\nimage using an off-the-shelf unsupervised segmentation algorithm (in our experiments we use mean-shift\n[3]). We then run inference using a two-phased approach.\n\nIn the first phase, we want to build up a good set of initial regions before trying to classify them as\nobjects. Thus we remove the object variables $O$ and $C$ from the model and artificially increase the\nboundary term weights ($\\eta^{\\mathrm{bdry}}_\\alpha$ and $\\eta^{\\mathrm{bdry}}_A$) to promote merging. In this phase, the algorithm behaves\nexactly as in [7], iteratively proposing re-assignments of pixels to regions (variables $R$) and recomputing\nthe optimal assignment to the remaining variables ($S$ and $v^{\\mathrm{hz}}$). If the overall energy for the\nnew configuration is lower, the move is accepted; otherwise the previous configuration is restored\nand the algorithm proposes a different move. The algorithm proceeds until no further reduction in\nenergy can be found after exhausting all proposal moves from a pre-defined set (see Section 4.2).\n\nIn the second phase, we anneal the boundary term weights and introduce object variables over\nall foreground regions. We then iteratively propose merges and splits of objects (variables $O$) as\nwell as high-level proposals (see Section 4.2 below) of new regions generated from sliding-window\nobject candidates (affecting both $R$ and $O$). After a move is proposed, we recompute the optimal\nassignment to the remaining variables ($S$, $C$ and $v^{\\mathrm{hz}}$). 
Again, this process repeats until the energy\ncannot be reduced by any of the proposal moves.\n\nSince only part of the scene is changing during any iteration we only need to recompute the\nfeatures and energy terms for the regions affected by a move. However, inference is still slow given\nthe sophisticated features that need to be computed and the large number of moves considered.\nTo improve running time, we leave the context terms \u03c8ctxt out of the model until the last iteration\nthrough the proposal moves. This allows us to maximize each region term independently during\neach proposal step\u2014we use an iterated conditional modes (ICM) update to optimize vhz after the\nregion labels have been inferred. After introducing the context term, we use max-product belief\npropagation to infer the optimal joint assignment to S and C. Using this approach we can process\nan image in under \ufb01ve minutes.\n\n4.2 Proposal Moves\nWe now describe the set of pixel and region proposal moves considered by our algorithm. These\nmoves are relative to the current best scene decomposition and are designed to take large steps in\nthe energy space to avoid local minima. As discussed above, each move is accepted if it results in a\nlower overall energy after inferring the optimal assignment for the remaining variables.\n\nThe main set of pixel moves are described in [7] but brie\ufb02y repeated here for completeness.\nThe most basic move is to merge two adjacent regions. More sophisticated moves involve local\nre-assignment of pixels to neighboring regions. These moves are proposed from a pre-computed\ndictionary of image segments \u2126. The dictionary is generated by varying the parameters of an un-\nsupervised over-segmentation algorithm (in our case mean-shift [3]) and adding each segment \u03c9 to\nthe dictionary. During inference, these segments are used to propose a re-assignment of all pixels\nin the segment to a neighboring region or creation of new region. 
These bottom-up proposal moves\nwork well for background classes, but tend to result in over-segmented foreground classes which\nhave heterogeneous appearance. For example, one would not expect the wheels and body of a car to\nbe grouped together by a bottom-up approach.\n\nAn analogous set of moves can be used for merging two adjacent objects or assigning regions\nto objects. However, if an object is decomposed into multiple regions, this bottom-up approach is\nproblematic as multiple such moves may be required to produce a complete object. When performed\nindependently, these moves are unlikely to improve the energy. We get around this difficulty by\nintroducing a new set of powerful top-down proposal moves based on object detection candidates.\nHere we use pre-computed candidates from a sliding-window detector to propose new foreground\nregions with a corresponding object variable. Instead of proposing the entire bounding-box from the\ndetector, we propose the set of intersecting segments (from our segmentation dictionary \u2126) that are\nfully contained within the bounding-box in a single move.\n\nEXPERIMENT              CARS   PED.\nPatch baseline          0.40   0.15\nHOG baseline            0.35   0.37\nPatch RB (w/o cntxt)    0.55   0.22\nPatch RB (full model)   0.56   0.21\nHOG RB (w/o cntxt)      0.58   0.35\nHOG RB (full model)     0.57   0.35\n\nFigure 3: PR curves for car (left) and pedestrian (right) detection on the Street Scene dataset [2]. The table\nshows 11-pt average precision for variants of the baseline sliding-window and our region-based (RB) approach.\n\n4.3 Learning\n\nWe learn the parameters of our model from labeled training data in a piecewise fashion. First, the\nindividual terms are learned using the maximum-likelihood objective for the subset of variables\nwithin each term. The relative weights ($\\eta^{\\mathrm{reg}}$, $\\eta^{\\mathrm{obj}}$, etc.) between the terms are learned through cross-validation\non a subset of the training data. 
Boosted pixel appearance features (see [7]) and object\ndetectors are learned separately and their output provided as input features to the combined model.\nFor both the base object detectors and the parameters of the region and object terms, we use a\nclosed-loop learning technique where we \ufb01rst learn an initial set of parameters from training data.\nWe then run inference on our training set and record mistakes made by the algorithm (false-positives\nfor object detection and incorrect moves for the full algorithm). We augment the training data with\nthese mistakes and re-train. This process gives a signi\ufb01cant improvement to the \ufb01nal results.\n5 Experiments\nWe conduct experiments on the challenging Street Scene dataset [2]. This is a dataset consisting of\n3547 high-resolution images of urban environments. We rescaled the images to 320 \u00d7 240 before\nrunning our algorithm. The dataset comes with hand-annotated region labels and object boundaries.\nHowever, the annotations use rough overlapping polygons, so we used Amazon\u2019s Mechanical Turk\nto improve the labeling of the background classes only. We kept the original object polygons to be\nconsistent with other results on this dataset.\n\nWe divided the dataset into \ufb01ve folds\u2014the \ufb01rst fold (710 images) was used for testing and the\nremaining four used for training. The multi-class image segmentation component of our model\nachieves an overall pixel-level accuracy of 84.2% across the eight semantic classes compared to\n83.0% for the pixel-based baseline method described in [7]. More interesting was our object detec-\ntion performance. The test set contained 1183 cars and 293 pedestrians with average size of 86 \u00d7 48\nand 22 \u00d7 49 pixels, respectively. 
Many objects are occluded, making this a very difficult dataset.\n\nSince our algorithm produces a MAP estimate for the scene, we cannot simply generate a precision-recall curve by varying the object classifier threshold, as is usual for reporting object detection results. Instead, we take the max-marginals for each C_n variable at convergence of our algorithm and sweep over thresholds for each object separately to generate a curve. An attractive aspect of this approach is that our method does not have overlapping candidates and hence does not require arbitrary post-processing such as non-maximal suppression of sliding-window detections.\n\nOur results are shown in Figure 3. We also include a comparison to two baseline sliding-window approaches. Our method significantly improves over the baselines for car detection. For pedestrian detection, our method shows comparable performance to the HOG baseline, which has been specifically engineered for this task. Notice that our method does not achieve 100% recall (even at low precision) because the curves are generated from the MAP assignment, in which pixels have already been grouped into regions. Unlike the baselines, this forces only one candidate object per region. However, by trading off the strength (and hence operating point) of the energy terms in our model we can increase the maximum recall for a given object class (e.g., by increasing the weight of the object term by a factor of 30 we were able to increase pedestrian recall from 0.556 to 0.673).\n\nRemoving the pairwise context term does not have a significant effect on our results. This is due to the encoding of semantic context through the region term and the fact that all images were of urban scenes. However, we believe that on a dataset with more varied backgrounds (e.g., rural scenes) context would play a more important role.\n\nWe show some example output from our algorithm in Figure 4. 
The first row shows the original image (left) together with annotated regions and objects (middle-left), regions (middle-right) and predicted horizon (right). Notice how multiple regions get grouped together into a single object. The remaining rows show a selection of results (image and annotated output) from our method.\n\nFigure 4: Qualitative results from our experiments. Top row shows original image, annotated regions and objects, region boundaries, and predicted horizon. Other examples show original image (left) and overlay colored by semantic class and detected objects (right).\n\n6 Discussion\n\nIn this paper we have presented a hierarchical model for joint object detection and image segmentation. Our novel approach overcomes many of the problems associated with trying to combine related vision tasks. Importantly, our method explains every pixel in the image and enforces consistency between random variables from different tasks. Furthermore, our model is encapsulated in a modular energy function which can be easily analyzed and improved as new computer vision technologies become available.\n\nOne of the difficulties in our model is learning the trade-off between energy terms: with too strong a boundary penalty all regions will be merged together, while with too weak a penalty the scene will be split into too many segments. We found that a closed-loop learning regime where mistakes from running inference on the training set are used to increase the diversity of training examples made a big difference to performance.\n\nOur work suggests a number of interesting directions for future work. First, our greedy inference procedure can be replaced with a more sophisticated approach that makes more global steps. More importantly, our region-based model has the potential for providing holistic unified understanding of an entire scene. 
This has the benefit of eliminating many of the implausible hypotheses that plague current computer vision algorithms. Furthermore, by clearly delineating what is recognized, our framework directly presents hypotheses for objects that are currently unknown, providing the potential for increasing our library of characterized objects using a combination of supervised and unsupervised techniques.\n\nAcknowledgments. This work was supported by the NSF under grant IIS 0917151, MURI contract N000140710747, and The Boeing Company. We thank Pawan Kumar and Ben Packer for helpful discussions.\n\nReferences\n\n[1] H.G. Barrow and J.M. Tenenbaum. Computational vision. IEEE, 1981.\n[2] S. Bileschi and L. Wolf. A unified system for object detection, texture recognition, and context analysis based on the standard model feature set. In BMVC, 2005.\n[3] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. PAMI, 2002.\n[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.\n[5] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid. Groups of adjacent contour segments for object detection. PAMI, 2008.\n[6] M. Fink and P. Perona. Mutual boosting for contextual inference. In NIPS, 2003.\n[7] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, 2009.\n[8] C. Gu, J. J. Lim, P. Arbelaez, and J. Malik. Recognition using regions. In CVPR, 2009.\n[9] G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In ECCV, 2008.\n[10] G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models: Combining models for holistic scene understanding. In NIPS, 2008.\n[11] D. Hoiem, A. A. Efros, and M. Hebert. Closing the loop on scene interpretation. In CVPR, 2008.\n[12] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 2008.\n[13] B. Leibe, A. 
Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In ECCV, 2004.\n[14] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: Label transfer via dense scene alignment. In CVPR, 2009.\n[15] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.\n[16] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV, 2006.\n[17] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing visual scenes using transformed objects and parts. In IJCV, 2007.\n[18] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision system for place and object recognition, 2003.\n[19] A. Torralba, K. Murphy, and W. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In CVPR, 2004.\n[20] A. Torralba, K. Murphy, and W. Freeman. Contextual models for object detection using boosted random fields. In NIPS, 2004.\n[21] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. In ICCV, 2003.\n[22] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 2004.\n[23] C. Wojek and B. Schiele. A dynamic conditional random field model for joint labeling of object and scene classes. In ECCV, 2008.\n", "award": [], "sourceid": 576, "authors": [{"given_name": "Stephen", "family_name": "Gould", "institution": null}, {"given_name": "Tianshi", "family_name": "Gao", "institution": null}, {"given_name": "Daphne", "family_name": "Koller", "institution": null}]}