{"title": "Probabilistic Joint Image Segmentation and Labeling", "book": "Advances in Neural Information Processing Systems", "page_first": 1827, "page_last": 1835, "abstract": "We present a joint image segmentation and labeling model (JSL) which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales, constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag, followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point estimation procedure. The partition function over tilings and labelings is increasingly more accurately approximated by including incorrect configurations that a not-yet-competent model rates probable during learning. 
We show that the proposed methodology matches the current state of the art in the Stanford dataset, as well as in VOC2010, where 41.7% accuracy on the test set is achieved.", "full_text": "Probabilistic Joint Image Segmentation and Labeling\u2217\n\nAdrian Ion1,2, Joao Carreira1, Cristian Sminchisescu1\n\n1Faculty of Mathematics and Natural Sciences, University of Bonn\n\n2 PRIP, Vienna University of Technology & Institute of Science and Technology, Austria\n\n{ion,carreira,cristian.sminchisescu}@ins.uni-bonn.de\n\nAbstract\n\nWe present a joint image segmentation and labeling model (JSL) which, given a\nbag of \ufb01gure-ground segment hypotheses extracted at multiple image locations\nand scales, constructs a joint probability distribution over both the compatible\nimage interpretations (tilings or image segmentations) composed from those seg-\nments, and over their labeling into categories. The process of drawing samples\nfrom the joint distribution can be interpreted as \ufb01rst sampling tilings, modeled\nas maximal cliques, from a graph connecting spatially non-overlapping segments\nin the bag [1], followed by sampling labels for those segments, conditioned on\nthe choice of a particular tiling. We learn the segmentation and labeling parame-\nters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point\nestimation procedure. The partition function over tilings and labelings is increas-\ningly more accurately approximated by including incorrect con\ufb01gurations that a\nnot-yet-competent model rates probable during learning. 
We show that the pro-\nposed methodology matches the current state of the art in the Stanford dataset [2],\nas well as in VOC2010, where 41.7% accuracy on the test set is achieved.\n\n1 Introduction\n\nOne of the main goals of scene understanding is the semantic segmentation of images: label a di-\nverse set of object properties, at multiple scales, while at the same time identifying the spatial extent\nover which such properties hold. For instance, an image may be segmented into things (man-made\nobjects, people or animals), amorphous regions or stuff like grass or sky, or main geometric prop-\nerties like the ground plane or the vertical planes corresponding to buildings in the scene. The\noptimal identi\ufb01cation of such properties requires inference over spatial supports of different levels\nof granularity, and such regions may often overlap. It appears to be now well understood that a suc-\ncessful extraction of such properties requires models that can make inferences over adaptive spatial\nneighborhoods that span well beyond patches around individual pixels. Incorporating segmentation\ninformation to inform the labeling process has recently become an increasingly active research area.\nWhile initially inferences were restricted to super-pixel segmentations, recent trends emphasize joint\nmodels with capabilities to represent the uncertainty in the segmentation process [2, 4, 5, 6, 7]. One\ndif\ufb01culty is the selection of segments that have adequate spatial support for reliable labeling, and\na second major dif\ufb01culty is the design of models where both the segmentation and the labeling\nlayers can be learned jointly. 
In this paper, we present a joint image segmentation and labeling model (JSL) which, given a bag of possibly overlapping figure-ground (binary) segment hypotheses, extracted independently at multiple image locations and scales, constructs a joint probability distribution over both the compatible image interpretations (or tilings) assembled from those segments, and over their labels. For learning, we present a procedure based on Maximum Likelihood, where the partition function over tilings and labelings is increasingly more accurately approximated in each iteration, by including incorrect configurations that the model rates probable. This prevents\n\n\u2217Supported, in part, by the EC, under MCEXT-025481, and by CNCSIS-UEFISCU, PNII-RU-RC-2/2009.\n\nFigure 1: Overview of our joint segment composition and categorization framework. Given an image I, we extract a bag S of figure-ground segmentations, constrained at different spatial locations and scales, using the CPMC algorithm [3] and retain the figure segments (other algorithms can be used for segment bagging). Segments are composed into image interpretations (tilings) by FGTiling [1]. In brief, segments become nodes in a consistency graph where any two segments that do not spatially overlap are connected by an edge. Valid compositions (tilings) are obtained by computing maximal cliques in the consistency graph. Multiple tilings are usually generated for each image. Tilings consist of subsets of segments in S, and may induce residual regions that contain pixels not belonging to any of the segments selected in a particular tiling. 
For labeling (JSL), configurations are scored based on both their category-dependent properties, measured by F^l_\u03b1, and their mid-level category-independent properties, measured by F^t_\u03b2 over the dependency graph\u2014a subset of the consistency graph connecting only spatially neighboring segments that share a boundary. The model parameters \u03b8 = [\u03b1^\u22a4 \u03b2^\u22a4]^\u22a4 are jointly learned using Maximum Likelihood based on a novel incremental Saddle Point partition function approximation. Notice that a segment appearing in different tilings of an image I is constrained to have the same label (red vertical edges).\n\ncyclic behavior and leads to a stable optimization process. The method jointly learns both the mid-level, category-independent parameters of a segment composition model, and the category-sensitive parameters of a labeling model for those segments. To our knowledge this is the first model for joint image segmentation and labeling that accommodates both inference and learning within a common, consistent probabilistic framework. We show that our procedure matches the state of the art in the Stanford [2], as well as the VOC2010 dataset, where 41.7% accuracy on the test set is achieved. Our framework is reviewed in fig. 1.\n\n1.1 Related Work\n\nOne approach to recognize the elements of an image would be to accurately partition it into regions based on low and mid-level statistical regularities, and then label those regions, as pursued by Barnard et al. [8]. The labeling problem can then be reduced to a relatively small number of classification problems. However, most existing mid-level segmentation algorithms cannot generate one unique, yet accurate segmentation per image, across multiple images, for the same set of generic parameters [9, 10]. 
To achieve the best recognition, some tasks might require multiple overlapping\nspatial supports which can only be provided by different segmentations.\n\nSegmenting object parts or regions can be done at a \ufb01ner granularity, with labels decided locally,\nat the level of pixels [11, 12, 13] or superpixels [14, 15], based on measurements collected over\nneighborhoods with limited spatial support. Inconsistent label con\ufb01gurations can be resolved by\nsmoothing neighboring responses, or by encouraging consistency among the labels belonging to re-\ngions with similar low-level properties [16, 13]. The models are effective when local appearance\nstatistics are discriminative, as in the case of amorphous stuff (water, grass), but inference is harder\nto constrain for shape recognition, which requires longer-range interactions among groups of mea-\nsurements. One way to introduce constraints is by estimating the categories likely to occur in the\nimage using global classi\ufb01ers, then bias inference to that label distribution [12, 13, 15].\n\n2\n\n\fA complementary research trend is to segment and recognize categories based on features extracted\nover competing image regions with larger spatial support (extended regions). The extended regions\ncan be rectangles produced by bounding box detectors [17, 2]. The responses are combined in a\nsingle pixel or superpixel layer [7, 18, 17, 6] to obtain the \ufb01nal labeling. Extended regions can also\narise from multiple full-image segmentations [7, 18, 6]. By computing segmentations multiple times\nwith different parameters, chances increase that some of the segments are accurate. Multiple seg-\nmentations can also be aggregated in an inclusion hierarchy [19, 5], instead of being obtained inde-\npendently. The work of Tu et al. [20] uses generative models to drive the sequential re-segmentation\nprocess, formulated as Data Driven Markov Chain Monte Carlo inference. 
Recently, Gould et al. [2] proposed a model for segmentation and labeling where new region hypotheses were generated through a sequential procedure, where uniform label swaps for all the pixels contained inside individual segment proposals are accepted if they reduce the value of a global energy function. Kumar and Koller [4] proposed an improved joint inference using dual-decomposition. Our approach for segmentation and labeling is layered rather than simultaneous, and learning for the segmentation and labeling parameters is performed jointly (rather than separately), in a probabilistic framework.\n\n2 Probabilistic Segmentation and Labeling\n\nLet S = {s_1, s_2, . . . } be a set (bag) of segments from an image I. In our case, the segments s_i are obtained using the publicly available CPMC algorithm [3], and represent different figure-ground hypotheses, computed independently by applying constraints at various spatial locations and scales in the image.1 Subsets of segments in the bag S form the power set P(S), with 2^|S| possible elements. We focus on a restriction of the power set of an image, its tiling set T(I), with the property that all segments contained in any subset (or tiling) do not spatially overlap and the subset is maximal: T(I) = {t = {. . . s_i, . . . s_j, . . . } \u2208 P(S), s.t. \u2200i, j, overlap(s_i, s_j) = 0}. Each tiling t in T(I) can have its segments labeled with one of L possible category labels. We call a labeling the mapping obtained by assigning labels to segments in a tiling, l(t) = {l_1, . . . , l_{|t|}}, with l_i \u2208 {1, . . . , L} the label of segment s_i, and |l(t)| = |t| (one label corresponds to one segment).2 Let L(I) be the set of all possible labelings for image I, with\n\n|L(I)| = \u2211_{t \u2208 T(I)} L^{|t|}\n\n(1)\n\nwhere we sum over all valid segment compositions (tilings) of an image, T(I), and the label space of each. 
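As a concrete toy check of the count in (1), the sum can be evaluated directly once the tilings are known; the segment ids, tilings, and label count below are made up for illustration and are not part of the model:

```python
# Toy illustration of eq. (1): |L(I)| = sum over tilings t of L^|t|.
# Each tiling is a set of non-overlapping segment ids (hypothetical here).

def num_labelings(tilings, num_labels):
    """Count all (tiling, labeling) configurations: sum_t L^|t|."""
    return sum(num_labels ** len(t) for t in tilings)

# Three hypothetical tilings over a bag of 5 segments, L = 4 labels.
tilings = [{0, 1, 2}, {0, 3}, {1, 4}]
print(num_labelings(tilings, 4))  # 4^3 + 4^2 + 4^2 = 96
```

The exponential growth of this count in |t| and L is exactly why the partition function in (2) needs the approximations discussed in sec. 4.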
We define a joint probability distribution over tilings and their corresponding labelings,\n\np_\u03b8(l(t), t, I) = (1 / Z_\u03b8(I)) exp F_\u03b8(l(t), t, I)\n\n(2)\n\nwhere Z_\u03b8(I) = \u2211_t \u2211_{l(t)} exp F_\u03b8(l(t), t, I) is the normalizer or partition function, l(t) \u2208 L(I), t \u2208 T(I), and \u03b8 the parameters of the model. It is a constrained probability distribution defined over two sets: a set of segments in a tiling and an index set of labels for those segments, both of the same cardinality. F_\u03b8 is defined as\n\nF_\u03b8(l(t), t, I) = F^l_\u03b1(l(t), I) + F^t_\u03b2(t, I)\n\n(3)\n\nwith parameters \u03b8 = [\u03b1^\u22a4 \u03b2^\u22a4]^\u22a4. The additive decomposition can be viewed as the sum of one term, F^t_\u03b2(t, I), encoding a mid-level, category-independent score of a particular tiling t, and another category-dependent score, F^l_\u03b1(l(t), I), encoding the potential of a labeling l(t) for that tiling t. The components F^l_\u03b1(l(t), I) and F^t_\u03b2(t, I) are defined as interactions over unary and pairwise terms. The potential of a labeling is\n\nF^l_\u03b1(l(t), I) = \u2211_{s_i \u2208 t} \u03a6^l_{l_i}(s_i, \u03b1) + \u2211_{s_i \u2208 t} \u2211_{s_j \u2208 N^l_{s_i}} \u03a8^l_{l_i,l_j}(s_i, s_j, \u03b1)\n\n(4)\n\nwith \u03a6^l_{l_i} and \u03a8^l_{l_i,l_j} unary and pairwise, label-dependent potentials, and N^l_{s_i} the label-relevant neighborhood of s_i. In our experiments we take N^l_{s_i} = t \\ {s_i}. The unary and pairwise terms are linear in the parameters, e.g. \u03a6^l_{l_i}(s_i, \u03b1) = \u03b1^\u22a4\u03a6^l_{l_i}(s_i). For example \u03a6^l_{l_i}(s_i, \u03b1) encodes how likely it is for segment s_i to exhibit the regularities typical of objects belonging to class l_i.\n\n1Some of the figure-ground segments in S(I) can spatially overlap.\n2We call a segmentation assembled from non-overlapping figure-ground segments a tiling, and the tiling together with the set of corresponding labels for its segments a labeling (rather than a labeled tiling).\n\nThe potential of a tiling is defined as\n\nF^t_\u03b2(t, I) = \u2211_{s_i \u2208 t} \u03a6^t(s_i, \u03b2) + \u2211_{s_i \u2208 t} \u2211_{s_j \u2208 N^t_{s_i}} \u03a8^t(s_i, s_j, \u03b2)\n\n(5)\n\nwith \u03a6^t and \u03a8^t unary and pairwise, label-independent potential functions, and N^t_{s_i} the local image neighborhood, i.e. N^t_{s_i} = {s_j \u2208 t | s_i, s_j share a boundary part and do not overlap}. Both terms \u03a6^t and \u03a8^t are linear in the parameters, similar to the components of the category-dependent potential F^l_\u03b1(l(t), I). For example \u03a6^t(s_i, \u03b2) encodes how likely it is that segment s_i exhibits generic object regularities (details on the segmentation model F^t_\u03b2(t, I) can be found in [1]).\n\nInference: Given an image I, inference for the optimal tiling and labeling (l\u2217(t\u2217), t\u2217) is given by\n\n(l\u2217(t\u2217), t\u2217) = argmax_{l(t),t} p_\u03b8(l(t), t, I)\n\n(6)\n\nOur inference methodology is described in sec. 3.\n\nLearning: During learning we optimize the parameters \u03b8 that maximize the likelihood (ML) of the ground truth under our model:\n\n\u03b8\u22c6 = argmax_\u03b8 \u220f_I p_\u03b8(l_I(t_I), t_I, I) = argmax_\u03b8 \u2211_I [F_\u03b8(l_I(t_I), t_I, I) \u2212 log Z_\u03b8(I)]\n\n(7)\n\nwhere (l_I(t_I), t_I) are ground truth labeled tilings for image I. Our learning methodology, including an incremental saddle point approximation for the partition function, is presented in sec. 
4.\n\n3 Inference for Tilings and Labelings\n\nGiven an image where a bag S of multiple figure-ground segments has been extracted using CPMC [3], inference is performed by first composing a number of plausible tilings from subsets of the segments, then labeling each tiling using spatial inference methods.\n\nThe inference algorithm for computing (sampling) tilings associates each segment with a node in a consistency graph, where an edge exists between all pairs of nodes corresponding to segments that do not spatially overlap. The cliques of the consistency graph correspond to alternative segmentations of the image constructed from the basic segments. The algorithm described in [1] can efficiently find a number of plausible maximal weighted cliques, scored by (5). A maximum of |S| distinct maximal cliques (tilings) are returned, and each segment s_i is contained in at least one of them. Inference for the labels of the segments in each tiling can be performed using any number of reliable methods\u2014in this work we use tree-reweighted belief propagation, TRW-S [21]. The maximum in (6) is computed by selecting the labeling with the highest probability (2) among the tilings generated by the segmentation algorithm.\n\nGiven a set of N = |S| figure-ground segments, the total complexity for inference is O(Nd^3 + NT + N), where O(Nd^3) steps are required to sample up to N tilings [1], with d = max_{s_i \u2208 S} |N^t_{s_i}|, NT is the complexity for inference with TRW-S (with complexity, say, T) for each computed tiling, and N steps are done to select the highest scoring labeling. 
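As an illustration of the consistency-graph construction and clique-based tilings, a plain, unweighted Bron-Kerbosch enumeration is sketched below; this is not the weighted clique algorithm of [1], and the overlap matrix is hypothetical:

```python
from itertools import combinations

def consistency_graph(overlap):
    """Build adjacency: connect every pair of segments with zero spatial overlap."""
    n = len(overlap)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if overlap[i][j] == 0:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques (candidate tilings) with plain Bron-Kerbosch."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    expand(set(), set(adj), set())
    return cliques

# Hypothetical 4-segment bag: segments 0 and 1 overlap each other, as do 2 and 3.
ov = [[0, 1, 0, 0],
      [1, 0, 0, 0],
      [0, 0, 0, 1],
      [0, 0, 1, 0]]
tilings = maximal_cliques(consistency_graph(ov))  # four tilings of two segments each
```

In the actual model each clique would additionally be scored by (5), and only a bounded number of high-scoring maximal cliques would be retained.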
For |S| = 200 the joint inference over\nlabelings and tilings takes under 10 seconds per image in our implementation and produces a set of\nplausible segmentation and labeling hypotheses which are also useful for learning, described next.\n\n4 Incremental Saddle Point Learning\n\nFundamental to maximum likelihood learning is a tractable, yet stable and suf\ufb01ciently accurate esti-\nmate of the partition function in (7). The number of terms in Z\u03b8(I) is |L(I)| (1), and is exponential\nboth in the number of \ufb01gure-ground segments and in the number of labels. As reviewed in sec. 3,\nwe approximate the tilings distribution of an image by a number of con\ufb01gurations bounded above\nby the number of \ufb01gure-ground segments. This replaces one exponential set of terms in the partition\nfunction in (2) (the sum over tilings) with a set of size at most |S|.\n\n4\n\n\fIn turn, each tiling can be labeled in exponentially many ways\u2014the second sum in the partition\nfunction in (2), running over all labelings of a tiling. One possibility to deal with this exponential\nsum for models with loopy dependencies would be Pseudo-Marginal Approximation (PMA) which\nestimates Z\u03b8(I) using loopy BP and computes gradients as expectations from estimated marginals.\nKumar et al. [22] found this approximation to perform best for learning conditional random \ufb01elds\nfor pixel labeling. However it requires inference over all tilings at every optimization iteration.\nWith 484 iterations required for convergence on the VOC dataset, this strategy took in our case 140\ntimes longer than the learning strategy based on incremental saddle-point approximations presented\n(below), which requires 1.3 hours for learning. Run for the same time, the PMA did not produce\nsatisfactory results in our model (sec. 
5).\n\nAnother possibility would be to approximate the exponential sum over labels with its largest term, obtained at the most probable configuration (the saddle-point approximation). However, this approach tends to behave erratically as a result of flips within the MAP configurations used to approximate the partition function (sec. 5).\n\nTo ensure stability and learning accuracy, we use an incremental saddle point approximation to the partition function. This is obtained by accumulating new incorrect (\u2018offending\u2019) labelings rated as the most probable by our current model, in each learning iteration (L_j(I) denotes the set over which the partition function for image I is computed in learning iteration j):\n\nL_{j+1}(I) = L_j(I) \u222a {(\u02c6l, t)} with (\u02c6l, t) = argmax_{l(t),t} F_\u03b8(l(t), t, I)\n\n(8)\n\nand \u02c6l \u2260 l_I, with l_I the ground truth labeling for image I. We set L_0(I) = \u2205. The configurations in L_j are also used to compute the (analytic) gradient, and we use quasi-Newton methods to optimize (7). As learning progresses, new labelings are added to the partition function estimate and it becomes more accurate.\n\nOur learning procedure stops either when (1) all label configurations have been incrementally generated, in which case the exact value of the partition function and unbiased estimates for the parameters are obtained, or (2) when a subset of the configuration space has been considered in the partition function approximation and no new \u2018offending\u2019 configurations outside this set have been generated during the previous learning (and inference) iteration. In this case a biased estimate is obtained. This is to some extent inevitable for learning models with loopy dependencies and exponential state spaces. In practice, for all datasets we worked on, the learning algorithm converged in 10-25 iterations. In experiments (sec. 
5), we show that learning is significantly more stable over standard saddle-point approximations.\n\n5 Experiments\n\nWe evaluate the quality of semantic segmentation produced by our models on two different datasets: the Stanford Background Dataset [2], and the VOC2010 Pascal Segmentation Challenge [23].\n\nThe Stanford Background Dataset contains 715 images and comprises two domains of annotation: semantic classes and geometric classes. The task is to label each pixel in every image with both types of properties. The dataset also contains mid-level segmentation annotations for individual objects, which we use to initially learn the parameters of the segmentation model (see sec. 3 and [1]). Evaluation in this dataset is performed using cross-validation over 5 folds, as in [2]. The evaluation criterion is the mean pixel (labeling) accuracy.\n\nThe VOC2010 dataset is accepted as currently one of the most challenging object-class segmentation benchmarks. This dataset also has annotation for individual objects, which we use to learn mid-level segmentation parameters (\u03b2). Unlike Stanford, where all pixels are annotated, on VOC only objects from the 20 classes have ground truth labels. The evaluation criterion is the VOC score: the average per-class overlap between pixels labeled in each class and the respective ground truth annotation3.\n\nQuality of segments and tilings: We generate a bag of figure-ground segments for each image using the publicly available CPMC code [3]. CPMC is an algorithm that generates a large pool (or bag) of figure-ground segmentations, scores them using mid-level properties, and returns the top k ranked. The online version contains pre-trained models on VOC, but these tend to discard background regions, since VOC has none. For the Stanford experiments, we retrain the CPMC segment ranker using Stanford\u2019s segment layout annotations. We generated segment bags having up to 200 segments on the Stanford dataset, and up to 100 segments on the VOC dataset. We model and sample tilings using the methodology described in [1] (see also (5) and sec. 3).\n\n3The overlap measure of two segments is O(s, s_g) = |s \u2229 s_g| / |s \u222a s_g| [23].\n\nTable 1: Left: Study of maximum achievable labeling accuracy for our tiling set, for Stanford and VOC2010.\n\nStanford Geometry: Max. pixel accuracy 93.3\nStanford Semantics: Max. pixel accuracy 85.6\nVOC2010 Object Classes: Max. VOC score 77.9\n\nThe study uses our tiling closest to the segmentation ground truth and assigns \u2018perfect\u2019 pixel labels to it based on that ground truth. In contrast, the best labeling accuracy we obtain automatically is 88.8 for Stanford Geometry, 75.6 for Stanford Semantic, and 41.7 for VOC2010. This shows that potential bottlenecks in reaching the maximum values have to do more with training (ranking) and labeling than with the spatial segment layouts and the tiling configurations produced. The average number of segments per tiling is 6.6 on Stanford and 7.9 on VOC. Right: Mean pixel accuracies on the Stanford Labeling Dataset.\n\nMethod | Semantic | Geometry\nJSL | 75.6 | 88.8\nGould et al. [2] | 76.4 | 91.0\n\nWe obtain results comparable to the state-of-the-art in a challenging full-image labeling problem. The results are significant, considering that we use tilings (image segmentations) made on average of 6.6 segments per image. The same method is also competitive in object segmentation datasets such as VOC2010, where the object granularity is much higher and regions with large spatial support are decisive for effective recognition (table 2).\n\nTable 1 (left) gives labeling performance upper-bounds on the two datasets for the figure-ground segments and tilings produced. 
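The per-class overlap underlying the VOC score (footnote 3) is the usual intersection-over-union of two pixel sets; a minimal sketch, where segment masks are represented as hypothetical sets of pixel indices:

```python
def overlap(pred_pixels, gt_pixels):
    """O(s, s_g) = |s intersect s_g| / |s union s_g| over sets of pixel indices."""
    union = len(pred_pixels | gt_pixels)
    return len(pred_pixels & gt_pixels) / union if union else 0.0

# Hypothetical 1-D "images": pixel index sets predicted / annotated as one class.
pred = {0, 1, 2, 3}
gt = {2, 3, 4, 5}
print(round(overlap(pred, gt), 3))  # 2/6 = 0.333
```

The VOC score then averages this quantity per class over the dataset.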
It can be seen that the upper bounds are high for both problems, hence\nthe quality of segments and tilings do not currently limit the \ufb01nal labeling performance, compared\nto the current state-of-the-art. For further detail on the \ufb01gure-ground segment pool quality (CPMC)\nand their assembly into complete image interpretations (FGtiling), we refer to [3, 1].\nLabeling performance: The tiling component of our model (5) has 41 unary and 31 pairwise\nparameters (\u03b2) in VOC2010, and 40 unary and 74 parameters (\u03b2) in Stanford. Detail for these\nfeatures is given in [1]. We will discuss only the features used by the labeling component of the\nmodel (4) in this section.\n\nIn both VOC2010 and Stanford we use two meta-features for the unary, category-dependent terms.\nOne type of meta-feature is produced as the output of regressors trained (on speci\ufb01c image features\ndescribed next) to predict overlap of input segments to putative categories. There is one such meta-\nfeature (1 regressor) for each category. A second type of meta-feature is obtained from an object\ndetector [24] to which a particular segment is presented. These detectors operate on bounding boxes,\nso we determine segment class scores as those of the bounding box overlapping most with the\nbounding box enclosing each segment.\n\nSince the target semantic concepts of the Stanford and VOC2010 datasets are widely different, we\nuse label-dependent unary terms based on different features. In both cases we use pairwise features\nconnecting all segments (N l\ns encodes full connectivity), among those belonging to a same tiling. As\npairwise features for \u03a8l we use simply a square matrix with all values set to 1, as in [5]. In this way,\nthe model can learn to avoid unlikely patterns of label co-occurrence.\n\nOn the Stanford Background Dataset, we train two types of unary meta-features for each class, for\nsemantic and geometric classes. 
The first unary meta-feature is the output of a regressor trained with the publicly available features from Hoiem et al. [7], and the second one uses the features of Gould et al. [25]. Each of the feature vectors is transformed using a randomized feature map that approximates the Gaussian-RBF kernel [26, 27]. Using this methodology we can work with linear models in the randomized feature map, yet exploit non-linear kernel embeddings. Summarizing, for Stanford geometry we have 12 parameters \u03b1 (9 unary parameters from 3 classes, each with 2 meta-features and bias, and 3 pairwise parameters), whereas for Stanford semantic labels we have 52 parameters \u03b1 (24 unary from 8 classes, each with 2 meta-features and bias, and 28 pairwise, the upper triangle of an 8x8 matrix).\n\nFigure 2: (Best viewed in color) Semantic segmentation results of our method on images from the VOC2010 test set: the first three images show cases where the algorithm performs satisfactorily, whereas the last three are examples where the algorithm works less well. Notice that identifying multiple objects from the same class is possible in this framework.\n\nIn the Stanford dataset, background regions such as grass and sky are shapeless and often locally discriminative. In such cases methods relying on pixel-level descriptors usually obtain good results (e.g. see the baseline in [2]). In turn, outdoor datasets containing stuff are challenging for a method like ours that relies on segmentations (tilings) which have an average of 6.6 segments per image (table 1, left). The results we obtain are comparable to Gould et al. [2], as visible in table 1, right. 
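The randomized feature map mentioned above can be sketched with random Fourier features in the style of [26]; the dimensionalities, bandwidth, and test vectors below are illustrative, not the ones used in the paper:

```python
import math
import random

def rff_map(x, omegas, phases):
    """Random Fourier feature map z(x); E[z(x).z(y)] approximates
    the Gaussian-RBF kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    d = len(omegas)
    return [math.sqrt(2.0 / d) * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
            for w, b in zip(omegas, phases)]

random.seed(0)
sigma, dim, n_feat = 1.0, 3, 2000
# Draw omega ~ N(0, (1/sigma^2) I) coordinate-wise and phase ~ U[0, 2*pi].
omegas = [[random.gauss(0.0, 1.0 / sigma) for _ in range(dim)] for _ in range(n_feat)]
phases = [random.uniform(0.0, 2.0 * math.pi) for _ in range(n_feat)]

x, y = [0.1, 0.2, 0.3], [0.2, 0.1, 0.3]
zx, zy = rff_map(x, omegas, phases), rff_map(y, omegas, phases)
approx = sum(a * b for a, b in zip(zx, zy))  # plain linear inner product
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))
```

A linear model trained on z(x) then behaves approximately like a kernelized model with the Gaussian-RBF kernel, which is the point of the construction.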
The\nevaluation criterion is the same for both methods: the mean pixel accuracy.\n\nOn the VOC2010 dataset, performance is evaluated using the VOC score, the average of per-class\noverlap between pixels labeled in each class and the respective ground truth class. We used two\ndifferent unary meta-features as well. The \ufb01rst is the output of SVM regressors trained as in [28] us-\ning their publicly available features [3]. These regressors predict class scores directly on segments,\nbased on several features: bag of words of gray-level SIFT [29] and color SIFT [30] de\ufb01ned on\nthe foreground and background of each individual segment, and three pyramid HOGs with different\nparameters. Multiple chi-square kernels K(x, y) = exp(\u2212\u03b3\u03c72(x, y)) are combined as in [28]. As a\nsecond unary meta-feature we use the outputs of deformable part model detectors [24]. Summariz-\ning, we have 63 category-dependent unary parameters, \u03b1 (21 classes, each having 2 meta-features\nand bias), and 210 category-dependent pairwise parameters \u03b1 (upper triangle of 21x21 matrix). The\nresults, which match and slightly improve the recent winners in the 2010 VOC challenge, are re-\nported in table 2. In particular, our method produces the highest VOC score average over all classes,\nand also scores \ufb01rst on 9 individual classes. The images in \ufb01g. 2 show that our algorithm produces\ncorrect labelings. Notice that often the boundaries produced by tilings align with the boundaries of\nindividual objects, even when there are multiple such nearby objects from the same class.\nImpact of different segmentation and labeling methods: We also evaluate the inference method\nof [4] (using the code provided by the authors), on the VOC 2010 dataset, and the same input seg-\nments and potentials as for JSL. The inference time of the C++ implementation of [4] is comparable\nwith our MATLAB implementations of FGtiling and JSL. 
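The exponentiated chi-square kernel used for the segment regressors above can be sketched as follows; the additive chi-square distance convention, the histograms, and gamma are illustrative assumptions, not the exact setup of [28]:

```python
import math

def chi2_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * chi2(x, y)) for non-negative histograms,
    using the common convention chi2(x, y) = sum_i (x_i - y_i)^2 / (x_i + y_i)."""
    chi2 = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * chi2)

# Two toy L1-normalized bag-of-words histograms (made up for illustration).
h1 = [0.5, 0.3, 0.2]
h2 = [0.4, 0.4, 0.2]
```

In practice several such kernels, one per feature channel, are combined before training the support vector regressors.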
The score obtained by [4] on our model is 31.89%, 2.8% higher than the score obtained by the authors using piece-wise training and a different pool of segments [23], but 9.8% lower than the score of JSL. This suggests that a layered strategy based on selecting a compact set of representative segmentations, followed by labeling, is more accurate than sequentially searching for segments and their labels.\n\nTable 2: Per-class results and averages obtained by our method (JSL) as well as top-scoring methods in the VOC2010 segmentation challenge (CHD: CVC-HARMONY-DET [15], BSS: BONN-SVR-SEGM [28]).\n\nClass | JSL | CHD | BSS\nBackground | 83.4 | 81.1 | 84.2\nAeroplane | 51.6 | 58.3 | 52.5\nBicycle | 25.1 | 23.1 | 27.4\nBird | 52.4 | 39.0 | 32.3\nBoat | 35.6 | 37.8 | 34.5\nBottle | 49.6 | 36.4 | 47.4\nBus | 66.7 | 63.2 | 60.6\nCar | 55.6 | 62.4 | 54.8\nCat | 44.6 | 31.9 | 42.6\nChair | 9.1 | 10.6 | 9.0\nCow | 41.2 | 36.8 | 32.9\nDiningTable | 29.9 | 24.6 | 25.2\nDog | 25.5 | 29.4 | 27.1\nHorse | 49.8 | 37.5 | 32.4\nMotorbike | 47.9 | 60.6 | 47.1\nPerson | 37.2 | 44.9 | 38.3\nPottedPlant | 19.3 | 30.1 | 36.8\nSheep | 45.0 | 36.8 | 50.3\nSofa | 24.4 | 19.4 | 21.9\nTrain | 37.2 | 44.1 | 35.2\nTv/Monitor | 43.3 | 35.9 | 40.9\nAverage | 41.7 | 40.1 | 39.7\n\nCompared to other VOC2010 participants, the proposed method obtains better scores in 9 out of 21 classes, and has a superior class average, the standard measure used for ranking. Results for other methods can be found in [23]. Note that both JSL (the meta-features) and CHD are trained with the additional bounding box data and images from the training set for object detection. Using this additional training data, the class average obtained by BSS is 43.8 [28].\n\nFigure 3: Left: The negative log(Z) at the end of each iteration, for standard (non-incremental) and incremental saddle-point approximations to the partition function. Without the stable and more accurate incremental saddle-point approximation to the partition function, the algorithm cannot successfully learn. Results are obtained by training on VOC2010\u2019s \u2018trainval\u2019 (train+validation) dataset. Center: VOC2010 labeling score as a function of the learning iteration (training on VOC2010\u2019s \u2018trainval\u2019). Right: Number of new labeling configurations added to the partition function expansion as learning proceeds for VOC2010. Most configurations are added in the first few iterations.\n\nIn practice, the proposed JSL framework does not depend on FGtiling/CPMC to provide segmentations. Instead, we can use any segmentation method. We have tested the JSL framework (learning and inference) on the Stanford dataset, using segmentations produced by the Ultrametric Contour Map (UCM) hierarchical segmentation method [9]. 
To obtain a number of segments similar to CPMC (200 per image), we selected only the segmentation levels above 20. The features and parameters were computed exactly as before. The bag of segments for each image was derived from the UCM segmentations, and the segmentations were taken as tiling configurations for the corresponding image. In this case, the scores are 76.8 and 88.2 for the semantic and geometric classes, respectively, showing the robustness of JSL to different input segmentations (see also Table 1, right).

Learning performance: In all our learning experiments, the model parameters were initialized to the null vector before learning proceeds, except for the α corresponding to the unary terms in F^l_α, which were set to one. Figure 3, left and center, compares learning with and without the incremental saddle point approximation to the partition function on the VOC 2010 dataset. Without accumulating labelings incrementally, the learning algorithm exhibits erratic behavior and overfits: the relatively small number of labelings used to estimate the partition function produces very different results between consecutive iterations. Figure 3, right, shows the number of total and new labelings added at each learning iteration.

Learning the parameters on VOC 2010 using PMA took 180 hours and produced a VOC score of 41.3%. Stopping the learning with PMA after 2 hours (slightly above the 1.3 hours required by the incremental saddle point approximation) results in a VOC score of 3.87%.

6 Conclusion

We have presented a joint image segmentation and labeling model (JSL) which, given a bag of figure-ground image segment hypotheses, constructs a joint probability distribution over both the compatible image interpretations assembled from those segments, and over their labeling.
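The incremental approximation of the partition function can be illustrated with a small toy sketch. This is our simplification, not the paper's implementation: we assume a linear scoring model over hand-made feature vectors, and at each iteration add the incorrect configuration the current model rates most probable to the support set over which Z is estimated. All names (`score`, `learn`, `support`) are hypothetical.

```python
import numpy as np

# Toy sketch (an assumption, not the paper's code) of maximum-likelihood
# learning with an incrementally grown support set for the partition function:
# each iteration adds the incorrect configuration the current model rates
# most probable, then takes a gradient step with Z restricted to that set.

def score(theta, phi):
    """Linear score of a configuration with feature vector phi."""
    return float(phi @ theta)

def learn(phi_gt, candidates, iters=50, lr=0.2):
    """phi_gt: features of the ground-truth configuration.
    candidates: feature vectors of competing configurations (in the real
    model, tilings x labelings, which are never enumerated exhaustively)."""
    theta = np.zeros_like(phi_gt)
    support = [phi_gt]  # configurations used to approximate Z
    incorrect = [p for p in candidates if not np.array_equal(p, phi_gt)]
    for _ in range(iters):
        # add the incorrect configuration the current model rates most probable
        best = max(incorrect, key=lambda p: score(theta, p))
        if not any(np.array_equal(best, p) for p in support):
            support.append(best)
        # gradient of the log-likelihood, with Z restricted to `support`
        S = np.stack(support)
        s = S @ theta
        w = np.exp(s - s.max())
        w /= w.sum()
        theta = theta + lr * (phi_gt - w @ S)  # empirical minus expected features
    return theta, len(support)
```

Under this restricted expectation, the gradient vanishes only when the model puts nearly all probability mass on the ground truth, and the support set stops growing once it contains every configuration the model ever rates probable, mirroring the behavior reported for Figure 3, right, where most configurations are added in the first few iterations.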
The process can be interpreted as first sampling maximal cliques from a graph connecting all segments that do not spatially overlap, followed by sampling labels for those segments, conditioned on the choice of their particular tiling. We propose a joint learning procedure based on Maximum Likelihood, where the partition function over tilings and labelings is approximated increasingly more accurately during training, by including incorrect configurations that the model rates probable. This ensures that mistakes are not carried forward uncorrected into future training iterations, and produces stable and accurate learning schedules. We show that models can be learned efficiently and match the state of the art on the Stanford dataset, as well as on VOC2010, where 41.7% accuracy on the test set is achieved.

References

[1] A. Ion, J. Carreira, and C. Sminchisescu. Image segmentation by figure-ground composition into maximal cliques. In ICCV, November 2011.
[2] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, September 2009.
[3] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, June 2010.
[4] M. P. Kumar and D. Koller. Efficiently selecting regions for scene understanding. In CVPR, 2010.
[5] S. Nowozin, P. V. Gehler, and C. H. Lampert. On parameter learning in CRF-based approaches to object class image segmentation. In ECCV, 2010.
[6] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.
[7] D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 75(1), 2007.
[8] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. Jordan. Matching words and pictures. JMLR, 3:1107-1135, March 2003.
[9] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. From contours to regions: An empirical evaluation. In CVPR, pages 2294-2301, June 2009.
[10] T. Malisiewicz and A. Efros. Improving spatial support for objects via multiple segmentations. In BMVC, 2007.
[11] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81:2-23, 2009.
[12] X. He, R. S. Zemel, and M. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In CVPR, 2004.
[13] G. Csurka and F. Perronnin. An efficient approach to semantic segmentation. IJCV, pages 1-15, 2010.
[14] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In ICCV, 2009.
[15] J. M. Gonfaus, X. Boix, J. van de Weijer, A. D. Bagdanov, J. Serrat, and J. Gonzalez. Harmony potentials for joint classification and segmentation. In CVPR, 2010.
[16] P. Kohli, L. Ladicky, and P. H. S. Torr. Robust higher order potentials for enforcing label consistency. In CVPR, 2008.
[17] L. Ladicky, P. Sturgess, K. Alahari, C. Russell, and P. H. S. Torr. What, where & how many? Combining object detectors and CRFs. In ECCV, September 2010.
[18] C. Pantofaru, C. Schmid, and M. Hebert. Object recognition by integrating multiple image segmentations. In ECCV, 2008.
[19] J. J. Lim, P. Arbelaez, C. Gu, and J. Malik. Context by region ancestry. In ICCV, 2009.
[20] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. In ICCV, 2003.
[21] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. PAMI, 28(10):1568-1583, 2006.
[22] S. Kumar, J. August, and M. Hebert. Exploiting inference for approximate parameter learning in discriminative fields: An empirical study. In EMMCVPR, 2005.
[23] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/.
[24] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9):1627-1645, 2010.
[25] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-class segmentation with relative location prior. IJCV, 80(3):300-316, 2008.
[26] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, December 2007.
[27] F. Li, C. Ionescu, and C. Sminchisescu. Random Fourier approximations for skewed multiplicative histogram kernels. In DAGM, September 2010.
[28] F. Li, J. Carreira, and C. Sminchisescu. Object recognition by sequential figure-ground ranking. IJCV, 2012.
[29] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[30] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. PAMI, 32(9):1582-1596, 2010.