{"title": "A Generative Model for Parts-based Object Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 100, "page_last": 107, "abstract": "The Shape Boltzmann Machine (SBM) has recently been introduced as a state-of-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object's parts. Our model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based image segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art.", "full_text": "A Generative Model\n\nfor Parts-based Object Segmentation\n\nS. M. Ali Eslami\n\nSchool of Informatics\nUniversity of Edinburgh\n\ns.m.eslami@sms.ed.ac.uk\n\nChristopher K. I. Williams\n\nSchool of Informatics\nUniversity of Edinburgh\nckiw@inf.ed.ac.uk\n\nAbstract\n\nThe Shape Boltzmann Machine (SBM) [1] has recently been introduced as a state-\nof-the-art model of foreground/background object shape. We extend the SBM to\naccount for the foreground object\u2019s parts. Our new model, the Multinomial SBM\n(MSBM), can capture both local and global statistics of part shapes accurately.\nWe combine the MSBM with an appearance model to form a fully generative\nmodel of images of objects. Parts-based object segmentations are obtained simply\nby performing probabilistic inference in the model. We apply the model to two\nchallenging datasets which exhibit signi\ufb01cant shape and appearance variability,\nand \ufb01nd that it obtains results that are comparable to the state-of-the-art.\n\nThere has been signi\ufb01cant focus in computer vision on object recognition and detection e.g. 
[2], but a strong desire remains to obtain richer descriptions of objects than just their bounding boxes. One such description is a parts-based object segmentation, in which an image is partitioned into multiple sets of pixels, each belonging to either a part of the object of interest, or its background.\n
The significance of parts in computer vision has been recognized since the earliest days of the field (e.g. [3, 4, 5]), and there exists a rich history of work on probabilistic models for parts-based segmentation e.g. [6, 7]. Many such models only consider local neighborhood statistics; however, several models have recently been proposed that aim to increase the accuracy of segmentations by also incorporating prior knowledge about the foreground object's shape [8, 9, 10, 11]. In such cases, probabilistic techniques often mainly differ in how accurately they represent and learn about the variability exhibited by the shapes of the object's parts.\n
Accurate models of the shapes and appearances of parts can be necessary to perform inference in datasets that exhibit large amounts of variability. In general, the stronger the models of these two components, the more performance is improved. A generative model has the added benefit of being able to generate samples, which allows us to visually inspect the quality of its understanding of the data and the problem.\n
Recently, a generative probabilistic model known as the Shape Boltzmann Machine (SBM) has been used to model binary object shapes [1]. 
The SBM has been shown to constitute the state-of-the-art and it possesses several highly desirable characteristics: samples from the model look realistic, and it generalizes to generate samples that differ from the limited number of examples it is trained on.\n
The main contributions of this paper are as follows: 1) In order to account for object parts we extend the SBM to use multinomial visible units instead of binary ones, resulting in the Multinomial Shape Boltzmann Machine (MSBM), and we demonstrate that the MSBM constitutes a strong model of parts-based object shape. 2) We combine the MSBM with an appearance model to form a fully generative model of images of objects (see Fig. 1). We show how parts-based object segmentations can be obtained simply by performing probabilistic inference in the model. We apply our model to two challenging datasets and find that in addition to being principled and fully generative, the model's performance is comparable to the state-of-the-art.\n
Figure 1: Overview. Using annotated images (training images and training labels), separate models of shape and appearance are trained. Given an unseen test image, its parsing is obtained via inference in the proposed joint model.\n
In Secs. 1 and 2 we present the model and propose efficient inference and learning schemes. In Sec. 3 we compare and contrast the resulting joint model with existing work in the literature. We describe our experimental results in Sec. 4 and conclude with a discussion in Sec. 5.\n
1 Model\n
We consider datasets of cropped images of an object class. We assume that the images are constructed through some combination of a fixed number of parts. Given a dataset D = {Xd}, d = 1...n of such images X, each consisting of P pixels {xi}, i = 1...P, we wish to infer a segmentation S for the image. 
S consists of a labeling si for every pixel, where si is a 1-of-(L+1) encoded variable, and L is the fixed number of parts that combine to generate the foreground. In other words, si = (sli), l = 0...L, with sli \in {0, 1} and \sum_l sli = 1. Note that the background is also treated as a 'part' (l = 0). Accurate inference of S is driven by models for 1) part shapes and 2) part appearances.\n
Part shapes: Several types of models can be used to define probability distributions over segmentations S. The simplest approach is to model each pixel si independently with categorical variables whose parameters are specified by the object's mean shape (Fig. 2(a)). Markov Random Fields (MRFs, Fig. 2(b)) additionally model interactions between nearby pixels using pairwise potential functions that efficiently capture local properties of images like smoothness and continuity.\n
Restricted Boltzmann Machines (RBMs) and their multi-layered counterparts, Deep Boltzmann Machines (DBMs, Fig. 2(c)), make heavy use of hidden variables to efficiently define higher-order potentials that take into account the configuration of larger groups of image pixels. The introduction of such hidden variables provides a way to efficiently capture complex, global properties of image pixels. RBMs and DBMs are powerful generative models, but they also have many parameters. Segmented images, however, are expensive to obtain and datasets are typically small (hundreds of examples). In order to learn a model that accurately captures the properties of part shapes we use DBMs but also impose carefully chosen connectivity and capacity constraints, following the structure of the Shape Boltzmann Machine (SBM) [1]. 
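As a concrete illustration of the 1-of-(L+1) encoding defined above, the following sketch (Python/NumPy rather than the paper's MATLAB; the function and array names are ours, not the authors') converts an integer label map into the indicator representation s_li:

```python
import numpy as np

def encode_segmentation(labels, L):
    """Convert an integer label map (0 = background, 1..L = foreground parts)
    into the 1-of-(L+1) indicator representation s, with s[i, l] = s_li."""
    P = labels.size
    s = np.zeros((P, L + 1), dtype=int)
    s[np.arange(P), labels.ravel()] = 1
    return s

labels = np.array([0, 2, 1, 2])        # toy 4-pixel "image" with L = 2 parts
s = encode_segmentation(labels, L=2)
# every pixel belongs to exactly one part: sum_l s_li = 1
assert (s.sum(axis=1) == 1).all()
```

Each row of s is one pixel's multinomial visible unit, which is exactly what the MSBM's energy function sums over.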
We further extend the model to account for multi-part shapes to obtain the Multinomial Shape Boltzmann Machine (MSBM).\n
The MSBM has two layers of latent variables: h1 and h2 (collectively H = {h1, h2}), and defines a Boltzmann distribution over segmentations p(S) = \sum_{h^1,h^2} \exp\{E(S, h^1, h^2 | \theta_s)\} / Z(\theta_s) where\n
E(S, h^1, h^2 | \theta_s) = \sum_{i,l} b_{li} s_{li} + \sum_{i,j,l} w^1_{lij} s_{li} h^1_j + \sum_j c^1_j h^1_j + \sum_{j,k} w^2_{jk} h^1_j h^2_k + \sum_k c^2_k h^2_k,   (1)\n
where j and k range over the first and second layer hidden variables, and \theta_s = \{W^1, W^2, b, c^1, c^2\} are the shape model parameters. In the first layer, local receptive fields are enforced by connecting each hidden unit in h1 only to a subset of the visible units, corresponding to one of four patches, as shown in Fig. 2(d,e). Each patch overlaps its neighbor by b pixels, which allows boundary continuity to be learned at the lowest layer. We share weights between the four sets of first-layer hidden units and patches, and purposely restrict the number of units in h2. These modifications significantly reduce the number of parameters whilst taking into account an important property of shapes, namely that the strongest dependencies between pixels are typically local.\n
Figure 2: Models of shape. Object shape is modeled with undirected graphical models. (a) 1D slice of a mean model. (b) Markov Random Field in 1D. (c) Deep Boltzmann Machine in 1D. (d) 1D slice of a Shape Boltzmann Machine. (e) Shape Boltzmann Machine in 2D. In all models latent units h are binary and visible units S are multinomial random variables. Based on Fig. 2 of [1].\n
Figure 3: A model of appearances. Left: An exemplar dataset. 
Here we assume one background (l = 0) and two foreground (l = 1, non-body; l = 2, body) parts. Right: The corresponding appearance model. In this example, L = 2, K = 3 and W = 6. Best viewed in color.\n
Part appearances: Pixels in a given image are assumed to have been generated by W fixed Gaussians in RGB space. During pre-training, the means \{\mu_w\} and covariances \{\Sigma_w\} of these Gaussians are extracted by training a mixture model with W components on every pixel in the dataset, ignoring image and part structure. It is also assumed that each of the L parts can have different appearances in different images, and that these appearances can be clustered into K classes. The classes differ in how likely they are to use each of the W components when 'coloring in' the part.\n
The generative process is as follows. For part l in an image, one of the K classes is chosen (represented by a 1-of-K indicator variable a_l). Given a_l, the probability distribution defined on pixels associated with part l is given by a Gaussian mixture model with means \{\mu_w\}, covariances \{\Sigma_w\} and mixing proportions \{\lambda_{lkw}\}. The prior on A = \{a_l\} specifies the probability \pi_{lk} of appearance class k being chosen for part l. Therefore the appearance parameters are \theta_a = \{\pi_{lk}, \lambda_{lkw}\} (see Fig. 3) and:\n
p(x_i | A, s_i, \theta_a) = \prod_l p(x_i | a_l, \theta_a)^{s_{li}} = \prod_l \left( \prod_k \left( \sum_w \lambda_{lkw} \mathcal{N}(x_i | \mu_w, \Sigma_w) \right)^{a_{lk}} \right)^{s_{li}},   (2)\n
p(A | \theta_a) = \prod_l p(a_l | \theta_a) = \prod_l \prod_k (\pi_{lk})^{a_{lk}}.   (3)\n
Combining shapes and appearances: To summarize, the latent variables for X are A, S, H, and the model's active parameters \theta include shape parameters \theta_s and appearance parameters \theta_a, so that\n
p(X, A, S, H | \theta) = \frac{1}{Z(\beta)} p(A | \theta_a) p(S, H | \theta_s) \prod_i p(x_i | A, s_i, \theta_a)^{\beta},   (4)\n
where the parameter \beta adjusts the relative contributions of the shape and appearance components. See Fig. 4 for an illustration of the complete graphical model. 
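To make the appearance component concrete, the following sketch evaluates the per-pixel part likelihood in the style of Eq. 2 for a single part, in Python/NumPy rather than the paper's MATLAB. All names (`pixel_log_lik`, `lam`, `mus`) and numbers are hypothetical toy values with scalar "RGB" pixels and two mixture components; they are illustrative only, not the paper's parameters:

```python
import numpy as np

def log_gauss(x, mu, var):
    # log N(x | mu, var) for scalar toy pixels (full RGB would use covariances)
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def pixel_log_lik(x, a_l, lam, mus, vars_):
    """log p(x | a_l) for one pixel assigned to part l: a Gaussian mixture
    over the W global components, with mixing weights lam[k] selected by
    the active appearance class k (cf. Eq. 2)."""
    k = np.argmax(a_l)                       # a_l is a 1-of-K indicator
    comp = lam[k] * np.exp([log_gauss(x, m, v) for m, v in zip(mus, vars_)])
    return np.log(comp.sum())

# toy global vocabulary: W = 2 components shared by all parts
mus, vars_ = np.array([0.0, 1.0]), np.array([0.1, 0.1])
lam = np.array([[0.9, 0.1], [0.2, 0.8]])     # K = 2 appearance classes
a_l = np.array([1, 0])                       # class k = 0 is active
ll = pixel_log_lik(0.05, a_l, lam, mus, vars_)
```

Under class k = 0 (which weights the component at 0.0 heavily), a pixel near 0 scores higher than a pixel near 1, which is the mechanism the appearance classes use to 'color in' a part.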
During learning, we find the values of \theta that maximize the likelihood of the training data D, and segmentation is performed on a previously-unseen image by querying the marginal distribution p(S | X_test, \theta). Note that Z(\beta) is constant throughout the execution of the algorithms. We set \beta via trial and error in our experiments.\n
Figure 4: A model of shape and appearance. Left: The joint model. Pixels x_i are modeled via appearance variables a_l. The model's belief about each layer's shape is captured by shape variables H. Segmentation variables s_i assign each pixel to a layer. Right: Schematic for an image X.\n
2 Inference and learning\n
Inference: We approximate p(A, S, H | X, \theta) by drawing samples of A, S and H using block-Gibbs Markov Chain Monte Carlo (MCMC). The desired distribution p(S | X, \theta) can then be obtained by considering only the samples for S (see Algorithm 1). In order to sample p(A | S, H, X, \theta) we consider the conditional distribution of appearance class k being chosen for part l, which is given by:\n
p(a_{lk} = 1 | S, X, \theta) = \frac{\pi_{lk} \prod_i \left( \sum_w \lambda_{lkw} \mathcal{N}(x_i | \mu_w, \Sigma_w) \right)^{\beta \cdot s_{li}}}{\sum_{r=1}^{K} \left[ \pi_{lr} \prod_i \left( \sum_w \lambda_{lrw} \mathcal{N}(x_i | \mu_w, \Sigma_w) \right)^{\beta \cdot s_{li}} \right]}.   (5)\n
Since the MSBM only has edges between each pair of adjacent layers, all hidden units within a layer are conditionally independent given the units in the other two layers. This property can be exploited to make inference in the shape model exact and efficient. The conditional probabilities are:\n
p(h^1_j = 1 | s, h^2, \theta) = \sigma\left( \sum_{i,l} w^1_{lij} s_{li} + \sum_k w^2_{jk} h^2_k + c^1_j \right),   (6)\n
p(h^2_k = 1 | h^1, \theta) = \sigma\left( \sum_j w^2_{jk} h^1_j + c^2_k \right),   (7)\n
where \sigma(y) = 1/(1 + \exp(-y)) is the sigmoid function. To sample from p(H | S, X, \theta) we iterate between Eqns. 6 and 7 multiple times and keep only the final values of h1 and h2. Finally, we draw samples for the pixels in p(S | A, H, X, \theta) independently:\n
p(s_{li} = 1 | A, H, X, \theta) = \frac{\exp\left( \sum_j w^1_{lij} h^1_j + b_{li} \right) p(x_i | A, s_{li} = 1, \theta)^{\beta}}{\sum_{m=1}^{L} \exp\left( \sum_j w^1_{mij} h^1_j + b_{mi} \right) p(x_i | A, s_{mi} = 1, \theta)^{\beta}}.   (8)\n
Seeding: Since the latent space is extremely high-dimensional, in practice we find it helpful to run several inference chains, each initializing S(1) to a different value. The 'best' inference is retained and the others are discarded. The computation of the likelihood p(X | \theta) of image X is intractable, so we approximate the quality of each inference using a scoring function:\n
Score(X | \theta) = \frac{1}{T} \sum_t p(X, A^{(t)}, S^{(t)}, H^{(t)} | \theta),   (9)\n
where \{A^{(t)}, S^{(t)}, H^{(t)}\}, t = 1...T are the samples obtained from the posterior p(A, S, H | X, \theta). If the samples were drawn from the prior p(A, S, H | \theta) the scoring function would be an unbiased estimator of p(X | \theta), but would be wildly inaccurate due to the high probability of missing the important regions of latent space (see e.g. [12, p. 107-109] for further discussion of this issue).\n
Learning: Learning of the model involves maximizing the log likelihood log p(D | \theta_a, \theta_s) of the training dataset D with respect to the model parameters \theta_a and \theta_s. Since training is partially supervised, in that for each image X its corresponding segmentation S is also given, we can learn the parameters of the shape and appearance components separately.\n
For appearances, the learning of the mixing coefficients and the histogram parameters decomposes into standard mixture updates independently for each part. 
For shapes, we follow the standard deep learning literature closely [13, 1].\n
Algorithm 1 MCMC inference algorithm.\n
1: procedure INFER(X, \theta)\n
2:   Initialize S(1), H(1)\n
3:   for t = 2 : chain length do\n
4:     A(t) ~ p(A | S(t-1), H(t-1), X, \theta)\n
5:     S(t) ~ p(S | A(t), H(t-1), X, \theta)\n
6:     H(t) ~ p(H | S(t), \theta)\n
7:   return {S(t)}, t = burn-in : chain length\n
In the pre-training phase we greedily train the model bottom up, one layer at a time. We begin by training an RBM on the observed data using stochastic maximum likelihood learning (SML; also referred to as 'persistent CD'; [14, 13]). Once this RBM is trained, we infer the conditional mean of the hidden units for each training image. The resulting vectors then serve as the training data for a second RBM which is again trained using SML. We use the parameters of these two RBMs to initialize the parameters of the full MSBM model. In the second phase we perform approximate stochastic gradient ascent in the likelihood of the full model to fine-tune the parameters in an EM-like scheme as described in [13].\n
3 Related work\n
Existing probabilistic models of images can be categorized by the amount of variability they expect to encounter in the data and by how they model this variability. A significant portion of the literature models images using only two parts: a foreground object and its background e.g. [15, 16, 17, 18, 19]. Models that account for the parts within the foreground object mainly differ in how accurately they learn about and represent the variability of the shapes of the object's parts.\n
In Probabilistic Index Maps (PIMs) [8] a mean partitioning is learned, and the deformable PIM [9] additionally allows for local deformations of this mean partitioning. 
Stel Component Analysis [10] accounts for larger amounts of shape variability by learning a number of different template means for the object that are blended together on a pixel-by-pixel basis. Factored Shapes and Appearances [11] models global properties of shape using a factor analysis-like model, and 'masked' RBMs have been used to model more local properties of shape [20]. However, none of these models constitute a strong model of shape in terms of realism of samples and generalization capabilities [1]. We demonstrate in Sec. 4 that, like the SBM, the MSBM does in fact possess these properties.\n
The closest works to ours in terms of ability to deal with datasets that exhibit significant variability in both shape and appearance are the works of Bo and Fowlkes [21] and Thomas et al. [22]. Bo and Fowlkes [21] present an algorithm for pedestrian segmentation that models the shapes of the parts using several template means. The different parts are composed using hand-coded geometric constraints, which means that the model cannot be automatically extended to other application domains. The Implicit Shape Model (ISM) used in [22] is reliant on interest point detectors and defines distributions over segmentations only in the posterior, and therefore is not fully generative. The model presented here is entirely learned from data and fully generative, therefore it can be applied to new datasets and diagnosed with relative ease. Due to its modular structure, we also expect it to rapidly absorb future developments in shape and appearance models.\n
4 Experiments\n
Penn-Fudan pedestrians: The first dataset that we considered is Penn-Fudan pedestrians [23], consisting of 169 images of pedestrians (Fig. 6(a)). The images are annotated with ground-truth segmentations for L = 7 different parts (hair, face, upper and lower clothes, shoes, legs, arms; Fig. 6(d)). 
We compare the performance of the model with the algorithm of Bo and Fowlkes [21]. For the shape component, we trained an MSBM on the 684 images of a labeled version of the HumanEva dataset [24] (at 48 × 24 pixels; also flipped horizontally) with overlap b = 4, and 400 and 50 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs (iterations). After pre-training, joint training was performed for 1000 epochs.\n
Figure 5: Learned shape model. (a) A chain of samples (1000 samples between frames). The apparent 'blurriness' of samples is not due to averaging or resizing. We display the probability of each pixel belonging to different parts. If, for example, there is a 50-50 chance that a pixel belongs to the red or blue parts, we display that pixel in purple. (b) Differences between the samples and their most similar counterparts in the training dataset. (c) Completion of occlusions (pink).\n
To assess the realism and generalization characteristics of the learned MSBM we sample from it. In Fig. 5(a) we show a chain of unconstrained samples from an MSBM generated via block-Gibbs MCMC (1000 samples between frames). The model captures highly non-linear correlations in the data whilst preserving the object's details (e.g. face and arms). To demonstrate that the model has not simply memorized the training data, in Fig. 5(b) we show the difference between the sampled shapes in Fig. 5(a) and their closest images in the training set (based on per-pixel label agreement). We see that the model generalizes in non-trivial ways to generate realistic shapes that it had not encountered during training. In Fig. 5(c) we show how the MSBM completes rectangular occlusions. The samples highlight the variability in possible completions captured by the model. 
Note how, e.g., the length of the person's trousers on one leg affects the model's predictions for the other, demonstrating the model's knowledge about long-range dependencies. An interactive MATLAB GUI for sampling from this MSBM has been included in the supplementary material.\n
The Penn-Fudan dataset (at 200 × 100 pixels) was then split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train the appearance component with a vocabulary of size W = 50 and K = 100 mixture components1. We additionally constrained the model by sharing the appearance models for the arms and legs with that of the face.\n
We assess the quality of the appearance model by performing the following experiment: for each test image, we used the scoring function described in Eq. 9 to evaluate a number of different proposal segmentations for that image. We considered 10 randomly chosen segmentations from the training dataset as well as the ground-truth segmentation for the test image, and found that the appearance model correctly assigns the highest score to the ground-truth 95% of the time.\n
During inference, the shape and appearance models (which are defined on images of different sizes) were combined at 200 × 100 pixels via MATLAB's imresize function, and we set the trade-off parameter to 0.8 (Eq. 8) via trial and error. Inference chains were seeded at 100 exemplar segmentations from the HumanEva dataset (obtained using the K-medoids algorithm with K = 100), and were run for 20 Gibbs iterations each (with 5 iterations of Eqs. 6 and 7 per Gibbs iteration). Our unoptimized 
Our unoptimized\nMATLAB implementation completed inference for each chain in around 7 seconds.\nWe compute the conditional probability of each pixel belonging to different parts given the last set\nof samples obtained from the highest scoring chain, assign each pixel independently to the most\nlikely part at that pixel, and report the percentage of correctly labeled pixels (see Table 1). We \ufb01nd\nthat accuracy can be improved using superpixels (SP) computed on X (pixels within a superpixel\nare all assigned the most common label within it; as with [21] we use gPb-OWT-UCM [25]). We\nalso report the accuracy obtained, had the top scoring seed segmentation been returned as the \ufb01nal\nsegmentation for each image. Here the quality of the seed is determined solely by the appearance\nmodel. We observe that the model has comparable performance to the state-of-the-art but pedestrian-\nspeci\ufb01c algorithm of [21], and that inference in the model signi\ufb01cantly improves the accuracy of the\nsegmentations over the baseline (top seed+SP). Qualitative results can be seen in Fig. 6(c).\n\n1We obtained the best quantitative results with these settings. The appearances exhibited by the parts in the\n\ndataset are highly varied, and the complexity of the appearance model re\ufb02ects this fact.\n\n6\n\n\fTable 1: Penn-Fudan pedestrians. We report the percentage of correctly labeled pixels. The \ufb01nal\ncolumn is an average of the background, upper and lower body scores (as reported in [21]).\nAverage\n69.5%\n65.3%\n66.6%\n53.5%\n56.4%\n\nUpper Body Lower Body Head\n51.8%\n53.0%\n54.1%\n45.5%\n43.5%\n\nBo and Fowlkes [21]\nMSBM\nMSBM + SP\nTop seed\nTop seed + SP\n\nFG\nBG\n73.3% 81.1%\n70.7% 72.8%\n71.6% 73.8%\n59.0% 61.8%\n61.6% 67.3%\n\n73.6%\n68.6%\n69.9%\n56.8%\n60.8%\n\n71.6%\n66.7%\n68.5%\n49.8%\n54.1%\n\nTable 2: ETHZ cars. We report the percentage of pixels belonging to each part that are labeled\ncorrectly. 
The final column is an average weighted by the frequency of occurrence of each label.\n
            BG     Body   Wheel  Window  Bumper  License  Light  Average\n
ISM [22]    93.2%  72.2%  63.6%  80.5%   73.8%   56.2%    34.8%  86.8%\n
MSBM        94.6%  72.7%  36.8%  74.4%   64.9%   17.9%    19.9%  86.0%\n
Top seed    92.2%  68.4%  28.3%  63.8%   45.4%   11.2%    15.1%  81.8%\n
ETHZ cars: The second dataset that we considered is the ETHZ labeled cars dataset [22], which itself is a subset of the LabelMe dataset [23], consisting of 139 images of cars, all in the same semi-profile view (Fig. 7(a)). The images are annotated with ground-truth segmentations for L = 6 parts (body, wheel, window, bumper, license plate, headlight; Fig. 7(d)). We compare the performance of the model with the ISM of Thomas et al. [22], who also report their results on this dataset.\n
The dataset was split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train both the shape and appearance components. For the shape component, we trained an MSBM at 50 × 50 pixels with overlap b = 4, and 2000 and 100 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs and joint training was performed for 1000 epochs. The appearance model was trained with a vocabulary of size W = 50 and K = 100 mixture components, and we set the trade-off parameter to 0.7. Inference chains were seeded at 50 exemplar segmentations (obtained using K-medoids). We find that the use of superpixels does not help with this dataset (due to the poor quality of superpixels obtained for these images).\n
Qualitative and quantitative results that show the performance of the model to be comparable to the state-of-the-art ISM can be seen in Fig. 7(c) and Table 2. 
We believe the discrepancy in accuracy between the MSBM and ISM on the 'license' and 'light' labels to mainly be due to ISM's use of interest-points, as they are able to locate such fine structures accurately. By incorporating better models of part appearance into the generative model, we expect to see this discrepancy decrease.\n
5 Conclusions and future work\n
In this paper we have shown how the SBM can be extended to obtain the MSBM, and presented a principled probabilistic model of images of objects that exploits the MSBM as its model for part shapes. We demonstrated how object segmentations can be obtained simply by performing MCMC inference in the model. The model can also be treated as a probabilistic evaluator of segmentations: given a proposal segmentation it can be used to estimate its likelihood. This leads us to believe that the combination of a generative model such as ours, with a discriminative, bottom-up segmentation algorithm could be highly effective. We are currently investigating how textured appearance models, which take into account the spatial structure of pixels, affect the learning and inference algorithms and the performance of the model.\n
Acknowledgments\n
Thanks to Charless Fowlkes and Vittorio Ferrari for access to datasets, and to Pushmeet Kohli and John Winn for valuable discussions. AE has received funding from the Carnegie Trust, the SORSAS scheme, and the IST Programme under the PASCAL2 Network of Excellence (IST-2007-216886).\n
Background  Hair  Face  Upper  Shoes  Legs  Lower  Arms\n
Figure 6: Penn-Fudan pedestrians. (a) Test images. (b) Results reported by Bo and Fowlkes [21]. (c) Output of the joint model. (d) Ground-truth images. 
Images shown are those selected by [21].\n
Background  Body  Wheel  Window  Bumper  License  Headlight\n
Figure 7: ETHZ cars. (a) Test images. (b) Results reported by Thomas et al. [22]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [22].\n
References\n
[1] S. M. Ali Eslami, Nicolas Heess, and John Winn. The Shape Boltzmann Machine: a Strong Model of Object Shape. In IEEE CVPR, 2012.\n
[2] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88:303–338, 2010.\n
[3] Martin Fischler and Robert Elschlager. The Representation and Matching of Pictorial Structures. IEEE Transactions on Computers, 22(1):67–92, 1973.\n
[4] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, 1982.\n
[5] Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115–147, 1987.\n
[6] Ashish Kapoor and John Winn. Located Hidden Random Fields: Learning Discriminative Parts for Object Detection. In ECCV, pages 302–315, 2006.\n
[7] John Winn and Jamie Shotton. The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects. In IEEE CVPR, pages 37–44, 2006.\n
[8] Nebojsa Jojic and Yaron Caspi. Capturing Image Structure with Probabilistic Index Maps. In IEEE CVPR, pages 212–219, 2004.\n
[9] John Winn and Nebojsa Jojic. LOCUS: Learning object classes with unsupervised segmentation. In ICCV, pages 756–763, 2005.\n
[10] Nebojsa Jojic, Alessandro Perina, Marco Cristani, Vittorio Murino, and Brendan Frey. Stel component analysis. 
In IEEE CVPR, pages 2044–2051, 2009.\n
[11] S. M. Ali Eslami and Christopher K. I. Williams. Factored Shapes and Appearances for Parts-based Object Understanding. In BMVC, pages 18.1–18.12, 2011.\n
[12] Nicolas Heess. Learning generative models of mid-level structure in natural images. PhD thesis, University of Edinburgh, 2011.\n
[13] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann Machines. In AISTATS, volume 5, pages 448–455, 2009.\n
[14] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, pages 1064–1071, 2008.\n
[15] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. “GrabCut”: interactive foreground extraction using iterated graph cuts. ACM SIGGRAPH, 23:309–314, 2004.\n
[16] Eran Borenstein, Eitan Sharon, and Shimon Ullman. Combining Top-Down and Bottom-Up Segmentation. In CVPR Workshop on Perceptual Organization in Computer Vision, 2004.\n
[17] Himanshu Arora, Nicolas Loeff, David Forsyth, and Narendra Ahuja. Unsupervised Segmentation of Objects using Efficient Learning. IEEE CVPR, pages 1–7, 2007.\n
[18] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. ClassCut for unsupervised class segmentation. In ECCV, pages 380–393, 2010.\n
[19] Nicolas Heess, Nicolas Le Roux, and John Winn. Weakly Supervised Learning of Foreground-Background Segmentation using Masked RBMs. In ICANN, 2011.\n
[20] Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John Winn. Learning a Generative Model of Images by Factoring Appearance and Shape. Neural Computation, 23(3):593–650, 2011.\n
[21] Yihang Bo and Charless Fowlkes. Shape-based Pedestrian Parsing. In IEEE CVPR, 2011.\n
[22] Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, and Luc Van Gool. Using Recognition and Annotation to Guide a Robot's Attention. 
IJRR, 28(8):976–998, 2009.\n
[23] Bryan Russell, Antonio Torralba, Kevin Murphy, and William Freeman. LabelMe: A Database and Tool for Image Annotation. International Journal of Computer Vision, 77:157–173, 2008.\n
[24] Leonid Sigal, Alexandru Balan, and Michael Black. HumanEva. International Journal of Computer Vision, 87(1-2):4–27, 2010.\n
[25] Pablo Arbelaez, Michael Maire, Charless C. Fowlkes, and Jitendra Malik. From Contours to Regions: An Empirical Evaluation. In IEEE CVPR, 2009.\n", "award": [], "sourceid": 57, "authors": [{"given_name": "S.", "family_name": "Eslami", "institution": null}, {"given_name": "Christopher", "family_name": "Williams", "institution": null}]}