{"title": "Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1585, "page_last": 1592, "abstract": "We develop a statistical framework for the simultaneous, unsupervised segmentation and discovery of visual object categories from image databases. Examining a large set of manually segmented scenes, we use chi--square tests to show that object frequencies and segment sizes both follow power law distributions, which are well modeled by the Pitman--Yor (PY) process. This nonparametric prior distribution leads to learning algorithms which discover an unknown set of objects, and segmentation methods which automatically adapt their resolution to each image. Generalizing previous applications of PY processes, we use Gaussian processes to discover spatially contiguous segments which respect image boundaries. Using a novel family of variational approximations, our approach produces segmentations which compare favorably to state--of--the--art methods, while simultaneously discovering categories shared among natural scenes.", "full_text": "Shared Segmentation of Natural Scenes\nUsing Dependent Pitman-Yor Processes\n\nErik B. Sudderth and Michael I. Jordan\n\nElectrical Engineering & Computer Science, University of California, Berkeley\n\nsudderth@cs.berkeley.edu, jordan@cs.berkeley.edu\n\nAbstract\n\nWe develop a statistical framework for the simultaneous, unsupervised segmenta-\ntion and discovery of visual object categories from image databases. Examining\na large set of manually segmented scenes, we show that object frequencies and\nsegment sizes both follow power law distributions, which are well modeled by the\nPitman\u2013Yor (PY) process. This nonparametric prior distribution leads to learning\nalgorithms which discover an unknown set of objects, and segmentation methods\nwhich automatically adapt their resolution to each image. 
Generalizing previ-\nous applications of PY processes, we use Gaussian processes to discover spatially\ncontiguous segments which respect image boundaries. Using a novel family of\nvariational approximations, our approach produces segmentations which compare\nfavorably to state-of-the-art methods, while simultaneously discovering categories\nshared among natural scenes.\n\n1 Introduction\n\nImages of natural environments contain a rich diversity of spatial structure at both coarse and \ufb01ne\nscales. We would like to build systems which can automatically discover the visual categories\n(e.g., foliage, mountains, buildings, oceans) which compose such scenes. Because the \u201cobjects\u201d\nof interest lack rigid forms, they are poorly suited to traditional, \ufb01xed aspect detectors. In simple\ncases, topic models can be used to cluster local textural elements, coarsely representing categories\nvia a bag of visual features [1, 2]. However, spatial structure plays a crucial role in general scene\ninterpretation [3], particularly when few labeled training examples are available.\n\nOne approach to modeling additional spatial dependence begins by precomputing one, or several,\nsegmentations of each input image [4\u20136]. However, low-level grouping cues are often ambiguous,\nand \ufb01xed partitions may improperly split or merge objects. Markov random \ufb01elds (MRFs) have\nbeen used to segment images into one of several known object classes [7, 8], but these approaches\nrequire manual segmentations to train category-speci\ufb01c appearance models. In this paper, we instead\ndevelop a statistical framework for the unsupervised discovery and segmentation of visual object\ncategories. We approach this problem by considering sets of images depicting related natural scenes\n(see Fig. 1(a)). Using color and texture cues, our method simultaneously groups dense features\ninto spatially coherent segments, and re\ufb01nes these partitions using shared appearance models. 
This extends the cosegmentation framework [9], which matches two views of a single object instance, to simultaneously segment multiple object categories across a large image database. Some recent work has pursued similar goals [6, 10], but robust object discovery remains an open challenge.\n\nOur models are based on the Pitman\u2013Yor (PY) process [11], a nonparametric Bayesian prior on infinite partitions. This generalization of the Dirichlet process (DP) leads to heavier-tailed, power law distributions for the frequencies of observed objects or topics. Using a large database of manual scene segmentations, Sec. 2 demonstrates that PY priors closely match the true distributions of natural segment sizes, and frequencies with which object categories are observed. Generalizing the hierarchical DP [12], Sec. 3 then describes a hierarchical Pitman\u2013Yor (HPY) mixture model which shares \u201cbag of features\u201d appearance models among related scenes. Importantly, this approach coherently models uncertainty in the number of object categories and instances.\n\n[Figure 1 plots omitted. (b) Proportion of segments versus segment labels (sorted by frequency), log-log axes; legends list PY(0.39, 3.70), DP(11.40) for one scene category and PY(0.47, 6.90), DP(33.00) for the other. (c) Number of segments versus proportion of image area, log-log axes; legends list PY(0.02, 2.20), DP(2.40) and PY(0.32, 0.80), DP(2.90). (d) Number of images versus number of segments per image; legends list the same fitted models as (c).]\n\nFigure 1: Validation of stick-breaking priors for the statistics of human segmentations of the forest (top) and insidecity (bottom) scene categories. We compare observed frequencies (black) to those predicted by Pitman\u2013Yor process (PY, red circles) and Dirichlet process (DP, green squares) models. For each model, we also display 95% confidence intervals (dashed). (a) Example human segmentations, where each segment has a text label such as sky, tree trunk, car, or person walking. The full segmented database is available from LabelMe [14]. (b) Frequency with which different semantic text labels, sorted from most to least frequent on a log-log scale, are associated with segments. (c) Number of segments occupying varying proportions of the image area, on a log-log scale. (d) Counts of segments of size at least 5,000 pixels in 256 \u00d7 256 images of natural scenes.\n\nAs described in Sec. 4, we use thresholded Gaussian processes to link assignments of features to regions, and thereby produce smooth, coherent segments. Simulations show that our use of continuous latent variables captures long-range dependencies neglected by MRFs, including intervening contour cues derived from image boundaries [13]. 
Furthermore, our formulation naturally leads to an efficient variational learning algorithm, which automatically searches over segmentations of varying resolution. Sec. 5 concludes by demonstrating accurate segmentation of complex images, and discovery of appearance patterns shared across natural scenes.\n\n2 Statistics of Natural Scene Categories\n\nTo better understand the statistical relationships underlying natural scenes, we analyze manual segmentations of Oliva and Torralba\u2019s eight categories [3]. A non-expert user partitioned each image into a variable number of polygonal segments corresponding to distinctive objects or scene elements (see Fig. 1(a)). Each segment has a semantic text label, allowing study of object co-occurrence frequencies across related scenes. There are over 29,000 segments in the collection of 2,688 images.\u00b9\n\n\u00b9 See LabelMe [14]: http://labelme.csail.mit.edu/browseLabelMe/spatial_envelope_256x256_static_8outdoorcategories.html\n\n2.1 Stick Breaking and Pitman\u2013Yor Processes\n\nThe relative frequencies of different object categories, as well as the image areas they occupy, can be statistically modeled via distributions on potentially infinite partitions. Let $\\varphi = (\\varphi_1, \\varphi_2, \\varphi_3, \\ldots)$, $\\sum_{k=1}^{\\infty} \\varphi_k = 1$, denote the probability mass associated with each subset. In nonparametric Bayesian statistics, prior models for partitions are often defined via a stick-breaking construction:\n\n$$\\varphi_k = w_k \\prod_{\\ell=1}^{k-1} (1 - w_\\ell) = w_k \\Big( 1 - \\sum_{\\ell=1}^{k-1} \\varphi_\\ell \\Big), \\qquad w_k \\sim \\mathrm{Beta}(1 - \\gamma_a, \\gamma_b + k\\gamma_a) \\tag{1}$$\n\nThis Pitman\u2013Yor (PY) process [11], denoted by $\\varphi \\sim \\mathrm{GEM}(\\gamma_a, \\gamma_b)$, is defined by two hyperparameters satisfying $0 \\le \\gamma_a < 1$, $\\gamma_b > -\\gamma_a$. When $\\gamma_a = 0$, we recover a Dirichlet process (DP) with concentration parameter $\\gamma_b$. This construction induces a distribution on $\\varphi$ such that subsets with more mass $\\varphi_k$ typically have smaller indexes $k$. When $\\gamma_a > 0$, $E[w_k]$ decreases with $k$, and the resulting partition frequencies follow heavier-tailed, power law distributions.\n\nWhile the sequences of beta variables underlying PY processes lead to infinite partitions, only a random, finite subset of size $K_\\varepsilon = |\\{k \\mid \\varphi_k > \\varepsilon\\}|$ will have probability greater than any threshold $\\varepsilon$. Implicitly, nonparametric models thus also place priors on the number of latent classes or objects.\n\n2.2 Object Label Frequencies\n\nPitman\u2013Yor processes have been previously used to model the well-known power law behavior of text sequences [15, 16]. Intuitively, the labels assigned to segments in the natural scene database have similar properties: some (like sky, trees, and building) occur frequently, while others (rainbow, lichen, scaffolding, obelisk, etc.) are more rare. Fig. 1(b) plots the observed frequencies with which unique text labels, sorted from most to least frequent, occur in two scene categories. The overlaid quantiles correspond to the best fitting DP and PY processes, with parameters $(\\hat{\\gamma}_a, \\hat{\\gamma}_b)$ estimated via maximum likelihood. When $\\hat{\\gamma}_a > 0$, $\\log E[\\tilde{\\varphi}_k \\mid \\hat{\\gamma}] \\approx -\\hat{\\gamma}_a^{-1} \\log(k) + \\Delta(\\hat{\\gamma}_a, \\hat{\\gamma}_b)$ for large $k$ [11], producing power law behavior which accurately predicts observed object frequencies. In contrast, the closest fitting DP model ($\\hat{\\gamma}_a = 0$) significantly underestimates the number of rare labels.\n\nWe have quantitatively assessed the accuracy of these models using bootstrap significance tests [17]. The PY process provides a good fit for all categories, while there is significant evidence against the DP in most cases. 
By varying PY hyperparameters, we also capture interesting differences among\nscene types: urban, man-made environments have many more unique objects than natural ones.\n\n2.3 Segment Counts and Size Distributions\n\nWe have also used the natural scene database to quantitatively validate PY priors for image parti-\ntions [17]. For natural environments, the DP and PY processes both provide accurate \ufb01ts. However,\nsome urban environments have many more small objects, producing power law area distributions\n(see Fig. 1(c)) better captured by PY processes. As illustrated in Fig. 1(d), PY priors also model\nuncertainty in the number of segments at various resolutions.\n\nWhile power laws are often used simply as a descriptive summary of observed statistics, PY pro-\ncesses provide a consistent generative model which we use to develop effective segmentation algo-\nrithms. We do not claim that PY processes are the only valid prior for image areas; for example,\nlog-normal distributions have similar properties, and may also provide a good model [18]. How-\never, PY priors lead to ef\ufb01cient variational inference algorithms, avoiding the costly MCMC search\nrequired by other segmentation methods with region size priors [18, 19].\n\n3 A Hierarchical Model for Bags of Image Features\n\nWe now develop hierarchical Pitman\u2013Yor (HPY) process models for visual scenes. We \ufb01rst describe\na \u201cbag of features\u201d model [1, 2] capturing prior knowledge about region counts and sizes, and then\nextend it to model spatially coherent shapes in Sec. 4. Our baseline bag of features model directly\ngeneralizes the stick-breaking representation of the hierarchical DP developed by Teh et al. 
[12]. N-gram language models based on HPY processes [15, 16] have somewhat different forms.\n\n3.1 Hierarchical Pitman\u2013Yor Processes\n\nEach image is first divided into roughly 1,000 superpixels [18] using a variant of the normalized cuts spectral clustering algorithm [13]. We describe the texture of each superpixel via a local texton histogram [20], using band-pass filter responses quantized to $W_t = 128$ bins. Similarly, a color histogram is computed by quantizing the HSV color space into $W_c = 120$ bins. Superpixel $i$ in image $j$ is then represented by histograms $x_{ji} = (x^t_{ji}, x^c_{ji})$ indicating its texture $x^t_{ji}$ and color $x^c_{ji}$.\n\nFigure 2 contains a directed graphical model summarizing our HPY model for collections of local image features. Each of the potentially infinite set of global object categories occurs with frequency $\\varphi_k$, where $\\varphi \\sim \\mathrm{GEM}(\\gamma_a, \\gamma_b)$ as motivated in Sec. 2.2. Each category $k$ also has an associated appearance model $\\theta_k = (\\theta^t_k, \\theta^c_k)$, where $\\theta^t_k$ and $\\theta^c_k$ parameterize multinomial distributions on the $W_t$ texture and $W_c$ color bins, respectively. These parameters are regularized by Dirichlet priors $\\theta^t_k \\sim \\mathrm{Dir}(\\rho_t)$, $\\theta^c_k \\sim \\mathrm{Dir}(\\rho_c)$, with hyperparameters chosen to encourage sparse distributions.\n\nConsider a dataset containing $J$ images of related scenes, each of which is allocated an infinite set of potential segments or regions. As in Sec. 2.3, region $t$ occupies a random proportion $\\pi_{jt}$ of the area in image $j$, where $\\pi_j \\sim \\mathrm{GEM}(\\alpha_a, \\alpha_b)$. Each region is also associated with a particular global object category $k_{jt} \\sim \\varphi$. For each superpixel $i$, we then independently select a region $t_{ji} \\sim \\pi_j$, and sample features using parameters determined by that segment\u2019s global object category:\n\n$$p(x^t_{ji}, x^c_{ji} \\mid t_{ji}, k_j, \\theta) = \\mathrm{Mult}(x^t_{ji} \\mid \\theta^t_{z_{ji}}) \\cdot \\mathrm{Mult}(x^c_{ji} \\mid \\theta^c_{z_{ji}}), \\qquad z_{ji} \\triangleq k_{j t_{ji}} \\tag{2}$$\n\nAs in other adaptations of topic models to visual data [8], we assume that different feature channels vary independently within individual object categories and segments.\n\n[Figure 2 graphics omitted. Right panels plot probability density against stick-breaking proportion (upper) and stick-breaking threshold (lower) for GEM(0, 10), GEM(0.1, 2), and GEM(0.5, 5).]\n\nFigure 2: Stick-breaking representation of a hierarchical Pitman\u2013Yor (HPY) model for J groups of features. Left: Directed graphical model in which global category frequencies $\\varphi \\sim \\mathrm{GEM}(\\gamma)$ are 
constructed from stick-breaking proportions $w_k \\sim \\mathrm{Beta}(1 - \\gamma_a, \\gamma_b + k\\gamma_a)$, as in Eq. (1). Similarly, $v_{jt} \\sim \\mathrm{Beta}(1 - \\alpha_a, \\alpha_b + t\\alpha_a)$ define region areas $\\pi_j \\sim \\mathrm{GEM}(\\alpha)$ for image $j$. Each of the $N_j$ features $x_{ji}$ is independently sampled as in Eq. (2). Upper right: Beta distributions from which stick proportions $w_k$ are sampled for three different PY processes: $k = 1$ (blue), $k = 10$ (red), $k = 20$ (green). Lower right: Corresponding distributions on thresholds for an equivalent generative model employing zero mean, unit variance Gaussians (dashed black). See Sec. 4.1.\n\n3.2 Variational Learning for HPY Mixture Models\n\nTo allow efficient learning of HPY model parameters from large image databases, we have developed a mean field variational method which combines and extends previous approaches for DP mixtures [21, 22] and finite topic models. Using the stick-breaking representation of Fig. 2, and a factorized variational posterior, we optimize the following lower bound on the marginal likelihood:\n\n$$\\log p(x \\mid \\alpha, \\gamma, \\rho) \\ge H(q) + E_q[\\log p(x, k, t, v, w, \\theta \\mid \\alpha, \\gamma, \\rho)] \\tag{3}$$\n\n$$q(k, t, v, w, \\theta) = \\Big[ \\prod_{k=1}^{K} q(w_k \\mid \\omega_k)\\, q(\\theta_k \\mid \\eta_k) \\Big] \\cdot \\prod_{j=1}^{J} \\Big[ \\prod_{t=1}^{T} q(v_{jt} \\mid \\nu_{jt})\\, q(k_{jt} \\mid \\kappa_{jt}) \\Big] \\prod_{i=1}^{N_j} q(t_{ji} \\mid \\tau_{ji})$$\n\nHere, $H(q)$ is the entropy. We truncate the variational posterior [21] by setting $q(v_{jT} = 1) = 1$ for each image or group, and $q(w_K = 1) = 1$ for the shared global clusters. Multinomial assignments $q(k_{jt} \\mid \\kappa_{jt})$, $q(t_{ji} \\mid \\tau_{ji})$, and beta stick proportions $q(w_k \\mid \\omega_k)$, $q(v_{jt} \\mid \\nu_{jt})$, then have closed form update equations. 
To avoid bias, we sort the current sets of image segments, and global categories, in order of decreasing aggregate assignment probability after each iteration [22].\n\n4 Segmentation with Spatially Dependent Pitman\u2013Yor Processes\n\nWe now generalize the HPY image segmentation model of Fig. 2 to capture spatial dependencies. For simplicity, we consider a single-image model in which features $x_i$ are assigned to regions by indicator variables $z_i$, and each segment $k$ has its own appearance parameters $\\theta_k$ (see Fig. 3). As in Sec. 3.1, however, this model is easily extended to share appearance parameters among images.\n\n4.1 Coupling Assignments using Thresholded Gaussian Processes\n\nConsider a generative model which partitions data into two clusters via assignments $z_i \\in \\{0, 1\\}$ sampled such that $P[z_i = 1] = v$. One representation of this sampling process first generates a Gaussian auxiliary variable $u_i \\sim \\mathcal{N}(0, 1)$, and then chooses $z_i$ according to the following rule:\n\n$$z_i = \\begin{cases} 1 & \\text{if } u_i < \\Phi^{-1}(v) \\\\\\\\ 0 & \\text{otherwise} \\end{cases} \\qquad \\Phi(u) \\triangleq \\frac{1}{\\sqrt{2\\pi}} \\int_{-\\infty}^{u} e^{-s^2/2} \\, ds \\tag{4}$$\n\nHere, $\\Phi(u)$ is the standard normal cumulative distribution function (CDF). Since $\\Phi(u_i)$ is uniformly distributed on $[0, 1]$, we immediately have $P[z_i = 1] = P[u_i < \\Phi^{-1}(v)] = P[\\Phi(u_i) < v] = v$.\n\nWe adapt this idea to PY processes using the stick-breaking representation of Eq. (1). In particular, we note that if $z_i \\sim \\pi$ where $\\pi_k = v_k \\prod_{\\ell=1}^{k-1} (1 - v_\\ell)$, a simple induction argument shows that $v_k = P[z_i = k \\mid z_i \\neq k - 1, \\ldots, 1]$. The stick-breaking proportion $v_k$ is thus the conditional probability of choosing cluster $k$, given that clusters with indexes $\\ell < k$ have been rejected. Combining this insight with Eq. (4), we can generate samples $z_i \\sim \\pi$ as follows:\n\n$$z_i = \\min \\{ k \\mid u_{ki} < \\Phi^{-1}(v_k) \\}, \\qquad \\text{where } u_{ki} \\sim \\mathcal{N}(0, 1) \\text{ and } u_{ki} \\perp u_{\\ell i},\\ k \\neq \\ell \\tag{5}$$\n\nAs illustrated in Fig. 3, each cluster $k$ is now associated with a zero mean Gaussian process (GP) $u_k$, and assignments are determined by the sequence of thresholds in Eq. (5). If the GPs have identity covariance functions, we recover the basic HPY model of Sec. 3.1. More general covariances can be used to encode the prior probability that each feature pair occupies the same segment. Intuitively, the ordering of segments underlying this dependent PY model is analogous to layered appearance models [23], in which foreground layers occlude those that are farther from the camera.\n\nFigure 3: A nonparametric Bayesian approach to image segmentation in which thresholded Gaussian processes generate spatially dependent Pitman\u2013Yor processes. Left: Directed graphical model in which expected segment areas $\\pi \\sim \\mathrm{GEM}(\\alpha)$ are constructed from stick-breaking proportions $v_k \\sim \\mathrm{Beta}(1 - \\alpha_a, \\alpha_b + k\\alpha_a)$. Zero mean Gaussian processes ($u_{ki} \\sim \\mathcal{N}(0, 1)$) are cut by thresholds $\\Phi^{-1}(v_k)$ to produce segment assignments $z_i$, and thereby features $x_i$. Right: Three randomly sampled image partitions (columns), where assignments (bottom, color-coded) are determined by the first of the ordered Gaussian processes $u_k$ to cross $\\Phi^{-1}(v_k)$.\n\nTo retain the power law prior on segment sizes justified in Sec. 
2.3, we transform priors on stick proportions $v_k \\sim \\mathrm{Beta}(1 - \\alpha_a, \\alpha_b + k\\alpha_a)$ into corresponding random thresholds:\n\n$$p(\\bar{v}_k \\mid \\alpha) = \\mathcal{N}(\\bar{v}_k \\mid 0, 1) \\cdot \\mathrm{Beta}(\\Phi(\\bar{v}_k) \\mid 1 - \\alpha_a, \\alpha_b + k\\alpha_a), \\qquad \\bar{v}_k \\triangleq \\Phi^{-1}(v_k) \\tag{6}$$\n\nFig. 2 illustrates the threshold distributions corresponding to several different PY stick-breaking priors. As the number of features $N$ becomes large relative to the GP covariance length-scale, the proportion assigned to segment $k$ approaches $\\pi_k$, where $\\pi \\sim \\mathrm{GEM}(\\alpha_a, \\alpha_b)$ as desired.\n\n4.2 Variational Learning for Dependent PY Processes\n\nSubstantial innovations are required to extend the variational method of Sec. 3.2 to the Gaussian processes underlying our dependent PY processes. Complications arise due to the threshold assignment process of Eq. (5), which is \u201cstronger\u201d than the likelihoods typically used in probit models for GP classification, as well as the non-standard threshold prior of Eq. (6). In the simplest case, we place factorized Gaussian variational posteriors on thresholds $q(\\bar{v}_k) = \\mathcal{N}(\\bar{v}_k \\mid \\nu_k, \\delta_k)$ and assignment surfaces $q(u_{ki}) = \\mathcal{N}(u_{ki} \\mid \\mu_{ki}, \\lambda_{ki})$, and exploit the following key identities:\n\n$$P_q[u_{ki} < \\bar{v}_k] = \\Phi\\Big( \\frac{\\nu_k - \\mu_{ki}}{\\sqrt{\\delta_k + \\lambda_{ki}}} \\Big), \\qquad E_q[\\log \\Phi(\\bar{v}_k)] \\le \\log E_q[\\Phi(\\bar{v}_k)] = \\log \\Phi\\Big( \\frac{\\nu_k}{\\sqrt{1 + \\delta_k}} \\Big) \\tag{7}$$\n\nThe first expression leads to closed form updates for Dirichlet appearance parameters $q(\\theta_k \\mid \\eta_k)$, while the second evaluates the beta normalization constants in Eq. (6). We then jointly optimize each layer\u2019s threshold $q(\\bar{v}_k)$ and assignment surface $q(u_k)$, fixing all other layers, via backtracking conjugate gradient (CG) with line search. 
For details and further refinements, see [17].\n\nFigure 4: Five samples from each of four prior models for image partitions (color coded). Top Left: Nearest neighbor Potts MRF with K = 10 states. Top Right: Potts MRF with potentials biased by DP samples [28]. Bottom Left: Softmax model in which spatially varying assignment probabilities are coupled by logistically transformed GPs [25\u201327]. Bottom Right: PY process assignments coupled by thresholded GPs (as in Fig. 3).\n\n4.3 Related Work\n\nRecently, Duan et al. [24] proposed a generalized spatial Dirichlet process which links assignments via thresholded GPs, as in Sec. 4.1. However, their focus is on modeling spatial random effects for prediction tasks, as opposed to the segmentation tasks which motivate our generalization to PY processes. Unlike our HPY extension, they do not consider approaches to sharing parameters among related groups or images. Moreover, their basic Gibbs sampler takes 12 hours on a toy dataset with 2,000 observations; our variational method jointly segments 200 scenes in comparable time.\n\nSeveral authors have independently proposed a spatial model based on pointwise, multinomial logistic transformations of $K$ latent GPs [25\u201327]. This produces a field of smoothly varying multinomial distributions $\\check{\\pi}_i$, from which segment assignments are independently sampled as $z_i \\sim \\check{\\pi}_i$. As shown in Fig. 4, this softmax construction produces noisy, less spatially coherent partitions. Moreover, its bias towards partitions with $K$ segments of similar size is a poor fit for natural scenes.\n\nA previous nonparametric image segmentation method defined its prior as a normalized product of a DP sample $\\pi \\sim \\mathrm{GEM}(0, \\alpha)$ and a nearest neighbor MRF with Potts potentials [28]. 
This construction effectively treats $\\log \\pi$ as the canonical, rather than moment, parameters of the MRF, and does not produce partitions whose size distribution matches $\\mathrm{GEM}(0, \\alpha)$. Due to the phase transition which occurs with increasing potential strength, Potts models assign low probability to realistic image partitions [29]. Empirically, the DP-Potts product construction seems to have similar issues (see Fig. 4), although it can still be effective with strongly informative likelihoods [28].\n\n5 Results\n\nFigure 5 shows segmentation results for images from the scene categories considered in Sec. 2. We compare the bag of features PY model (PY-BOF), dependent PY with distance-based squared exponential covariance (PY-Dist), and dependent PY with covariance that incorporates intervening contour cues (PY-Edge) based on the Pb detector [20]. The conditionally specified PY-Edge model scales the covariance between superpixels $i$ and $j$ by $\\sqrt{1 - b_{ij}}$, where $b_{ij}$ is the largest Pb response on the straight line connecting them. We convert these local covariance estimates into a globally consistent, positive definite matrix via an eigendecomposition. For the results in Figs. 5 and 6, we independently segment each image, without sharing appearance models or supervised training.\n\nWe compare our results to the normalized cuts spectral clustering method with varying numbers of segments (NCut(K)), and a high-quality affinity function based on color, texture, and intervening contour cues [13]. Our PY models consistently capture variability in the number of true segments, and detect both large and small regions. In contrast, normalized cuts is implicitly biased towards regions of equal size, which produces distortions. To quantitatively evaluate results, we measure overlap with held-out human segments via the Rand index [30]. As summarized in Fig. 
6, PY-BOF performs well for some images with unambiguous features, but PY-Edge is often substantially better.\n\nWe have also experimented with our hierarchical PY extension, in which color and texture distributions are shared between images. As shown in Fig. 7, many of the inferred global visual categories align reasonably with semantic categories (e.g., sky, foliage, mountains, or buildings).\n\nFigure 5: Segmentation results for two images (rows) from each of the coast, mountain, and tallbuilding scene categories. From left to right, columns show LabelMe human segments, image with boundaries inferred by PY-Edge, and segments for PY-Edge, PY-Dist, PY-BOF, NCut(3), NCut(4), and NCut(6). Best viewed in color.\n\n[Figure 6 plots omitted. Panels (a) and (c) plot PY Gaussian (Edge Covar) against Normalized Cuts; panels (b) and (d) plot Average Rand Index against Number of Normalized Cuts Regions for Normalized Cuts, PY Gaussian (Edge Covar), PY Gaussian (Distance Covar), and PY Bag of Features.]\n\nFigure 6: Quantitative comparison of segmentation results to human segments, using the Rand index. (a) Scatter plot of PY-Edge and NCut(4) Rand indexes for 200 mountain images. (b) Average Rand indexes for mountain images. We plot the performance of NCut(K) versus the number of segments K, compared to the variable resolution segmentations of PY-Edge, PY-Dist, and PY-BOF. (c) Scatter plot of PY-Edge and NCut(6) Rand indexes for 200 tallbuilding images. (d) Average Rand indexes for tallbuilding images.\n\n6 Discussion\n\nWe have developed a nonparametric framework for image segmentation which uses thresholded Gaussian processes to produce spatially coupled Pitman\u2013Yor processes. This approach produces empirically justified power law priors for region areas and object frequencies, allows visual appearance models to be flexibly shared among natural scenes, and leads to efficient variational inference algorithms which automatically search over segmentations of varying resolution. 
We believe this provides a promising starting point for discovery of shape-based visual appearance models, as well as weakly supervised nonparametric learning in other, non-visual application domains.\n\nFigure 7: Most significant segments associated with each of three shared, global visual categories (rows) for hierarchical PY-Edge models trained with 200 images of mountain (left) or tallbuilding (right) scenes.\n\nAcknowledgments We thank Charless Fowlkes and David Martin for the Pb boundary estimation and segmentation code, Antonio Torralba for helpful conversations, and Sra. Barriuso for her image labeling expertise. This research was supported by ONR Grant N00014-06-1-0734, and DARPA IPTO Contract FA8750-05-2-0249.\n\nReferences\n\n[1] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, volume 2, pages 524\u2013531, 2005.\n\n[2] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In ICCV, volume 1, pages 370\u2013377, 2005.\n\n[3] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145\u2013175, 2001.\n\n[4] L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent object segmentation and classification. In ICCV, 2007.\n\n[5] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, volume 2, pages 1605\u20131614, 2006.\n\n[6] S. Todorovic and N. Ahuja. Learning the taxonomy and models of categories present in arbitrary images. In ICCV, 2007.\n\n[7] X. He, R. S. Zemel, and M. A. Carreira-Perpi\u00f1\u00e1n. Multiscale conditional random fields for image labeling. In CVPR, volume 2, pages 695\u2013702, 2004.\n\n[8] J. Verbeek and B. Triggs. Region classification with Markov field aspect models. 
In CVPR, 2007.\n\n[9] C. Rother, V. Kolmogorov, T. Minka, and A. Blake. Cosegmentation of image pairs by histogram matching: Incorporating a global constraint into MRFs. In CVPR, volume 1, pages 993\u20131000, 2006.\n\n[10] M. Andreetto, L. Zelnik-Manor, and P. Perona. Non-parametric probabilistic image segmentation. In ICCV, 2007.\n\n[11] J. Pitman and M. Yor. The two-parameter Poisson\u2013Dirichlet distribution derived from a stable subordinator. Ann. Prob., 25(2):855\u2013900, 1997.\n\n[12] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. J. Amer. Stat. Assoc., 101(476):1566\u20131581, December 2006.\n\n[13] C. Fowlkes, D. Martin, and J. Malik. Learning affinity functions for image segmentation: Combining patch-based and gradient-based approaches. In CVPR, volume 2, pages 54\u201361, 2003.\n\n[14] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. IJCV, 77:157\u2013173, 2008.\n\n[15] S. Goldwater, T. L. Griffiths, and M. Johnson. Interpolating between types and tokens by estimating power-law generators. In NIPS 18, pages 459\u2013466. MIT Press, 2006.\n\n[16] Y. W. Teh. A hierarchical Bayesian language model based on Pitman\u2013Yor processes. In Coling/ACL, 2006.\n\n[17] E. B. Sudderth and M. I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. Technical report, Dept. of Statistics, University of California, Berkeley. In preparation, 2009.\n\n[18] X. Ren and J. Malik. Learning a classification model for segmentation. In ICCV, 2003.\n\n[19] Z. Tu and S. C. Zhu. Image segmentation by data-driven Markov chain Monte Carlo. IEEE Trans. PAMI, 24(5):657\u2013673, May 2002.\n\n[20] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. PAMI, 26(5):530\u2013549, May 2004.\n\n[21] D. M. 
Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayes. Anal., 1(1):121\u2013144, 2006.\n\n[22] K. Kurihara, M. Welling, and Y. W. Teh. Collapsed variational Dirichlet process mixture models. In IJCAI 20, pages 2796\u20132801, 2007.\n\n[23] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IEEE Trans. IP, 3(5):625\u2013638, September 1994.\n\n[24] J. A. Duan, M. Guindani, and A. E. Gelfand. Generalized spatial Dirichlet process models. Biometrika, 94(4):809\u2013825, 2007.\n\n[25] C. Fern\u00e1ndez and P. J. Green. Modelling spatially correlated data via mixtures: A Bayesian approach. J. R. Stat. Soc. B, 64(4):805\u2013826, 2002.\n\n[26] M. A. T. Figueiredo. Bayesian image segmentation using Gaussian field priors. In CVPR Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2005.\n\n[27] M. W. Woolrich and T. E. Behrens. Variational Bayes inference of spatial mixture models for segmentation. IEEE Trans. MI, 25(10):1380\u20131391, October 2006.\n\n[28] P. Orbanz and J. M. Buhmann. Smooth image segmentation by nonparametric Bayesian inference. In ECCV, volume 1, pages 444\u2013457, 2006.\n\n[29] R. D. Morris, X. Descombes, and J. Zerubia. The Ising/Potts model is not well suited to segmentation tasks. In IEEE DSP Workshop, 1996.\n\n[30] R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward objective evaluation of image segmentation algorithms. IEEE Trans. PAMI, 29(6):929\u2013944, June 2007.\n\n", "award": [], "sourceid": 1027, "authors": [{"given_name": "Erik", "family_name": "Sudderth", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}