{"title": "Learning Segmentation by Random Walks", "book": "Advances in Neural Information Processing Systems", "page_first": 873, "page_last": 879, "abstract": null, "full_text": "Learning Segmentation by Random Walks \n\nMarina Meila \n\nUniversity of Washington \nmmp~stat.washington.edu \n\nJianbo Shi \n\nCarnegie Mellon University \n\njshi~cs.cmu.edu \n\nAbstract \n\nWe present a new view of image segmentation by pairwise simi(cid:173)\nlarities. We interpret the similarities as edge flows in a Markov \nrandom walk and study the eigenvalues and eigenvectors of the \nwalk's transition matrix. This interpretation shows that spectral \nmethods for clustering and segmentation have a probabilistic foun(cid:173)\ndation. In particular, we prove that the Normalized Cut method \narises naturally from our framework. Finally, the framework pro(cid:173)\nvides a principled method for learning the similarity function as a \ncombination of features. \n\n1 \n\nIntroduction \n\nThis paper focuses on pairwise (or similarity-based) clustering and image segmen(cid:173)\ntation. In contrast to statistical clustering methods, that assume a probabilistic \nmodel that generates the observed data points (or pixels), pairwise clustering de(cid:173)\nfines a similarity function between pairs of points and then formulates a criterion \n(e.g. maximum total intracluster similarity) that the clustering must optimize. The \noptimality criteria quantify the intuitive notion that points in a cluster (or pixels \nin a segment) are similar, whereas points in different clusters are dissimilar. \n\nAn increasingly popular approach to similarity based clustering and segmentation \nis by spectral methods. These methods use eigenvalues and eigenvectors of a matrix \nconstructed from the pairwise similarity function. 
Spectral methods are sometimes regarded as continuous approximations of previously formulated discrete graph-theoretic criteria, as in the image segmentation method of [9] or the web clustering methods of [4, 2]. As demonstrated in [9, 4], these methods are capable of delivering impressive segmentation/clustering results using simple low-level features.

In spite of their practical successes, spectral methods are still incompletely understood. The main achievement of this work is to show that there is a simple probabilistic interpretation that can offer insights and serve as an analysis tool for all the spectral methods cited above. We view the pairwise similarities as edge flows in a Markov random walk and study the properties of the eigenvectors and eigenvalues of the resulting transition matrix. Using this view, we are able to show that several of the above methods are subsumed by the Normalized Cut (NCut) segmentation algorithm of [9], in a sense that will be described. Therefore, in the following, we focus on the NCut algorithm and adopt the terminology of image segmentation (i.e. the data points are pixels and the set of all pixels is the image), keeping in mind that all the results are also valid for similarity-based clustering.

A probabilistic interpretation of NCut as a Markov random walk not only sheds new light on why and how spectral methods work in segmentation, but also offers a principled way of learning the similarity function. A segmented image can provide a "target" transition matrix, to which a learning algorithm matches, in KL divergence, the "learned" transition probabilities. The latter are output by a model as a function of a set of features measured from the training image. This is described in section 5. Experimental results on learning to segment objects with smooth and rounded shape are described in section 6.
2 The Normalized Cut criterion and algorithm

Here and in the following, an image will be represented by a set of pixels I. A segmentation is a partitioning of I into mutually disjoint subsets. For each pair of pixels i, j ∈ I a similarity S_ij = S_ji ≥ 0 is given. In the NCut framework the similarities S_ij are viewed as weights on the edges ij of a graph G over I. The matrix S = [S_ij] plays the role of a "real-valued" adjacency matrix for G. Let d_i = Σ_{j∈I} S_ij, called the degree of node i, and let the volume of a set A ⊆ I be vol A = Σ_{i∈A} d_i. The set of edges between A and its complement Ā is an edge cut, or shortly a cut, with weight cut(A, Ā) = Σ_{i∈A, j∈Ā} S_ij. The normalized cut (NCut) criterion of [9] is a graph-theoretic criterion for segmenting an image into two by minimizing

    NCut(A, Ā) = cut(A, Ā)/vol A + cut(A, Ā)/vol Ā    (1)

over all cuts A, Ā. Minimizing NCut means finding a cut of relatively small weight between two subsets with strong internal connections. In [9] it is shown that optimizing NCut is NP-hard.

The NCut algorithm was introduced in [9] as a continuous approximation for solving the discrete minimum NCut problem by way of eigenvalues and eigenvectors. It uses the Laplacian matrix L = D − S, where D is a diagonal matrix formed with the degrees of the nodes. The algorithm consists of solving the generalized eigenvalue/eigenvector problem

    Lx = λDx    (2)

The NCut algorithm focuses on the second smallest eigenvalue of (2) and its corresponding eigenvector, call them λ^L and x^L respectively. In [9] it is shown that when there is a partitioning A, Ā of I such that

    x^L_i = { α,  i ∈ A
            { β,  i ∈ Ā    (3)

then A, Ā is the optimal NCut and the value of the cut itself is NCut(A, Ā) = λ^L. This result represents the basis of spectral segmentation by normalized cuts. One solves the generalized spectral problem (2), then finds a partitioning of the elements of x^L into two sets containing roughly equal values.
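As a concrete illustration (a sketch, not the authors' code), the spectral step above can be written with NumPy/SciPy; the similarity matrix here is a toy example with two internally similar groups:

```python
import numpy as np
from scipy.linalg import eigh

# Toy similarity matrix over 5 "pixels": two internally similar groups,
# {0,1,2} and {3,4}, weakly linked to each other.
S = np.array([
    [0.0,  1.0,  1.0,  0.05, 0.0],
    [1.0,  0.0,  1.0,  0.0,  0.05],
    [1.0,  1.0,  0.0,  0.0,  0.0],
    [0.05, 0.0,  0.0,  0.0,  1.0],
    [0.0,  0.05, 0.0,  1.0,  0.0],
])
D = np.diag(S.sum(axis=1))   # diagonal matrix of node degrees d_i
L = D - S                    # Laplacian L = D - S

# Solve L x = lambda D x; eigh returns eigenvalues in ascending order,
# so column 1 holds the second smallest generalized eigenvector x^L.
lams, X = eigh(L, D)
x_L = X[:, 1]

# Threshold x^L at 0 to split I into A and its complement (the
# eigenvector's overall sign is arbitrary, so A may be either group).
A = set(np.where(x_L < 0)[0])
```

For this near-block-constant S, the entries of x^L cluster around two values, so a simple threshold recovers the two groups.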
The partitioning can be done by thresholding the elements of x^L. The partitioning of the eigenvector induces a partition on I, which is the desired segmentation.

As presented above, the NCut algorithm lacks a satisfactory intuitive explanation. In particular, the NCut algorithm and criterion offer little intuition about (1) what causes x^L to be piecewise constant, (2) what happens when there are more than two segments, and (3) how the algorithm's performance degrades when x^L is not piecewise constant.

The random walk interpretation that we describe now will answer the first two questions, as well as give a better understanding of what spectral clustering is achieving. We shall not approach the third issue here; instead, we point to the results of [2], which apply to the NCut algorithm as well.

3 Markov walks and normalized cuts

By "normalizing" the similarity matrix S one obtains the stochastic matrix

    P = D^{-1} S    (4)

whose row sums are all 1. As is known from the theory of Markov random walks, P_ij represents the probability of moving from node i to node j in one step, given that we are in i. The eigenvalues of P are λ_1 = 1 ≥ λ_2 ≥ ... ≥ λ_n ≥ −1; x^1, ..., x^n are the eigenvectors. The first eigenvector of P is x^1 = 1, the vector whose elements are all 1s. W.l.o.g. we assume that no node has degree 0.

Let us now examine the spectral problem for the matrix P, namely the solutions of the equation

    Px = λx    (5)

Proposition 1 If λ, x are solutions of (5) and P = D^{-1} S, then (1 − λ), x are solutions of (2).

In other words, the NCut algorithm and the matrix P have the same eigenvectors; the eigenvalues of P are identical to 1 minus the generalized eigenvalues in (2). Proposition 1 shows the equivalence between the spectral problem formulated by the NCut algorithm and the eigenvalues/eigenvectors of the stochastic matrix P.
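Proposition 1 is easy to check numerically; the following sketch (with an arbitrary random similarity matrix, not from the paper) compares the spectrum of P = D^{-1}S with the generalized eigenvalues of (2):

```python
import numpy as np
from scipy.linalg import eig, eigh

rng = np.random.default_rng(0)

# A random symmetric similarity matrix with nonnegative entries.
S = rng.random((6, 6))
S = (S + S.T) / 2
D = np.diag(S.sum(axis=1))

# Row-stochastic matrix P = D^{-1} S of equation (4).
P = np.linalg.solve(D, S)

# Eigenvalues of P, and generalized eigenvalues of L x = lambda D x.
mu = np.sort(eig(P)[0].real)        # ascending
lam = np.sort(eigh(D - S, D)[0])    # ascending
# Proposition 1: the two spectra are related by mu = 1 - lam.
```

Sorting both sets, `np.sort(1 - lam)` coincides with `mu`, and the largest eigenvalue of P (equal to 1) corresponds to the smallest generalized eigenvalue (equal to 0).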
Proposition 1 also helps explain why the NCut algorithm uses the second smallest generalized eigenvector: the smallest eigenvector of (2) corresponds to the largest eigenvector of P, which in most cases of interest is equal to 1 and thus contains no information.

The NCut criterion can also be understood in this framework. First define π^∞ = [π^∞_i]_{i∈I} by π^∞_i = d_i / vol I. It is easy to verify that P^T π^∞ = π^∞ and thus that π^∞ is a stationary distribution of the Markov chain. If the chain is ergodic, which happens under mild conditions [1], then π^∞ is the only distribution over I with this property. Note also that the Markov chain is reversible, because π^∞_i P_ij = π^∞_j P_ji = S_ij / vol I. Define P_AB = Pr[A → B | A] as the probability of the random walk transitioning from set A ⊂ I to set B ⊂ I in one step if the current state is in A and the random walk is started in its stationary distribution:

    P_AB = Σ_{i∈A, j∈B} S_ij / vol(A)    (6)

From this it follows that

    NCut(A, Ā) = P_AĀ + P_ĀA    (7)

If the NCut is small for a certain partition A, Ā, it means that the probabilities of evading set A, once the walk is in it, and of evading its complement Ā are both small. Intuitively, we have partitioned the set I into two parts such that the random walk, once in one of the parts, tends to remain in it.

The NCut is strongly related to the concept of low conductivity sets in a Markov random walk. A low conductivity set A is a subset of I such that h(A) = max(P_AĀ, P_ĀA) is small. Such sets have been studied in spectral graph theory in connection with the mixing time of Markov random walks [1]. More recently, [2] uses them to define a new criterion for clustering. Not coincidentally, the heuristic analyzed there is strongly similar to the NCut algorithm.

4 Stochastic matrices with piecewise constant eigenvectors

In the following we will use the transition matrix P to achieve a better understanding of the NCut algorithm.
Recall that the NCut algorithm looks at the second "largest" eigenvector of P, denoted by x^2 and equal to x^L, in order to obtain a partitioning of I. We define a vector x to be piecewise constant relative to a partition Δ = (A_1, A_2, ..., A_k) of I iff x_i = x_j for i, j pixels in the same set A_s, s = 1, ..., k. Since having piecewise constant eigenvectors is the ideal case for spectral segmentation, it is important to understand when the matrix P has this desired property. We study when the first k out of n eigenvectors are piecewise constant.

Proposition 2 Let P be a matrix with rows and columns indexed by I that has independent eigenvectors. Let Δ = (A_1, A_2, ..., A_k) be a partition of I. Then, P has k eigenvectors that are piecewise constant w.r.t. Δ and correspond to non-zero eigenvalues if and only if the sums P_is = Σ_{j∈A_s} P_ij are constant for all i ∈ A_s' and all s, s' = 1, ..., k, and the matrix R = [R_ss']_{s,s'=1,...,k} (with R_ss' = Σ_{j∈A_s'} P_ij for i ∈ A_s) is non-singular.

Lemma 3 If the matrix P of dimension n is of the form P = D^{-1} S, with S symmetric and D non-singular, then P has n independent eigenvectors.

We call a stochastic matrix P satisfying the conditions of Proposition 2 a block-stochastic matrix. Intuitively, Proposition 2 says that a stochastic matrix has piecewise constant eigenvectors if the underlying Markov chain can be aggregated into a Markov chain with state space Δ = {A_1, ..., A_k} and transition probability matrix R. This opens interesting connections between the field of spectral segmentation and the body of work on aggregability (or lumpability) [3] of Markov chains. The proof of Proposition 2 is provided in [5].

Proposition 2 shows that a much broader condition exists for the NCut algorithm to produce an exact segmentation/clustering solution.
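The block-stochastic condition of Proposition 2 can be illustrated on a small constructed example (a sketch under assumed toy similarity values, not from the paper): when the similarities are constant within and across blocks, the row sums of P into each block are constant per block, and the leading eigenvectors come out piecewise constant.

```python
import numpy as np

# Block-constant similarities over two blocks {0,1,2} and {3,4}:
# 1.0 within a block, 0.1 across blocks.
blocks = [[0, 1, 2], [3, 4]]
S = np.full((5, 5), 0.1)
for b in blocks:
    S[np.ix_(b, b)] = 1.0

P = S / S.sum(axis=1, keepdims=True)    # P = D^{-1} S, row-stochastic

# Condition of Proposition 2: for each pair of blocks (A_s, A_s'),
# sum_{j in A_s'} P_ij is the same for every i in A_s.
for bi in blocks:
    for bj in blocks:
        sums = P[np.ix_(bi, bj)].sum(axis=1)
        assert np.allclose(sums, sums[0])

# Consequently the two leading eigenvectors of P are piecewise
# constant with respect to the partition.
w, V = np.linalg.eig(P)
for idx in np.argsort(-w.real)[:2]:
    v = V[:, idx].real
    assert np.allclose(v[:3], v[0]) and np.allclose(v[3:], v[3])
```

Here the aggregated chain over Δ = {A_1, A_2} is the 2×2 matrix R of block transition probabilities, and the leading eigenvalues of P coincide with the eigenvalues of R.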
This condition shows that spectral clustering is, in fact, able to group pixels by the similarity of their transition probabilities to subsets of I. Experiments [9] show that NCut works well on many graphs that have a sparse, complex connection structure, supporting this result with practical evidence. Proposition 2 generalizes previous results of [10].

The NCut algorithm and criterion is one of several recently proposed spectral segmentation methods. In image segmentation, there are the algorithms of Perona and Freeman (PF) [7] and Scott and Longuet-Higgins (SLH) [8]. In web clustering, there are the algorithm of Kleinberg (K) [4], the long-known latent semantic analysis (LSA), and the variant proposed by Kannan, Vempala and Vetta (KVV) [2]. It is easy to show that each of the above ideal situations implies that the resulting stochastic matrix P satisfies the conditions of Proposition 2, and thus the NCut algorithm will also work exactly in these situations. In this sense NCut subsumes PF, SLH and (certain variants of) K. Moreover, none of the three other methods takes into account more information than NCut does. Another important aspect of a spectral clustering algorithm is robustness. Empirical results of [10] show that NCut is at least as robust as PF and SLH.

5 The framework for learning image segmentation

The previous section stressed the connection between NCut as a criterion for image segmentation and searching for low conductivity sets in a random walk. Here we will exploit this connection to develop a framework for supervised learning of image segmentation. Our goal is to obtain an algorithm that starts with a training set of segmented images and with a set of features, and learns a function of the features that produces correct segmentations, as shown in figure 1.
Figure 1: The general framework for learning image segmentation.

For simplicity, assume the training set consists of one image only and its correct segmentation. From the latter it is easy to obtain "ideal" or target transition probabilities

    P*_ij = { 1/|A|,  j ∈ A
            { 0,      j ∉ A     for i in segment A with |A| elements    (8)

We also have a predefined set of features f^q, q = 1, ..., Q, which measure similarity between two pixels according to different criteria, and their values f^q_ij for I.

The model is the part of the framework that is subject to learning. It takes the features f^q_ij as inputs and outputs the global similarity measure S_ij. For the present experiments we use the simple model S_ij = e^{Σ_q λ_q f^q_ij}. Intuitively, it represents a set of independent "experts", the factors e^{λ_q f^q_ij}, voting on the probability of a transition i → j.

In our framework, based on the fact that a segmentation is equivalent to a random walk, optimality is defined as the minimization of the conditional Kullback-Leibler (KL) divergence between the target probabilities P*_ij and the transition probabilities P_ij obtained by normalizing S_ij. Because P* is fixed, this minimization is equivalent to maximizing the cross entropy between the two (conditional) distributions, i.e. max J, where

    J = Σ_{i∈I} (1/|I|) Σ_{j∈I} P*_ij log P_ij    (9)

If we interpret the factor 1/|I| as a uniform distribution π^0 over states, then the criterion in (9) is equivalent to the KL divergence between two distributions over transitions, KL(P*_{i→j} || P_{i→j}), where P*_{i→j} = π^0_i P*_ij.

Maximizing J can be done via gradient ascent in the parameters λ.
We obtain

    ∂J/∂λ_q = (1/|I|) ( Σ_{ij} P*_ij f^q_ij − Σ_{ij} P_ij f^q_ij )    (10)

One can further note that the optimum of J corresponds to the solution of the following maximum entropy problem:

    max_{P_{j|i}} H(j|i)   s.t.   ⟨f^q_ij⟩_{π^0_i P*_{j|i}} = ⟨f^q_ij⟩_{π^0_i P_{j|i}}   for q = 1, ..., Q    (11)

Since this is a convex optimization problem, it has a unique optimum.

6 Segmentation with shape and region information

In this section, we exemplify our approach on a set of synthetic and real images, using features carrying contour and shape information. First we use a set of local filter banks as edge detectors. They capture both edge strength and orientation. From this basic information we construct two features: the intervening contour (IC) and the co-linearity/co-circularity (CL).

Figure 2: Features for segmenting objects with smooth rounded shape. (a) The edge strength provides a cue of region boundary. It biases against random walks in a direction orthogonal to an edge. (b) Edge orientation provides a cue for the object's shape. The induced edge flow is used to bias the random walk along the edge, and transitions between co-circular edge flows are encouraged. (c) Edge flow for the bump in figure 3.

[Figure panels (d)-(h): plot content not recoverable from the extraction.]
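The learning step of section 5 can be sketched as follows (a minimal illustration, not the authors' implementation: the features f^q_ij, the two-segment labeling, and the step size are toy assumptions). It fits λ by gradient ascent on J using the model S_ij = exp(Σ_q λ_q f^q_ij) and the gradient of equation (10):

```python
import numpy as np

rng = np.random.default_rng(1)
n, Q = 6, 3

# Toy feature values f[q, i, j], symmetric in (i, j).
f = rng.random((Q, n, n))
f = (f + f.transpose(0, 2, 1)) / 2

# Target transitions P* of equation (8) from a labeled segmentation
# {0,1,2} | {3,4,5}: P*_ij = 1/|A| inside i's segment, 0 outside.
labels = np.array([0, 0, 0, 1, 1, 1])
P_star = (labels[:, None] == labels[None, :]).astype(float)
P_star /= P_star.sum(axis=1, keepdims=True)

lam = np.zeros(Q)
eta = 0.5                                        # step size (assumed)
for _ in range(200):
    S = np.exp(np.einsum('q,qij->ij', lam, f))   # S_ij = exp(sum_q lam_q f^q_ij)
    P = S / S.sum(axis=1, keepdims=True)         # P = D^{-1} S
    # Gradient (10): (1/|I|) (sum_ij P*_ij f^q_ij - sum_ij P_ij f^q_ij)
    grad = np.einsum('ij,qij->q', P_star - P, f) / n
    lam += eta * grad

J = (P_star * np.log(P)).sum() / n               # criterion (9)
```

Since J is concave in λ for this log-linear model, the ascent converges to the unique optimum noted after equation (11); J rises above its value log(1/n) at λ = 0.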