{"title": "The g Factor: Relating Distributions on Features to Distributions on Images", "book": "Advances in Neural Information Processing Systems", "page_first": 1231, "page_last": 1238, "abstract": null, "full_text": "The 9 Factor: Relating Distributions on \n\nFeatures to Distributions on Images \n\nJames M. Coughlan and A. L. Yuille \nSmith-Kettlewell Eye Research Institute, \n\n2318 Fillmore Street, \n\nSan Francisco, CA 94115, USA. \n\nTel. (415) 345-2146/2144. Fax. (415) 345-8455. \n\nEmail: coughlan@ski.org.yuille@ski.org \n\nAbstract \n\nWe describe the g-factor, which relates probability distributions \non image features to distributions on the images themselves. The \ng-factor depends only on our choice of features and lattice quanti(cid:173)\nzation and is independent of the training image data. We illustrate \nthe importance of the g-factor by analyzing how the parameters of \nMarkov Random Field (i.e. Gibbs or log-linear) probability models \nof images are learned from data by maximum likelihood estimation. \nIn particular, we study homogeneous MRF models which learn im(cid:173)\nage distributions in terms of clique potentials corresponding to fea(cid:173)\nture histogram statistics (d. Minimax Entropy Learning (MEL) \nby Zhu, Wu and Mumford 1997 [11]) . We first use our analysis \nof the g-factor to determine when the clique potentials decouple \nfor different features . Second, we show that clique potentials can \nbe computed analytically by approximating the g-factor. Third, \nwe demonstrate a connection between this approximation and the \nGeneralized Iterative Scaling algorithm (GIS), due to Darroch and \nRatcliff 1972 [2], for calculating potentials. This connection en(cid:173)\nables us to use GIS to improve our multinomial approximation, \nusing Bethe-Kikuchi[8] approximations to simplify the GIS proce(cid:173)\ndure. We support our analysis by computer simulations. \n\n1 \n\nIntroduction \n\nThere has recently been a lot of interest in learning probability models for vision. \nThe most common approach is to learn histograms of filter responses or, equiva(cid:173)\nlently, to learn probability distributions on features (see right panel of figure (1)). \nSee, for example, [6], [5], [4]. (In this paper the features we are considering will be \nextracted from the image by filters - hence we use the terms \"features\" and \"filters\" \nsynonymously. ) \n\n\fAn alternative approach, however, is to learn probability distributions on the images \nthemselves. The Minimax Entropy Learning (MEL) theory [11] uses the maximum \nentropy principle to learn MRF distributions in terms of clique potentials deter(cid:173)\nmined by the feature statistics (i.e. histograms of filter responses). (We note that \nthe maximum entropy principle is equivalent to performing maximum likelihood es(cid:173)\ntimation on an MRF whose form is determined by the choice of feature statistics.) \nWhen applied to texture modeling it gives a way to unify the filter based approaches \n(which are often very effective) with the MRF distribution approaches (which are \ntheoretically attractive). \n\n) \\ \n\nFigure 1: Distributions on images vs. distributions on features. Left and center \npanels show a natural image and its image gradient magnitude map, respectively. \nRight panel shows the empirical histogram (i.e. a distribution on a feature) of \nthe image gradient across a dataset of natural images. This feature distribution \ncan be used to create a MRF distribution over images[10]. 
As we describe in this paper (see figure (1)), distributions on images and on features can be related by a g-factor (such factors arise in statistical physics, see [3]). Understanding the g-factor allows us to approximate it in a form that helps explain why the clique potentials learned by MEL take the form that they do as functions of the feature statistics. Moreover, the MEL clique potentials for different features often seem to be decoupled and the g-factor can explain why, and when, this occurs. (I.e. the two clique potentials corresponding to two features A and B are identical whether we learn them jointly or independently.)

The g-factor is determined only by the form of the features chosen and the spatial lattice and quantization of the image gray-levels. It is completely independent of the training image data. It should be stressed that the choice of image lattice, gray-level quantization and histogram quantization can make a big difference to the g-factor and hence to the probability distributions which are the output of MEL.

In Section (2), we briefly review Minimax Entropy Learning. Section (3) introduces the g-factor and determines conditions for when clique potentials are decoupled. In Section (4) we describe a simple approximation which enables us to learn the clique potentials analytically, and in Section (5) we discuss connections between this approximation and the Generalized Iterative Scaling (GIS) algorithm.

2 Minimax Entropy Learning

Suppose we have training image data which we assume has been generated by an (unknown) probability distribution $P_T(x)$, where $x$ represents an image. Minimax Entropy Learning (MEL) [11] approximates $P_T(x)$ by selecting the distribution with maximum entropy constrained by the observed feature statistics $\vec{\phi}(x) = \vec{\psi}_{obs}$. This gives

$$P(x|\vec{\lambda}) = \frac{e^{\vec{\lambda} \cdot \vec{\phi}(x)}}{Z[\vec{\lambda}]},$$

where $\vec{\lambda}$ is a parameter chosen such that $\sum_x P(x|\vec{\lambda})\, \vec{\phi}(x) = \vec{\psi}_{obs}$, or equivalently, so that $\frac{\partial \log Z[\vec{\lambda}]}{\partial \vec{\lambda}} = \vec{\psi}_{obs}$.

We will treat the special case where the statistics $\vec{\phi}$ are the histogram of a shift-invariant filter $\{f_i(x) : i = 1, \ldots, N\}$, where $N$ is the total number of pixels in the image. So $\psi_a = \phi_a(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{a, f_i(x)}$, where $a = 1, \ldots, Q$ indicates the (quantized) filter response values. The potentials become

$$\vec{\lambda} \cdot \vec{\phi}(x) = \frac{1}{N} \sum_{a=1}^{Q} \sum_{i=1}^{N} \lambda(a)\, \delta_{a, f_i(x)} = \frac{1}{N} \sum_{i=1}^{N} \lambda(f_i(x)).$$

Hence $P(x|\vec{\lambda})$ becomes an MRF distribution with clique potentials given by $\lambda(f_i(x))$. This determines a Markov random field with the clique structure given by the filters $\{f_i\}$.

MEL also has a feature selection stage based on Minimum Entropy to determine which features to use in the Maximum Entropy Principle. The features are evaluated by computing the entropy $-\sum_x P(x|\vec{\lambda}) \log P(x|\vec{\lambda})$ for each choice of features (with small entropies being preferred). A filter pursuit procedure was described to determine which filters/features should be considered (our approximations work for this also).
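In this histogram case the MEL model is easy to state operationally. The following sketch is an illustration rather than code from the paper: filter_fn stands for any shift-invariant filter returning integer-quantized responses in {0, ..., Q-1} as a NumPy array (the paper indexes responses 1, ..., Q), and lam is the length-Q table of clique potential values $\lambda(a)$.

    import numpy as np

    def histogram_statistic(image, filter_fn, Q):
        # phi_a(x) = (1/N) sum_i delta(a, f_i(x)),  responses indexed a = 0, ..., Q-1 here
        responses = filter_fn(image)               # integer-quantized filter responses
        return np.bincount(responses.ravel(), minlength=Q) / responses.size

    def log_prob_unnormalized(image, filter_fn, lam):
        # lambda . phi(x) = (1/N) sum_i lambda(f_i(x)); equals log P(x|lambda) + log Z[lambda]
        responses = filter_fn(image)
        return float(np.mean(lam[responses.ravel()]))

With this convention the clique potential $\lambda(f_i(x))$ is just a table lookup, and $\vec{\lambda} \cdot \vec{\phi}(x)$ is the mean potential over the lattice.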
3 The g-Factor

This section defines the g-factor and starts investigating its properties in subsection (3.1). In particular, when, and why, do clique potentials decouple? More precisely, when do the potentials for filters A and B learned simultaneously differ from the potentials for the two filters when they are learned independently?

We address these issues by introducing the g-factor $g(\vec{\psi})$ and the associated distribution $P_0(\vec{\psi})$:

$$g(\vec{\psi}) = \sum_{x} \delta_{\vec{\psi}, \vec{\phi}(x)}, \qquad P_0(\vec{\psi}) = \frac{g(\vec{\psi})}{L^N}. \qquad (1)$$

Figure 2: The g-factor $g(\vec{\psi})$ counts the number of images $x$ that have statistics $\vec{\psi}$. Note that the g-factor depends only on the choice of filters and is independent of the training image data.

Here $L$ is the number of grayscale levels of each pixel, so that $L^N$ is the total number of possible images. The g-factor is essentially a combinatorial factor which counts the number of ways that one can obtain statistics $\vec{\psi}$, see figure (2). Equivalently, $P_0$ is the default distribution on $\vec{\psi}$ if the images are generated by white noise (i.e. completely random images).

We can use the g-factor to compute the induced distribution $P(\vec{\psi}|\vec{\lambda})$ on the statistics determined by MEL:

$$P(\vec{\psi}|\vec{\lambda}) = \sum_{x} \delta_{\vec{\psi}, \vec{\phi}(x)}\, P(x|\vec{\lambda}) = \frac{g(\vec{\psi})\, e^{\vec{\lambda}\cdot\vec{\psi}}}{Z[\vec{\lambda}]}, \qquad Z[\vec{\lambda}] = \sum_{\vec{\psi}} g(\vec{\psi})\, e^{\vec{\lambda}\cdot\vec{\psi}}. \qquad (2)$$

Observe that both $P(\vec{\psi}|\vec{\lambda})$ and $\log Z[\vec{\lambda}]$ are sufficient for computing the parameters $\vec{\lambda}$. The $\vec{\lambda}$ can be found by solving either of the following two (equivalent) equations: $\sum_{\vec{\psi}} P(\vec{\psi}|\vec{\lambda})\, \vec{\psi} = \vec{\psi}_{obs}$, or $\frac{\partial \log Z[\vec{\lambda}]}{\partial \vec{\lambda}} = \vec{\psi}_{obs}$, which shows that knowledge of the g-factor and $e^{\vec{\lambda}\cdot\vec{\psi}}$ is all that is required to do MEL.

Observe from equation (2) that we have $P(\vec{\psi}|\vec{\lambda} = 0) = P_0(\vec{\psi})$. In other words, setting $\vec{\lambda} = 0$ corresponds to a uniform distribution on the images $x$.

3.1 Decoupling Filters

We now derive an important property of the minimax entropy approach. As mentioned earlier, it often seems that the potentials for filters A and B decouple. In other words, if one applies MEL to two filters A, B simultaneously by letting $\vec{\psi} = (\vec{\psi}^A, \vec{\psi}^B)$, $\vec{\lambda} = (\vec{\lambda}^A, \vec{\lambda}^B)$, and $\vec{\psi}_{obs} = (\vec{\psi}^A_{obs}, \vec{\psi}^B_{obs})$, then the solutions $\vec{\lambda}^A, \vec{\lambda}^B$ to the equations

$$\sum_{x} P(x|\vec{\lambda}^A, \vec{\lambda}^B)\, \big(\vec{\phi}^A(x), \vec{\phi}^B(x)\big) = \big(\vec{\psi}^A_{obs}, \vec{\psi}^B_{obs}\big) \qquad (3)$$

are the same (approximately) as the solutions to the equations $\sum_x P(x|\vec{\lambda}^A)\, \vec{\phi}^A(x) = \vec{\psi}^A_{obs}$ and $\sum_x P(x|\vec{\lambda}^B)\, \vec{\phi}^B(x) = \vec{\psi}^B_{obs}$, see figure (3) for an example.

Figure 3: Evidence for decoupling of features. The left and right panels show the clique potentials learned for the features $\partial/\partial x$ and $\partial/\partial y$ respectively. The solid lines give the potentials when they are learned individually. The dashed lines show the potentials when they are learned simultaneously. Figure courtesy of Prof. Xiuwen Liu, Florida State University.

We now show how this decoupling property arises naturally if the g-factor for the two filters factorizes. This factorization, of course, is a property only of the form of the statistics and is completely independent of whether the statistics of the two filters are dependent for the training data.

Property I: Suppose we have two sufficient statistics $\vec{\phi}^A(x), \vec{\phi}^B(x)$ which are independent on the lattice in the sense that $g(\vec{\psi}^A, \vec{\psi}^B) = g^A(\vec{\psi}^A)\, g^B(\vec{\psi}^B)$. Then $\log Z[\vec{\lambda}^A, \vec{\lambda}^B] = \log Z^A[\vec{\lambda}^A] + \log Z^B[\vec{\lambda}^B]$ and $P(\vec{\psi}^A, \vec{\psi}^B) = P^A(\vec{\psi}^A)\, P^B(\vec{\psi}^B)$.
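The first claim of Property I can be checked in one line; nothing beyond equation (2) and the assumed factorization of the g-factor is used:

$$Z[\vec{\lambda}^A, \vec{\lambda}^B] = \sum_{\vec{\psi}^A, \vec{\psi}^B} g^A(\vec{\psi}^A)\, g^B(\vec{\psi}^B)\, e^{\vec{\lambda}^A \cdot \vec{\psi}^A + \vec{\lambda}^B \cdot \vec{\psi}^B} = \Big( \sum_{\vec{\psi}^A} g^A(\vec{\psi}^A)\, e^{\vec{\lambda}^A \cdot \vec{\psi}^A} \Big) \Big( \sum_{\vec{\psi}^B} g^B(\vec{\psi}^B)\, e^{\vec{\lambda}^B \cdot \vec{\psi}^B} \Big) = Z^A[\vec{\lambda}^A]\, Z^B[\vec{\lambda}^B],$$

so that $\log Z$ is additive, and dividing $g^A(\vec{\psi}^A)\, g^B(\vec{\psi}^B)\, e^{\vec{\lambda}^A \cdot \vec{\psi}^A + \vec{\lambda}^B \cdot \vec{\psi}^B}$ by this product gives $P(\vec{\psi}^A, \vec{\psi}^B) = P^A(\vec{\psi}^A)\, P^B(\vec{\psi}^B)$, the factorization stated above.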
This implies that the parameters $\vec{\lambda}^A, \vec{\lambda}^B$ can be solved from the independent equations $\frac{\partial \log Z^A[\vec{\lambda}^A]}{\partial \vec{\lambda}^A} = \vec{\psi}^A_{obs}$ and $\frac{\partial \log Z^B[\vec{\lambda}^B]}{\partial \vec{\lambda}^B} = \vec{\psi}^B_{obs}$, or equivalently $\sum_{\vec{\psi}^A} P^A(\vec{\psi}^A)\, \vec{\psi}^A = \vec{\psi}^A_{obs}$ and $\sum_{\vec{\psi}^B} P^B(\vec{\psi}^B)\, \vec{\psi}^B = \vec{\psi}^B_{obs}$.

Moreover, the resulting distribution $P(x)$ can be obtained by multiplying the distributions $(1/Z^A)\, e^{\vec{\lambda}^A \cdot \vec{\phi}^A(x)}$ and $(1/Z^B)\, e^{\vec{\lambda}^B \cdot \vec{\phi}^B(x)}$ together.

The point here is that the potential terms for the two statistics $\vec{\psi}^A, \vec{\psi}^B$ decouple if the phase factor $g(\vec{\psi}^A, \vec{\psi}^B)$ can be factorized. We conjecture that this is effectively the case for many linear filters used in vision processing. For example, it is plausible that the g-factor for features $\partial/\partial x$ and $\partial/\partial y$ factorizes - and figure (3) shows that their clique potentials do decouple (approximately). Clearly, if factorization between filters occurs then it gives great simplification to the system.

4 Approximating the g-factor for a Single Histogram

We now consider the case where the statistic is a single histogram. Our aim is to understand why features whose histograms are of stereotypical shape give rise to potentials of the form given by figure (3). Our results, of course, can be directly extended to multiple histograms if the filters decouple, see subsection (3.1). We first describe the approximation and then discuss its relevance for filter pursuit. We rescale the $\vec{\lambda}$ variables by $N$ so that we have:

$$P(x|\vec{\lambda}) = \frac{e^{N \vec{\lambda} \cdot \vec{\phi}(x)}}{Z[\vec{\lambda}]}, \qquad P(\vec{\psi}|\vec{\lambda}) = g(\vec{\psi})\, \frac{e^{N \vec{\lambda} \cdot \vec{\psi}}}{Z[\vec{\lambda}]}. \qquad (4)$$

We now consider the approximation that the filter responses $\{f_i\}$ are independent of each other when the images are uniformly distributed. This is the multinomial approximation. (We attempted a related approximation [1] which was less successful.) It implies that we can express the phase factor as being proportional to a multinomial distribution.
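For a lattice small enough to enumerate, no approximation is needed: the g-factor of equation (1) can be tabulated by brute force and the potentials then solved directly from equation (2). The sketch below is an illustration rather than the authors' code, and everything specific in it (the 2x2 lattice, L = 4 gray levels, the horizontal-difference filter, the gradient step size and iteration count, and the example histogram) is an arbitrary assumption; it shows the exact computation that approximations such as the multinomial one are meant to replace for realistically sized images.

    import itertools
    import numpy as np
    from collections import Counter

    # Toy setting: 2x2 images with L = 4 gray levels (N = 4 pixels) and a filter
    # whose quantized response takes Q = 4 values.
    L, H, W = 4, 2, 2
    N = H * W
    Q = L

    def phi_counts(image):
        # Unnormalized histogram (counts N * psi_a) of a simple shift-invariant
        # filter: horizontal difference modulo L (an arbitrary illustrative choice).
        x = np.array(image).reshape(H, W)
        responses = (x - np.roll(x, 1, axis=1)) % L
        return tuple(np.bincount(responses.ravel(), minlength=Q))

    # g-factor (equation (1)): the number of images with each histogram.
    g = Counter(phi_counts(im) for im in itertools.product(range(L), repeat=N))
    keys = list(g)
    psis = np.array(keys, dtype=float) / N                      # normalized histograms psi
    log_g = np.log(np.array([g[k] for k in keys], dtype=float))

    def fit_lambda(psi_obs, steps=20000, lr=0.2):
        # Solve sum_psi P(psi|lambda) psi = psi_obs, with P(psi|lambda) as in
        # equation (2) (using the N-rescaled lambda of equation (4)).
        lam = np.zeros(Q)
        for _ in range(steps):
            log_w = log_g + N * (psis @ lam)
            w = np.exp(log_w - log_w.max())                     # stable up to a constant factor
            P = w / w.sum()                                     # P(psi | lambda)
            lam += lr * (psi_obs - P @ psis)                    # ascent direction for the log-likelihood
        return lam

    # Example: an observed histogram concentrated on response 0 (smooth images).
    print(fit_lambda(np.array([0.7, 0.1, 0.1, 0.1])))

Working with log g and subtracting the maximum before exponentiating keeps the computation numerically stable even though g ranges over many orders of magnitude.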
In the GIS iteration, $\vec{\psi}^{(t)}$ denotes the expected histogram of the distribution $P^{(t)}(x)$ at time $t$. This implies that the corresponding clique potential update equation is given by $\lambda_a^{(t+1)} = \lambda_a^{(t)} + \log \psi_{obs,a} - \log \psi_a^{(t)}$. If we initialize GIS so that the initial distribution is the uniform distribution, i.e. $P^{(0)}(x) = L^{-N}$, then the distribution after one iteration is $P^{(1)}(x) \propto e^{\sum_a \cdots}$
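Stated in code, the update above takes only a few lines. The sketch below is an illustration, not the authors' implementation: expected_histogram stands for whatever routine supplies $\vec{\psi}^{(t)}$ under the current model (in practice a sampler or one of the approximations discussed in this paper), and the iteration count and the small floor eps are arbitrary choices.

    import numpy as np

    def gis_fit(psi_obs, expected_histogram, Q, iters=100, eps=1e-12):
        # psi_obs: observed feature histogram (length Q, entries sum to 1).
        # expected_histogram(lam): returns psi^(t), the expected histogram under the
        #   MRF P^(t)(x) whose clique potentials are the current table lam.
        lam = np.zeros(Q)          # lambda = 0 gives the uniform distribution on images
        for _ in range(iters):
            psi_t = expected_histogram(lam)
            # GIS-style update: lambda_a^(t+1) = lambda_a^(t) + log psi_obs,a - log psi_a^(t)
            lam = lam + np.log(psi_obs + eps) - np.log(psi_t + eps)
        return lam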