{"title": "A Phase Space Approach to Minimax Entropy Learning and the Minutemax Approximations", "book": "Advances in Neural Information Processing Systems", "page_first": 761, "page_last": 767, "abstract": null, "full_text": "A Phase Space Approach to Minimax \nEntropy Learning and the Minutemax \n\nApproximations \n\nJames M. Coughlan \nSmith-Kettlewell Inst. \n\nSan Francisco, CA 94115 \n\nA.L.Yuille \n\nSmith-Kettlewell Inst. \n\nSan Francisco, CA 94115 \n\nAbstract \n\nThere has been much recent work on measuring image statistics \nand on learning probability distributions on images. We observe \nthat the mapping from images to statistics is many-to-one and \nshow it can be quantified by a phase space factor. This phase \nspace approach throws light on the Minimax Entropy technique for \nlearning Gibbs distributions on images with potentials derived from \nimage statistics and elucidates the ambiguities that are inherent to \ndetermining the potentials. In addition, it shows that if the phase \nfactor can be approximated by an analytic distribution then this \napproximation yields a swift \"Minutemax\" algorithm that vastly \nreduces the computation time for Minimax entropy learning. An \nillustration of this concept, using a Gaussian to approximate the \nphase factor, gives a good approximation to the results of Zhu \nand Mumford (1997) in just seconds of CPU time. The phase \nspace approach also gives insight into the multi-scale potentials \nfound by Zhu and Mumford (1997) and suggests that the forms of \nthe potentials are influenced greatly by phase space considerations. \nFinally, we prove that probability distributions learned in feature \nspace alone are equivalent to Minimax Entropy learning with a \nmultinomial approximation of the phase factor. \n\n1 \n\nIntroduction \n\nBayesian probability theory gives a powerful framework for visual perception (Knill \nand Richards 1996). This approach, however, requires specifying prior probabilities \nand likelihood functions. Learning these probabilities is difficult because it requires \nestimating distributions on random variables of very high dimensions (for example, \nimages with 200 x 200 pixels, or shape curves of length 400 pixels). An important \n\n\f762 \n\nJ M. Coughlan and A. L. Yuille \n\nrecent advance is the Minimax Entropy Learning theory. This theory was developed \nby Zhu, Wu and Mumford (1997 and 1998) and enables them to learn probability \ndistributions for the intensity properties and shapes of natural stimuli and clutter. \nIn addition, when applied to real world images it has an interesting link to the work \non natural image statistics (Field 1987), (Ruderman and Bialek 1994), (Olshaussen \nand Field 1996). We wish to simplify Minimax and make the learning easier, faster \nand more transparent. \n\nIn this paper we present a phase space approach to Minimax Entropy learning. This \napproach is based on the observation that the mapping from images to statistics \nis many-to-one and can be quantified by a phase space factnr. If this phase space \nfactor can be approximated by an analytic function then we obtain approximate \n\"Minutemax\" algorithms which greatly speed up the learning process. In one version \nof this approximation, the unknown parameters of the distribution to be learned \nare related linearly to the empirical statistics of the image data set, and may be \nsolved for in seconds or less. 
Independent of this approximation, the Minutemax framework also illuminates an important combinatoric aspect of Minimax Entropy learning, namely the fact that many different images can give rise to the same image statistics. This "phase space" factor explains the ambiguities inherent in learning the parameters of the unknown distribution, and motivates the approximation that reduces the problem to linear algebra. Finally, we prove that probability distributions learned in feature space alone are equivalent to Minimax Entropy learning with a multinomial approximation of the phase factor.

2 A Phase Space Perspective on Minimax

We wish to learn a distribution P(I) on images, where I denotes the set of pixel values I(x, y) on a finite image lattice, and each value I(x, y) is quantized to a finite set of intensity values. (In fact, this approach is general and applies to any patterns, not just images.) We define a set of image statistics φ_1(I), φ_2(I), ..., φ_S(I), which we concatenate into a single vector function φ(I). If these statistics have empirical mean d = ⟨φ(I)⟩ on a dataset of images (we assume a large enough dataset for the law of large numbers to apply; see Zhu and Mumford (1997) for an analysis of the errors inherent in this assumption), then the maximum entropy distribution P_M(I) with these empirical statistics is an exponential (Gibbs) distribution of the form

    P_M(I) = e^{λ·φ(I)} / Z(λ),    (1)

where the potential λ is set so that ⟨φ(I)⟩_M = d.

In summary, the goal of Minimax Entropy learning is to find an appropriate set of image filters for the domain of interest (i.e. maximally informative filters) and to estimate λ given d. Extensive computation is required to determine λ; the phase space approach to Minimax Entropy learning motivates approximations that make λ easy to estimate.

2.1 Image Histogram Statistics

The statistics we consider (following Zhu, Wu and Mumford (1997, 1998)) are defined as histograms of the responses of one or more filters applied across an entire image. Consider a single filter f (linear or non-linear) with response f_x(I) centered at position x in the image. Without loss of generality, we will assume the filter has quantized integer responses from 1 through f_max.

For notational convenience we transform the filter response f_x(I) to a binary representation b_x(I), defined as a column vector with f_max components: b_{x,z}(I) = δ_{z, f_x(I)}, where the index z ranges from 1 through f_max. This vector is composed of all zeros except for the entry corresponding to the filter response, which is set to one. The image statistics vector is then a histogram vector defined as the average of the b_x(I)'s over all N pixels: φ(I) = (1/N) Σ_x b_x(I). The entries of φ(I) then sum to 1. (We can generalize to the case of multiple filters f^(1), f^(2), ..., f^(m), as detailed in Coughlan and Yuille (1999).)
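To make this construction concrete, the following minimal Python sketch (ours, not from the paper) computes φ(I) for a single discretized ∂I/∂x filter; the wrap-around boundary handling and the shift used to map responses into {1, ..., f_max} are illustrative choices, not the authors' implementation.

```python
import numpy as np

def histogram_statistics(image, f_max):
    """Histogram statistics vector phi(I) for a horizontal-gradient
    filter with responses quantized to 1..f_max (illustrative sketch)."""
    # Filter response f_x(I): discretized dI/dx with kernel (1, -1),
    # wrapping at the boundary so every pixel has a response.
    responses = image - np.roll(image, -1, axis=1)
    # Shift the signed integer responses into the range 1..f_max.
    responses = np.clip(responses + f_max // 2 + 1, 1, f_max).astype(int)
    # b_x(I) is the one-hot representation of each response; phi(I) is
    # the average of b_x(I) over all N pixels, so its entries sum to 1.
    phi = np.bincount(responses.ravel(), minlength=f_max + 1)[1:]
    return phi / responses.size

# Example: a random image with Q = 8 intensity levels.
rng = np.random.default_rng(0)
I = rng.integers(0, 8, size=(64, 64))
phi = histogram_statistics(I, f_max=15)
print(phi.sum())  # 1.0
```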
2.2 The Phase Factor

The original Minimax distribution P_M(I) induces a distribution P_M(φ) on the statistics themselves, without reference to a particular image:

    P_M(φ) = g(φ) e^{λ·φ} / Z(λ),    (2)

where g(φ) is a combinatoric phase space factor, with a corresponding normalized combinatoric distribution ḡ(φ), defined by:

    g(φ₀) = Σ_I δ_{φ₀, φ(I)},   and   ḡ(φ) = g(φ) / Q^N,    (3)

where the phase space factor g(φ) counts the number of images I having statistics φ. N is the number of pixels and Q is the number of pixel intensity levels, i.e. Q^N is the total number of possible images I. It should be emphasized that the phase factor depends only on the set of filters chosen and is independent of the true distribution P(I). Thus the phase factor can be computed offline, independent of the image data set.

In this paper we will discuss two useful approximations to g(φ): a Gaussian approximation, which yields the swift approximation for learning, and a multinomial approximation, which establishes a connection between Minimax Entropy learning and standard feature learning.

2.3 The Non-Uniqueness of the Potential λ

Given a set of filters and their empirical mean statistics d, is the potential λ uniquely specified? Clearly, any solution for λ may be shifted by an additive constant (λ_i → λ'_i = λ_i + k for all i), yielding a different normalization constant Z(λ') but preserving P_M(I). In this section we show that other, non-trivial ambiguities in λ which preserve P_M(I) can exist, stemming from the fact that some values of φ are inconsistent with every possible image I and hence never arise (in any possible image dataset). These "intrinsic" ambiguities are inherent to Minimax Entropy learning and are independent of the true distribution P(I). We will also discuss a second type of possible ambiguity which depends on the characteristics of the image dataset used for learning.

We can uncover the intrinsic ambiguities in λ by examining the covariance C of ḡ(φ). (See Coughlan and Yuille (1999) for details on calculating the mean c and covariance C for any set of linear filters, or non-linear filters that are scalar functions of linear filters.) Defining the set Φ of all possible statistics values (i.e. those φ for which g(φ) > 0), any vector u in the null space of C satisfies u·φ = u·c for every φ ∈ Φ, since u·φ has zero variance under ḡ. Shifting λ → λ + u therefore multiplies e^{λ·φ(I)} by the constant factor e^{u·c}, which is absorbed into Z(λ), leaving P_M(I) unchanged: the null-space directions of C are the intrinsic ambiguities in λ.
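Since ḡ(φ) is by definition the distribution of φ(I) when images are drawn uniformly from all Q^N possibilities, its mean c and covariance C can be estimated offline by simple Monte Carlo. The sketch below (continuing the code above; Coughlan and Yuille (1999) instead compute c and C analytically) also exhibits one intrinsic ambiguity directly: the entries of φ always sum to 1, so the all-ones direction lies exactly in the null space of C.

```python
# Monte Carlo estimate of the mean c and covariance C of the normalized
# phase factor g-bar, by drawing images uniformly from all Q^N images.
# Illustrative only; the sample sizes here are arbitrary.
rng = np.random.default_rng(1)
Q, shape, f_max, n_samples = 8, (32, 32), 15, 5000

samples = np.array([
    histogram_statistics(rng.integers(0, Q, size=shape), f_max)
    for _ in range(n_samples)
])
c = samples.mean(axis=0)            # combinatoric mean c
C = np.cov(samples, rowvar=False)   # combinatoric covariance C

# A direction u with Cu = 0 gives u . phi = u . c for all feasible phi:
# an intrinsic ambiguity in lambda. The all-ones vector is one such u,
# because the entries of every histogram phi(I) sum to 1.
eigvals, eigvecs = np.linalg.eigh(C)
u = eigvecs[:, 0]                   # eigenvector of smallest eigenvalue
print(eigvals[0])                   # ~0
print(np.std(samples @ u))          # ~0: u . phi is (nearly) constant
```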
A second type of ambiguity depends on the image dataset. (As shown in Coughlan and Yuille (1999), there is a convex set of distributions, of which the true distribution P(I) is a member, which share the same mean statistics ⟨φ⟩.) This second kind of ambiguity stems from the fact that the mean statistics convey only a fraction of the information that is contained in the true distribution P(I). To resolve this second ambiguity it is necessary to extract more information from the image data set. The simplest way to achieve this is to use a larger (or more informative) set of filters to lower the entropy of P_M(I) (this topic is discussed in more detail in Zhu, Wu and Mumford (1997, 1998) and Coughlan and Yuille (1999)). Alternatively, one can extend Minimax Entropy learning to include second-order statistics, i.e. the covariance of φ in addition to its mean d. This is an important topic for future research.

3 The Minutemax Approximations

We now illustrate the phase space approach by showing that suitable approximations of the phase space factor g(φ) make it easy to estimate the potential λ given the empirical mean d. The resulting fast approximations to Minimax Entropy learning are called "Minutemax" algorithms.

3.1 The Gaussian Approximation of g(φ)

If the phase space factor g(φ) may be approximated as a multi-variate Gaussian (see Coughlan and Yuille (1999) for a justification of this approximation), then the probability distribution P_M(φ) = g(φ) e^{λ·φ} / Z(λ) reduces to another multi-variate Gaussian. (Note that we are making the Gaussian approximation in φ space, the space of all possible image statistics histograms, and not in filter response (feature) space.) As we will see, this result greatly simplifies the problem of estimating the potential λ.

Recall that the mean and covariance of ḡ(φ) are denoted by c and C, respectively. The null space of C has dimension n and is spanned by vectors u^(1), u^(2), ..., u^(n). As discussed in Theorem 1, for all feasible values of φ (i.e. all φ ∈ Φ) the components u^(i)·φ are fixed at u^(i)·c, so the component of λ along the null space of C does not affect P_M(I). Under the Gaussian approximation, multiplying the Gaussian ḡ(φ), with mean c and covariance C, by e^{λ·φ} yields a Gaussian with the same covariance and with mean c + Cλ. Matching this mean to the empirical statistics, ⟨φ⟩_gauss = ⟨φ⟩_M = d, and so we can write a linear equation relating λ and d:

    d = c + Cλ.
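In code, the Gaussian Minutemax step is therefore a single linear solve. Because C is singular (its null space encodes the intrinsic ambiguities), a pseudo-inverse is one reasonable choice: it returns the solution of d = c + Cλ with no component along the null space. The sketch below continues the session above and fakes a small "dataset" of smoothed random images, purely so that d differs from c; the real d would come from natural images.

```python
# Fake a "dataset" whose images are smoother than uniform noise, so the
# empirical mean statistic d is peaked around small gradients.
def smooth_image(img):
    # crude smoothing: average each pixel with its right neighbor (wrap)
    return (img + np.roll(img, -1, axis=1)) // 2

dataset = [smooth_image(rng.integers(0, Q, size=shape)) for _ in range(500)]
d = np.mean([histogram_statistics(J, f_max) for J in dataset], axis=0)

# Gaussian Minutemax: solve d = c + C lambda. The pseudo-inverse picks
# the solution with zero component along the null space of C, i.e. it
# fixes the intrinsic ambiguities in lambda arbitrarily but harmlessly.
lam = np.linalg.pinv(C) @ (d - c)
print(np.max(np.abs(c + C @ lam - d)))  # small if d - c lies in range(C)
```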
It can be shown (Zhu, private communication) that solving this equation is equivalent to one step of Newton-Raphson minimization of an appropriate cost function. This will fail to be a good approximation if the cost function is highly non-quadratic. As explained in Coughlan and Yuille (1999), the Gaussian approximation is also equivalent to a second-order perturbation expansion of the partition function Z(λ); higher-order corrections can be made by computing higher-order moments of g(φ).

3.2 Experimental Results

We tested the Gaussian Minutemax procedure on two sets of filters: a single (fine scale) image gradient filter ∂I/∂x, and a set of multi-scale image gradient filters defined at three scales, similar to those used by Zhu and Mumford (1997). In both sets, the fine scale gradient filter is linear with kernel (1, -1), representing a discretization of ∂/∂x. In the second set, the medium scale filter kernel is (U_2, -U_2)/4 and the coarse scale kernel is (U_4, -U_4)/16, where U_n denotes the n x n matrix of all ones. The responses of the medium and coarse filters were rounded (i.e. quantized) to the nearest integer, thus adding a non-linearity to these filters. Finally, d was measured on a data set of over 100 natural images; the fine scale components of d are shown in the first panel of Figure 1 and were empirically very similar to the medium and coarse scale components.

A λ that solves d = c + Cλ is shown in the third panel of Figure 1 for the first filter (along with c in the second panel) and in the three panels of Figure 2 for the multi-scale filter set. The form of λ is qualitatively similar to that obtained by Zhu and Mumford (1997) (bearing in mind that Zhu disregarded any filter responses with magnitude above Q/2, i.e. his filter response range is half of ours). In addition, the eigenvectors of C with small eigenvalues are large away from the origin, so one should not trust the values of the potentials there (obtained by any algorithm).

Zhu and Mumford (1997) report interactions between filters applied at different scales. This is because the resulting potentials appear different from the potential at the fine scale even though the histograms appear similar at all scales. We argue, however, that some of this "interaction" is due to the different phase factors at different scales. In other words, the potentials would look different at different scales, even if the empirical histograms were identical, because of the differing phase factors.

Figure 2: From left to right: the fine, medium and coarse components of -λ as computed by the Gaussian Minutemax approximation.

Figure 3: Left to right: d, c, and -λ as given by the multinomial approximation for the ∂/∂x filter at fine scale.

3.3 The Multinomial Approximation of g(φ)

Many learning theories simply construct probability distributions on feature space. How do they differ from Minimax Entropy learning, which works on image space? By examining the phase factor we will show that the two approaches are not identical in general. Feature space learning ignores the coupling between the filters which arises from how the statistics are obtained.
More precisely, the probability distribution obtained on feature space, P_F, is equivalent to the Minimax distribution P_M if, and only if, the phase factor is multinomial.

We begin the analysis by considering a single filter. As before, we define the combinatoric mean c = Σ_φ ḡ(φ) φ. The multinomial approximation of g(φ) is equivalent to assuming that the combinatoric frequencies of filter responses are independent from pixel to pixel. Since the combinatoric frequency of filter response j ∈ {1, 2, ..., f_max} is c_j and there are N pixels, the multinomial approximation to the normalized phase factor is

    ḡ_mult(φ) = (N! / ∏_j (Nφ_j)!) ∏_j c_j^{Nφ_j}.
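A closed-form consequence (our derivation under the stated independence assumption, so hedged accordingly): if the phase factor is multinomial then P_M(I) factorizes over pixels, with per-pixel response distribution proportional to c_j e^{λ_j/N} (the factor 1/N arises because φ(I) is the histogram of counts divided by N). Matching this to the empirical frequency d_j gives λ_j = N(log d_j - log c_j) up to an additive constant, which is exactly feature-space learning. Continuing the sketch above:

```python
# Multinomial Minutemax: with the multinomial (independent-pixel) phase
# factor, matching the per-pixel response frequency c_j * exp(lambda_j/N)
# to the empirical d_j gives, up to an additive constant,
#     lambda_j = N * (log d_j - log c_j).
# Assumed derivation, not the paper's code; eps guards empty bins.
N = np.prod(shape)
eps = 1e-12
lam_mult = N * (np.log(d + eps) - np.log(c + eps))
# Compare with the Gaussian Minutemax solution lam, modulo the additive
# constant ambiguity (remove the mean of each potential before comparing).
print(np.round(lam_mult - lam_mult.mean(), 2))
print(np.round(lam - lam.mean(), 2))
```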