{"title": "Empirical Entropy Manipulation for Real-World Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 851, "page_last": 857, "abstract": null, "full_text": "Empirical Entropy Manipulation for \n\nReal-World Problems \n\nPaul Viola: Nicol N. Schraudolph, Terrence J. Sejnowski \n\nComputational Neurobiology Laboratory \nThe Salk Institute for Biological Studies \n\n10010 North Torrey Pines Road \n\nLa Jolla, CA 92037-1099 \n\nviola@salk.edu \n\nAbstract \n\nNo finite sample is sufficient to determine the density, and therefore \nthe entropy, of a signal directly. Some assumption about either the \nfunctional form of the density or about its smoothness is necessary. \nBoth amount to a prior over the space of possible density functions. \nBy far the most common approach is to assume that the density \nhas a parametric form. \n\nBy contrast we derive a differential learning rule called EMMA \nthat optimizes entropy by way of kernel density estimation. En(cid:173)\ntropy and its derivative can then be calculated by sampling from \nthis density estimate. The resulting parameter update rule is sur(cid:173)\nprisingly simple and efficient. \n\nWe will show how EMMA can be used to detect and correct cor(cid:173)\nruption in magnetic resonance images (MRI). This application is \nbeyond the scope of existing parametric entropy models. \n\n1 \n\nIntroduction \n\nInformation theory is playing an increasing role in unsupervised learning and visual \nprocessing. For example, Linsker has used the concept of information maximization \nto produce theories of development in the visual cortex (Linsker, 1988). Becker and \nHinton have used information theory to motivate algorithms for visual processing \n(Becker and Hinton, 1992). Bell and Sejnowski have used information maximization \n\n\u2022 Author to whom correspondence should be addressed. Current address: M.LT., 545 \n\nTechnology Square, Cambridge, MA 02139. \n\n\f852 \n\nP. VIOLA, N. N. 
SCHRAUDOLPH, T. J. SEJNOWSKI \n\nto solve the \"cocktail party\" or signal separation problem (Bell and Sejnowski, 1995). In order to simplify analysis and implementation, each of these techniques makes specific assumptions about the nature of the signals used, typically that the signals are drawn from some parametric density. In practice, such assumptions are very inflexible. \n\nIn this paper we will derive a procedure that can effectively estimate and manipulate the entropy of a wide variety of signals using non-parametric densities. Our technique is distinguished by its simplicity, flexibility and efficiency. \n\nWe will begin with a discussion of principal components analysis (PCA) as an example of a simple parametric entropy manipulation technique. After pointing out some of PCA's limitations, we will then derive a more powerful non-parametric entropy manipulation procedure. Finally, we will show that the same entropy estimation procedure can be used to tackle a difficult visual processing problem. \n\n1.1 Parametric Entropy Estimation \n\nTypically parametric entropy estimation is a two-step process. We are given a parametric model for the density of a signal and a sample. First, from the space of possible density functions the most probable is selected. This often requires a search through parameter space. Second, the entropy of the most likely density function is evaluated. \n\nParametric techniques can work well when the assumed form of the density matches the actual data. Conversely, when the parametric assumption is violated the resulting algorithms are incorrect. The most common assumption, that the data follow the Gaussian density, is especially restrictive. An entropy maximization technique that assumes that the data are Gaussian, but operates on data drawn from a non-Gaussian density, may in fact end up minimizing entropy. 
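This pitfall can be made concrete with a small numerical sketch (not from the paper; sample sizes and kernel width are illustrative). Two signals with identical variance receive identical entropy estimates under a Gaussian fit, while a kernel-based estimate reveals that the bimodal signal actually has far lower entropy:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_entropy(sample):
    # Entropy of the maximum-likelihood Gaussian fit: 0.5 * log(2*pi*e*var).
    return 0.5 * np.log(2 * np.pi * np.e * sample.var())

def parzen_entropy(a, b, sigma=0.25):
    # Non-parametric estimate: -mean_B[ log mean_A[ g_sigma(y_b - y_a) ] ].
    d = b[:, None] - a[None, :]
    g = np.exp(-d**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return -np.mean(np.log(g.mean(axis=1)))

unimodal = rng.normal(0.0, 1.0, 2000)
# Two tight clusters at +/-1: the variance is also ~1, but the entropy is much lower.
bimodal = np.concatenate([rng.normal(-1.0, 0.05, 1000),
                          rng.normal(+1.0, 0.05, 1000)])

print(gaussian_entropy(unimodal), gaussian_entropy(bimodal))   # nearly equal
print(parzen_entropy(unimodal[:1000], unimodal[1000:]),
      parzen_entropy(bimodal[::2], bimodal[1::2]))             # bimodal far lower
```

Any algorithm that scores "entropy" through the Gaussian fit is blind to the difference between these two signals.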
\n\n1.2 Example: Principal Components Analysis \n\nThere are a number of signal processing and learning problems that can be formulated as entropy maximization problems. One prominent example is principal component analysis (PCA). Given a random variable X, a vector v can be used to define a new random variable, Y_v = X · v, with variance Var(Y_v) = E[(X · v - E[X · v])^2]. The principal component v is the unit vector for which Var(Y_v) is maximized. In practice neither the density of X nor Y_v is known. The projection variance is computed from a finite sample, A, of points from X, \n\nVar(Y_v) ≈ Var_A(Y_v) ≡ E_A[(X · v - E_A[X · v])^2] ,   (1) \n\nwhere Var_A(Y_v) and E_A[·] are shorthand for the empirical variance and mean evaluated over A. Oja has derived an elegant on-line rule for learning v when presented with a sample of X (Oja, 1982). \n\nUnder the assumption that X is Gaussian it is easily proven that Y_v has maximum entropy. Moreover, in the absence of noise, Y_v contains maximal information about X. However, when X is not Gaussian Y_v is generally not the most informative projection. \n\n2 Estimating Entropy with Parzen Densities \n\nWe will now derive a general procedure for manipulating and estimating the entropy of a random variable from a sample. Given a sample of a random variable X, we can \n\n\fEmpirical Entropy Manipulation for Real-world Problems \n\n853 \n\nconstruct another random variable Y = F(X, v). The entropy, h(Y), is a function of v and can be manipulated by changing v. The following derivation assumes that Y is a vector random variable. The joint entropy of two random variables, h(W_1, W_2), can be evaluated by constructing the vector random variable Y = [W_1, W_2]^T and evaluating h(Y). 
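Returning briefly to the PCA example of Section 1.2, Oja's on-line rule can be sketched in a few lines (data, learning rate, and pass count are illustrative, not from the paper): for zero-mean data, the update v += eta * y * (x - y * v) converges to a unit vector along the principal eigenvector.

```python
import numpy as np

rng = np.random.default_rng(1)
# Zero-mean 2-D data whose principal axis is the horizontal direction.
X = rng.normal(size=(5000, 2)) * np.array([3.0, 0.5])

v = np.array([1.0, 1.0]) / np.sqrt(2.0)
eta = 0.005
for _ in range(3):                   # a few passes over the sample
    for x in X:
        y = x @ v                    # projection Y_v = X . v
        v += eta * y * (x - y * v)   # Oja (1982): Hebbian term with decay

print(v)   # close to a unit vector along the first principal axis
```

The decay term -y^2 v keeps |v| near one without an explicit normalization step.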
\n\nRather than assume that the density has a parametric form, whose parameters are selected using maximum likelihood estimation, we will instead use Parzen window density estimation (Duda and Hart, 1973). In the context of entropy estimation, the Parzen density estimate has three significant advantages over maximum likelihood parametric density estimates: (1) it can model the density of any signal provided the density function is smooth; (2) since the Parzen estimate is computed directly from the sample, there is no search for parameters; (3) the derivative of the entropy of the Parzen estimate is simple to compute. \n\nThe form of the Parzen estimate constructed from a sample A is \n\nP*(y, A) = (1/N_A) Σ_{y_A in A} R(y - y_A) = E_A[R(y - y_A)] ,   (2) \n\nwhere the Parzen estimator is constructed with the window function R(·), which integrates to 1. We will assume that the Parzen window function is a Gaussian density function. This will simplify some analysis, but it is not necessary. Any differentiable function could be used. Another good choice is the Cauchy density. \n\nUnfortunately evaluating the entropy integral \n\nh(Y) ≈ -E[log P*(Y, A)] = - ∫ p(y) log P*(y, A) dy \n\nis inordinately difficult. This integral can however be approximated as a sample mean: \n\nh*(Y) ≡ -E_B[log P*(y_B, A)] = -(1/N_B) Σ_{y_B in B} log E_A[R(y_B - y_A)] ,   (3) \n\nwhere E_B[·] is the sample mean taken over the sample B. The sample mean converges toward the true expectation at a rate proportional to 1/sqrt(N_B) (N_B is the size of B). To reiterate, two samples can be used to estimate the entropy of a density: the first is used to estimate the density, the second is used to estimate the entropy¹. We call h*(Y) the EMMA estimate of entropy². \n\nOne way to extremize entropy is to use the derivative of entropy with respect to v. 
\nThis may be expressed as \n\nd/dv h(Y) ≈ d/dv h*(Y) = -(1/N_B) Σ_{y_B in B} [ Σ_{y_A in A} (d/dv) g_ψ(y_B - y_A) ] / [ Σ_{y_A in A} g_ψ(y_B - y_A) ]   (4) \n\n= (1/N_B) Σ_{y_B in B} Σ_{y_A in A} W_Y(y_B, y_A) (d/dv) (1/2) D_ψ(y_B - y_A) ,   (5) \n\nwhere \n\nW_Y(y_1, y_2) = g_ψ(y_1 - y_2) / Σ_{y_A in A} g_ψ(y_1 - y_A) ,   (6) \n\nD_ψ(y) ≡ y^T ψ^{-1} y, and g_ψ(y) is a multi-dimensional Gaussian with covariance ψ. W_Y(y_1, y_2) is an indicator of the degree of match between its arguments, in a \"soft\" sense. It will approach one if y_1 is significantly closer to y_2 than any element of A. To reduce entropy the parameters v are adjusted such that there is a reduction in the average squared distance between points which W_Y indicates are nearby. \n\n¹Using a procedure akin to leave-one-out cross-validation a single sample can be used for both purposes. \n\n²EMMA is a random but pronounceable subset of the letters in the words \"Empirical entropy Manipulation and Analysis\". \n\n2.1 Stochastic Maximization Algorithm \n\nBoth the calculation of the EMMA entropy estimate and its derivative involve a double summation. As a result the cost of evaluation is quadratic in sample size: O(N_A N_B). While an accurate estimate of empirical entropy could be obtained by using all of the available data (at great cost), a stochastic estimate of the entropy can be obtained by using a random subset of the available data (at quadratically lower cost). This is especially critical in entropy manipulation problems, where the derivative of entropy is evaluated many hundreds or thousands of times. Without the quadratic savings that arise from using smaller samples entropy manipulation would be impossible (see (Viola, 1995) for a discussion of these issues). 
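This trade-off can be sketched numerically (sample sizes and kernel width are illustrative, not from the paper): the EMMA estimate computed from two small random subsamples costs only O(N_A N_B) kernel evaluations per call, yet its average agrees well with the far more expensive full-sample estimate.

```python
import numpy as np

rng = np.random.default_rng(2)

def emma_entropy(a, b, sigma):
    # h*(Y) = -E_B[ log E_A[ g_sigma(y_b - y_a) ] ]  -- O(N_A * N_B) work.
    d = b[:, None] - a[None, :]
    g = np.exp(-d**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return -np.mean(np.log(g.mean(axis=1)))

data = rng.normal(0.0, 1.0, 4000)
sigma = 0.25

# Expensive estimate: 2000 x 2000 = 4,000,000 kernel evaluations.
full = emma_entropy(data[:2000], data[2000:], sigma)

# Cheap stochastic estimates: 30 x 30 = 900 kernel evaluations each.
stoch = np.mean([emma_entropy(rng.choice(data, 30), rng.choice(data, 30), sigma)
                 for _ in range(200)])

analytic = 0.5 * np.log(2 * np.pi * np.e)   # true entropy of N(0, 1)
print(full, stoch, analytic)
```

In an optimization loop one would use a single small-sample gradient per step, letting the averaging happen implicitly over many updates.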
\n\n2.2 Estimating the Covariance \n\nIn addition to the learning rate λ, the covariance matrices of the Parzen window functions, g_ψ, are important parameters of EMMA. These parameters may be chosen so that they are optimal in the maximum likelihood sense. For simplicity, we assume that the covariance matrices are diagonal, ψ = DIAG(σ_1^2, σ_2^2, ...). Following a derivation almost identical to the one described in Section 2 we can derive an equation analogous to (4), \n\nd/dσ_k h*(Y) = -(1/N_B) Σ_{y_B in B} Σ_{y_A in A} W_Y(y_B, y_A) (1/σ_k) ( ([y_B - y_A]_k)^2 / σ_k^2 - 1 )   (7) \n\nwhere [y]_k is the kth component of the vector y. The optimal, or most likely, ψ minimizes h*(Y). In practice both v and ψ are adjusted simultaneously; for example, while v is adjusted to maximize h*(Y_v), ψ is adjusted to minimize h*(Y_v). \n\n3 Principal Components Analysis and Information \n\nAs a demonstration, we can derive a parameter estimation rule akin to principal components analysis that truly maximizes information. This new EMMA-based component analysis (ECA) manipulates the entropy of the random variable Y_v = X · v under the constraint that |v| = 1. For any given value of v the entropy of Y_v can be estimated from two samples of X as: h*(Y_v) = -E_B[log E_A[g_ψ(x_B · v - x_A · v)]], where ψ is the variance of the Parzen window function. Moreover we can estimate the derivative of entropy: \n\nd/dv h*(Y_v) = (1/N_B) Σ_{y_B in B} Σ_{y_A in A} W_Y(y_B, y_A) ψ^{-1} (y_B - y_A)(x_B - x_A) , \n\nwhere y_A = x_A · v and y_B = x_B · v. The derivative may be decomposed into parts which can be understood more easily. Ignoring the weighting function W_Y ψ^{-1} we are left with the derivative of some unknown function f(Y_v): \n\nd/dv f(Y_v) = (1/(N_A N_B)) Σ_{y_B in B} Σ_{y_A in A} (y_B - y_A)(x_B - x_A)   (8) \n\nWhat then is f(Y_v)? 
The derivative of the squared difference between samples is \n\nd/dv (y_B - y_A)^2 = 2 (y_B - y_A)(x_B - x_A) , \n\nso we can see that \n\nf(Y_v) = (1/(2 N_A N_B)) Σ_{y_B in B} Σ_{y_A in A} (y_B - y_A)^2 \n\nis one half the expectation of the squared difference between pairs of trials of Y_v. \n\nFigure 1: See text for description. (Legend: ECA-MIN, ECA-MAX, BCM, BINGO, PCA.) \n\nRecall that PCA searches for the projection, Y_v, that has the largest sample variance. Interestingly, f(Y_v) is precisely the sample variance. Without the weighting term W_Y ψ^{-1}, ECA would find exactly the same vector that PCA does: the maximum variance projection vector. However, because of W_Y, the derivative of ECA does not act on all points of A and B equally. Pairs of points that are far apart are forced no further apart. Another way of interpreting ECA is as a type of robust variance maximization. Points that might best be interpreted as outliers, because they are very far from the body of other points, play a very small role in the minimization. This robust nature stands in contrast to PCA, which is very sensitive to outliers. \n\nFor densities that are Gaussian, the maximum entropy projection is the first principal component. In simulations ECA effectively finds the same projection as PCA, and it does so with speeds that are comparable to Oja's rule. ECA can be used both to find the entropy maximizing (ECA-MAX) and minimizing (ECA-MIN) axes. For more complex densities the PCA axis is very different from the entropy maximizing axis. To provide some intuition regarding the behavior of ECA we have run ECA-MAX, ECA-MIN, Oja's rule, and two related procedures, BCM and BINGO, on the same density. 
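The ECA update can be sketched directly from the derivative above (synthetic data, kernel width, step size, and iteration count are all illustrative, not from the paper). Gradient descent on h*(Y_v), renormalizing v after each step, recovers the low-entropy clustered axis:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two tight clusters along the vertical axis; broader spread horizontally.
n = 150
xs = np.vstack([rng.normal([0.0, -2.0], [1.0, 0.1], size=(n, 2)),
                rng.normal([0.0, +2.0], [1.0, 0.1], size=(n, 2))])
A, B = xs[::2], xs[1::2]          # the two samples of X
sigma, lr = 0.3, 0.02
v = np.array([0.5, np.sqrt(3.0) / 2.0])   # start 60 degrees from horizontal

def entropy_and_grad(v):
    ya, yb = A @ v, B @ v
    d = yb[:, None] - ya[None, :]                 # y_B - y_A
    g = np.exp(-d**2 / (2 * sigma**2))
    w = g / g.sum(axis=1, keepdims=True)          # W_Y(y_B, y_A)
    h = -np.mean(np.log(g.mean(axis=1) / np.sqrt(2 * np.pi * sigma**2)))
    # d/dv h* = (1/N_B) sum_B sum_A W_Y (y_B - y_A)/sigma^2 (x_B - x_A)
    diff_x = B[:, None, :] - A[None, :, :]
    grad = ((w * d / sigma**2)[:, :, None] * diff_x).sum(axis=1).mean(axis=0)
    return h, grad

h0, _ = entropy_and_grad(v)
for _ in range(300):               # ECA-MIN: descend the entropy estimate
    _, grad = entropy_and_grad(v)
    v = v - lr * grad
    v /= np.linalg.norm(v)
h1, _ = entropy_and_grad(v)

print(h0, h1, v)   # entropy drops; v aligns with the clustered vertical axis
```

Flipping the sign of the update gives ECA-MAX; the paper's version also adapts the kernel width ψ on-line via equation (7), which this sketch keeps fixed.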
BCM is a learning rule that was originally proposed to explain development of receptive field patterns in visual cortex (Bienenstock, Cooper and Munro, 1982). More recently it has been argued that the rule finds projections that are far from Gaussian (Intrator and Cooper, 1992). Under a limited set of conditions this is equivalent to finding the minimum entropy projection. BINGO was proposed to find axes along which there is a bimodal distribution (Schraudolph and Sejnowski, 1993). \n\nFigure 1 displays a 400 point sample and the projection axes discussed above. The density is a mixture of two clusters. Each cluster has high kurtosis in the horizontal direction. The oblique axis projects the data so that it is most uniform and hence has the highest entropy; ECA-MAX finds this axis. Along the vertical axis the data is clustered and has low entropy; ECA-MIN finds this axis. The vertical axis also has the highest variance. Contrary to published accounts, the first principal component can in fact correspond to the minimum entropy projection. BCM, while it may find minimum entropy projections for some densities, is attracted to the kurtosis along the horizontal axis. For this distribution BCM neither minimizes nor maximizes entropy. Finally, BINGO successfully discovers that the vertical axis is very bimodal. \n\nFigure 2: At left: A slice from an MRI scan of a head. Center: The scan after correction. Right: The density of pixel values in the MRI scan before and after correction. \n\n4 Applications \n\nEMMA has proven useful in a number of applications. In object recognition EMMA has been used to align 3D shape models with video images (Viola and Wells III, 1995). 
\nIn the area of medical imaging EMMA has been used to register data that arises from differing medical modalities such as magnetic resonance images, computed tomography images, and positron emission tomography (Wells, Viola and Kikinis, 1995). \n\n4.1 MRI Processing \n\nIn addition, EMMA can be used to process magnetic resonance images (MRI). An MRI is a 2 or 3 dimensional image that records the density of tissues inside the body. In the head, as in other parts of the body, there are a number of distinct tissue classes including: bone, water, white matter, grey matter, and fat. In principle the density of pixel values in an MRI should be clustered, with one cluster for each tissue class. In reality MRI signals are corrupted by a bias field, a multiplicative offset that varies slowly in space. The bias field results from unavoidable variations in magnetic field (see (Wells III et al., 1994) for an overview of this problem). \n\nBecause the densities of each tissue type cluster together tightly, an uncorrupted MRI should have relatively low entropy. Corruption from the bias field perturbs the MRI image, increasing the values of some pixels and decreasing others. The bias field acts like noise, adding entropy to the pixel density. We use EMMA to find a low-frequency correction field that, when applied to the image, makes the pixel density have a lower entropy. The resulting corrected image will have a tighter clustering than the original density. \n\nCall the uncorrupted scan s(x); it is a function of a spatial random variable x. The corrupted scan, c(x) = s(x) + b(x), is a sum of the true scan and the bias field. There are physical reasons to believe b(x) is a low order polynomial in the components of x. EMMA is used to minimize the entropy of the corrected signal, h(c(x) - b(x, v)), where b(x, v), a third order polynomial with coefficients v, is an estimate for the bias corruption. 
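The correction scheme can be sketched in one dimension (synthetic data, a linear rather than third-order bias model, and all constants are illustrative, not from the paper): minimize the EMMA entropy of c(x) - b(x, v) over the polynomial coefficients v.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic 1-D "scan": two tissue classes plus a slowly varying (linear) bias.
n = 400
t = rng.uniform(0.0, 1.0, n)                       # spatial coordinate
tissue = rng.choice([0.3, 0.7], n)                 # two tissue classes
true_bias = 0.4 * (t - 0.5)
c = tissue + rng.normal(0.0, 0.02, n) + true_bias  # corrupted scan c(x)

sigma = 0.05
v = np.zeros(2)                                    # bias model b(t, v) = v0 + v1*t
T = np.stack([np.ones(n), t], axis=1)              # polynomial features

def entropy_and_grad(v):
    r = c - T @ v                                  # corrected signal
    ra, rb = r[::2], r[1::2]                       # samples A and B
    d = rb[:, None] - ra[None, :]
    g = np.exp(-d**2 / (2 * sigma**2))
    w = g / g.sum(axis=1, keepdims=True)           # W_Y weights
    h = -np.mean(np.log(g.mean(axis=1) / np.sqrt(2 * np.pi * sigma**2)))
    # dr/dv_k = -t^k, so d/dv h* = -(1/N_B) sum sum W_Y d/sigma^2 (T_B - T_A)
    dT = T[1::2][:, None, :] - T[::2][None, :, :]
    grad = -((w * d / sigma**2)[:, :, None] * dT).sum(axis=1).mean(axis=0)
    return h, grad

h0, _ = entropy_and_grad(v)
for _ in range(400):
    _, grad = entropy_and_grad(v)
    v -= 0.002 * grad
h1, _ = entropy_and_grad(v)

print(h0, h1, v)   # entropy drops; v[1] approaches the true slope 0.4
```

Note that the constant coefficient v0 has zero gradient (a uniform offset cancels in every pairwise difference), so only the spatially varying part of the bias is identifiable, exactly as one would expect of an entropy criterion.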
\n\nFigure 2 shows an MRI scan and a histogram of pixel intensity before and after correction. The difference between the two scans is quite subtle: the uncorrected scan is brighter at top right and dimmer at bottom left. This non-homogeneity makes constructing automatic tissue classifiers difficult. In the histogram of the original scan white and grey matter tissue classes are confounded into a single peak ranging from about 0.4 to 0.6. The histogram of the corrected scan shows much better separation between these two classes. For images like this the correction field takes between 20 and 200 seconds to compute on a Sparc 10. \n\n5 Conclusion \n\nWe have demonstrated a novel entropy manipulation technique working on problems of significant complexity and practical importance. Because it is based on non-parametric density estimation it is quite flexible, requiring no strong assumptions about the nature of signals. The technique is widely applicable to problems in signal processing, vision and unsupervised learning. The resulting algorithms are computationally efficient. \n\nAcknowledgements \n\nThis research was supported by the Howard Hughes Medical Institute. \n\nReferences \n\nBecker, S. and Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161-163. \n\nBell, A. J. and Sejnowski, T. J. (1995). An information-maximisation approach to blind separation. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing, volume 7, Denver 1994. MIT Press, Cambridge. \n\nBienenstock, E., Cooper, L., and Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2. \n\nDuda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. 
Wiley, New York. \n\nIntrator, N. and Cooper, L. N. (1992). Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5:3-17. \n\nLinsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, pages 105-117. \n\nOja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15:267-273. \n\nSchraudolph, N. N. and Sejnowski, T. J. (1993). Unsupervised discrimination of clustered data via optimization of binary information gain. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing, volume 5, pages 499-506, Denver 1992. Morgan Kaufmann, San Mateo. \n\nViola, P. A. (1995). Alignment by Maximization of Mutual Information. PhD thesis, Massachusetts Institute of Technology. MIT AI Laboratory TR 1548. \n\nViola, P. A. and Wells III, W. M. (1995). Alignment by maximization of mutual information. In Fifth Intl. Conf. on Computer Vision, pages 16-23, Cambridge, MA. IEEE. \n\nWells, W., Viola, P., and Kikinis, R. (1995). Multi-modal volume registration by maximization of mutual information. In Proceedings of the Second International Symposium on Medical Robotics and Computer Assisted Surgery, pages 55-62. Wiley. \n\nWells III, W., Grimson, W., Kikinis, R., and Jolesz, F. (1994). Statistical Gain Correction and Segmentation of MRI Data. In Proceedings of the Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, Wash. IEEE, Submitted. \n\n\f", "award": [], "sourceid": 1040, "authors": [{"given_name": "Paul", "family_name": "Viola", "institution": null}, {"given_name": "Nicol", "family_name": "Schraudolph", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}