{"title": "Hierarchical Modeling of Local Image Features through $L_p$-Nested Symmetric Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 1696, "page_last": 1704, "abstract": "We introduce a new family of distributions, called $L_p${\\em -nested symmetric distributions}, whose densities access the data exclusively through a hierarchical cascade of $L_p$-norms. This class generalizes the family of spherically and $L_p$-spherically symmetric distributions which have recently been successfully used for natural image modeling. Similar to those distributions it allows for a nonlinear mechanism to reduce the dependencies between its variables. With suitable choices of the parameters and norms, this family also includes the Independent Subspace Analysis (ISA) model, which has been proposed as a means of deriving filters that mimic complex cells found in mammalian primary visual cortex. $L_p$-nested distributions are easy to estimate and allow us to explore the variety of models between ISA and the $L_p$-spherically symmetric models. Our main findings are that, without a preprocessing step of contrast gain control, the independent subspaces of ISA are in fact more dependent than the individual filter coefficients within a subspace and, with contrast gain control, where ISA finds more than one subspace, the filter responses were almost independent anyway.", "full_text": "Hierarchical Modeling of Local Image Features\n\nthrough Lp-Nested Symmetric Distributions\n\nMax Planck Institute for Biological Cybernetics\n\nFabian Sinz\n\nSpemannstra\u00dfe 41\n\n72076 T\u00a8ubingen, Germany\n\nfabee@tuebingen.mpg.de\n\nEero P. 
Simoncelli\n\nCenter for Neural Science, and Courant Institute of Mathematical Sciences, New York University\n\nNew York, NY 10003\n\neero.simoncelli@nyu.edu\n\nMatthias Bethge\n\nMax Planck Institute for Biological Cybernetics\n\nSpemannstra\u00dfe 41\n\n72076 T\u00fcbingen, Germany\n\nmbethge@tuebingen.mpg.de\n\nAbstract\n\nWe introduce a new family of distributions, called Lp-nested symmetric distributions, whose densities are expressed in terms of a hierarchical cascade of Lp-norms. This class generalizes the family of spherically and Lp-spherically symmetric distributions which have recently been successfully used for natural image modeling. Similar to those distributions, it allows for a nonlinear mechanism to reduce the dependencies between its variables. With suitable choices of the parameters and norms, this family includes the Independent Subspace Analysis (ISA) model as a special case, which has been proposed as a means of deriving filters that mimic complex cells found in mammalian primary visual cortex. Lp-nested distributions are relatively easy to estimate and allow us to explore the variety of models between ISA and the Lp-spherically symmetric models. By fitting the generalized Lp-nested model to 8 \u00d7 8 image patches, we show that the subspaces obtained from ISA are in fact more dependent than the individual filter coefficients within a subspace. When first applying contrast gain control as preprocessing, however, there are no dependencies left that could be exploited by ISA. This suggests that complex cell modeling can only be useful for redundancy reduction in larger image patches.\n\n1 Introduction\n\nFinding a precise statistical characterization of natural images is an endeavor that has occupied researchers for more than fifty years now and is still an open problem. 
A thorough understanding of natural image statistics is desirable from an engineering as well as a biological point of view. It forms the basis not only for the design of more advanced image processing algorithms and compression schemes, but also for a better comprehension of the operations performed by the early visual system and how they relate to the properties of the natural stimuli that are driving it. From both perspectives, redundancy reducing algorithms such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), Independent Subspace Analysis (ISA) and Radial Factorization [11; 21] have received considerable interest since they yield image representations that are favorable for compression and image processing and at the same time resemble properties of the early visual system. In particular, ICA and ISA yield localized, oriented bandpass filters which are reminiscent of receptive fields of simple and complex cells in primary visual cortex [4; 16; 10]. Together with the Redundancy Reduction Hypothesis by Barlow and Attneave [3; 1], those observations have given rise to the idea that these filters represent an important aspect of natural images which is exploited by the early visual system.\nSeveral results, however, show that the density model of ICA is too restricted to provide a good model for natural image patches. Firstly, several authors have demonstrated that filter responses of ICA filters on natural images are not statistically independent [20; 23; 6]. Secondly, after whitening, the optimum of ICA in terms of statistical independence is very shallow or, in other words, all whitening filters yield almost the same redundancy reduction [5; 2]. 
A possible explanation for that finding is that, after whitening, densities of local image features are approximately spherical [24; 23; 12; 6]. This implies that those densities cannot be made independent by ICA because (i) all whitening filters differ only by an orthogonal transformation, (ii) spherical densities are invariant under orthogonal transformations, and (iii) the only spherical and factorial distribution is the Gaussian. Once local image features become more distant from each other, the contour lines of the density deviate from spherical and become more star-shaped. In order to capture these star-shaped contour lines one can use the more general Lp-spherically symmetric distributions which are characterized by densities of the form \u03c1(y) = g(||y||_p) with ||y||_p = (sum_i |y_i|^p)^(1/p) and p > 0 [9; 10; 21].\n\n[Figure 1; panel annotations: p = 0.8, p = 2 (left) and p = 0.8, p = 1.5 (right)]\n\nFigure 1: Scatter plots and marginal histograms of neighboring (left) and distant (right) symmetric whitening filters which are shown at the top. The dashed contours indicate the unit sphere for the optimal p of the best fitting non-factorial (dashed line) and factorial (solid line) Lp-spherically symmetric distribution, respectively. While close filters exhibit p = 2 (spherically symmetric distribution), the value of p decreases for more distant filters.\n\nAs illustrated in Figure 1, the relationship between local bandpass filter responses undergoes a gradual transition from L2-spherical for nearby to star-shaped (Lp-spherical with p < 2) for more distant features [12; 21]. Ultimately, we would expect extremely distant features to become independent, having a factorial density with p \u2248 0.8. 
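The Lp-norm at the heart of these densities is straightforward to compute; the following is a minimal sketch (the function name `lp_norm` is ours, not from the paper):

```python
# Minimal sketch: the Lp-(quasi-)norm that parameterizes Lp-spherically
# symmetric densities rho(y) = g(||y||_p). For p >= 1 this is a proper norm;
# for 0 < p < 1 (e.g. the p = 0.8 above) it is only a quasi-norm, which is
# what produces the star-shaped contours described in the text.

def lp_norm(y, p):
    """||y||_p = (sum_i |y_i|^p)^(1/p) for p > 0."""
    return sum(abs(yi) ** p for yi in y) ** (1.0 / p)
```

For p = 2 the unit sphere of this function is the Euclidean sphere; for p < 1 it curves inward toward the coordinate axes, matching the star-shaped iso-probability contours of Figure 1.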
When using a single Lp-spherically symmetric model for the joint distribution of nearby and more distant features, a single value of p can only represent a compromise for the whole variety of iso-probability contours. This raises the question whether a combination of local spherical models, as opposed to a single Lp-spherical model, yields a better characterization of the statistics of natural image patches. Possible ways to join several local models are Independent Subspace Analysis (ISA) [10], which uses a factorial combination of locally Lp-spherical densities, or Markov Random Fields (MRFs) [18; 13]. Since MRFs have the drawback of being implicit density models and computationally very expensive for inference, we will focus on ISA and our model. In principle, ISA could choose its subspaces such that nearby features are grouped into a joint subspace which can then be well described by a spherically symmetric model (p = 2) while more distant pixels, living in different subspaces, are assumed to be independent. In fact, previous studies have found ISA to perform better than ICA for image patches as small as 8 \u00d7 8 and to yield an optimal p \u2248 2 for the local density models [10]. On the other hand, the ISA model assumes a binary partition into either an Lp-spherical or a factorial distribution, which does not seem to be fully justified considering the gradual transition described above.\nHere, we propose a new family of hierarchical models by replacing the Lp-norms in the Lp-spherical models by Lp-nested functions, which consist of a cascade of nested Lp-norms and therefore allow for different values of p for different groups of filters. While this family includes the Lp-spherical family and ISA models, it also includes densities that avoid the hard partition into either factorial or Lp-spherical. 
At the same time, parameter estimation for these models can still be as efficient and robust as for Lp-spherically symmetric models. We find that this family (i) fits the data significantly better than ISA and (ii) generates interesting filters which are grouped in a sensible way within the hierarchy. We also find that, although the difference in performance between Lp-spherical and Lp-nested models is significant, it is small on 8 \u00d7 8 patches, suggesting that within this limited spatial range, the iso-probability contours of the joint density can still be reasonably approximated by a single Lp-norm. Preliminary results on 16 \u00d7 16 patches exhibit a more pronounced difference between the Lp-nested and the Lp-spherically symmetric distribution, suggesting that the change in p becomes more important for modelling densities over a larger spatial range.\n\n2 Models\n\nLp-Nested Symmetric Distributions Consider the function\n\nf(y) = [ (sum_{i=1}^{n_1} |y_i|^{p_1})^{p_\u2205/p_1} + ... + (sum_{i=n_1+...+n_{\u2113-1}+1}^{n} |y_i|^{p_\u2113})^{p_\u2205/p_\u2113} ]^{1/p_\u2205} = || (||y_{1:n_1}||_{p_1}, ..., ||y_{n-n_\u2113+1:n}||_{p_\u2113})^T ||_{p_\u2205}.    (1)\n\nWe call this type of function Lp-nested and the resulting class of distributions Lp-nested symmetric. Lp-nested symmetric distributions are a special case of the \u03bd-spherical distributions, which have a density characterized by the form \u03c1(y) = g(\u03bd(y)) where \u03bd : R^n \u2192 R is a positively homogeneous function of degree one, i.e. it fulfills \u03bd(ay) = a\u03bd(y) for any a \u2208 R_+ and y \u2208 R^n [7]. Lp-nested functions are obviously positively homogeneous. Of course, Lp-nested functions of Lp-nested functions are again Lp-nested. 
Therefore, an Lp-nested function f in its general form can be visualized by a tree in which each inner node corresponds to an Lp-norm while the leaves stand for the coefficients of the vector y.\nBecause of the positive homogeneity it is possible to normalize a vector y with respect to \u03bd and obtain a coordinate representation x = r \u00b7 u where r = \u03bd(y) and u = y/\u03bd(y). This implies that the random variable Y has the stochastic representation Y = R \u00b7 U (equality in distribution) with independent U and R [7], which makes it a generalization of the Gaussian Scale Mixture model [23]. It can be shown that for a given \u03bd, U always has the same distribution, while the distribution \u03f1(r) of R determines the specific \u03c1(y) [7]. For a general \u03bd, it is difficult to determine the distribution of U since the partition function involves the surface area of the \u03bd-unit sphere, which is not analytically tractable in most cases. Here, we show that Lp-nested functions allow for an analytical expression of the partition function. Therefore, the corresponding distributions constitute a flexible yet tractable subclass of \u03bd-spherical distributions.\nIn the remaining paper we adopt the following notational convention: We use multi-indices to index single nodes of the tree. This means that I = \u2205 denotes the root node, I = (\u2205, i) = i denotes its ith child, I = (i, j) the jth child of i, and so on. The function values at individual inner nodes I are denoted by f_I, the vector of function values of the children of an inner node I by f_{I,1:\u2113_I} = (f_{I,1}, ..., f_{I,\u2113_I})^T. By definition, parents and children are related via f_I = ||f_{I,1:\u2113_I}||_{p_I}. The number of children of a particular node I is denoted by \u2113_I.\nLp-nested symmetric distributions are a very general class of densities. 
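The tree picture above translates directly into a recursive evaluation rule. A minimal sketch (our own illustration, not the authors' code), in which a leaf is an index into y and an inner node carries its exponent p and its children:

```python
# Sketch: evaluating an Lp-nested function f(y) given its tree. A node is
# either an int (leaf: index into y) or a pair (p, children).

def lp_nested(node, y):
    if isinstance(node, int):
        return abs(y[node])
    p, children = node
    return sum(lp_nested(child, y) ** p for child in children) ** (1.0 / p)

# Example tree: f(y) = || ( ||(y_0, y_1)||_2 , |y_2| ) ||_0.8
tree = (0.8, [(2.0, [0, 1]), 2])
```

Since each inner node is an Lp-norm of its children's values, positive homogeneity f(ay) = a f(y) holds for the whole tree, as required of a nu-spherical density.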
For instance, since every Lp-norm || \u00b7 ||_p is an Lp-nested function, the Lp-nested distributions include the family of Lp-spherically symmetric distributions and, for p = 2, the family of spherically symmetric distributions. When e.g. setting f = || \u00b7 ||_2 and choosing the radial distribution \u03f1 appropriately, one can recover the Gaussian \u03c1(y) = Z^{-1} exp(-||y||_2^2) or the generalized spherical Gaussian \u03c1(y) = Z^{-1} exp(-||y||_2^p), respectively. On the other hand, when choosing the Lp-nested function f as in equation (1) and \u03f1 to be the radial distribution of a p-generalized Normal distribution \u03f1(r) = Z^{-1} r^{n-1} exp(-r^{p_\u2205}/s) [8; 22], the inner nodes f_{1:\u2113_\u2205} become independent and we can recover an ISA model. Note, however, that not all ISA models are also Lp-nested since Lp-nested symmetry requires the radial distribution to be that of a p-generalized Normal.\nIn general, for a given radial distribution \u03f1 on the Lp-nested radius f(y), an Lp-nested symmetric distribution has the form\n\n\u03c1(y) = \u03f1(f(y)) / S_f(f(y)) = \u03f1(f(y)) / (S_f(1) \u00b7 f^{n-1}(y)),    (2)\n\nwhere S_f(f(y)) = S_f(1) \u00b7 f^{n-1}(y) is the surface area of the Lp-nested sphere with the radius f(y). This means that the partition function of a general Lp-nested symmetric distribution is the partition function of the radial distribution normalized by the surface area of the Lp-nested sphere with radius f(y). 
For a given f and a radius f_\u2205 = f(y), this surface area is given by the equation\n\nS_f(f_\u2205) = f_\u2205^{n-1} 2^n prod_{I \u2208 I} [ (1/p_I^{\u2113_I-1}) prod_{k=1}^{\u2113_I-1} B[ (sum_{i=1}^{k} n_{I,i})/p_I , n_{I,k+1}/p_I ] ] = f_\u2205^{n-1} 2^n prod_{I \u2208 I} [ prod_{k=1}^{\u2113_I} \u0393[n_{I,k}/p_I] ] / [ p_I^{\u2113_I-1} \u0393[n_I/p_I] ],\n\nwhere I denotes the set of all multi-indices of inner nodes, n_I the number of leaves of the subtree under I and B[a, b] the beta function. Therefore, if the partition function of the radial distribution can be computed easily, so can the partition function of the multivariate Lp-nested distribution.\nSince the only part of equation (2) that includes free parameters is the radial distribution \u03f1, maximum likelihood estimation of those parameters \u03d1 can be carried out on the univariate distribution \u03f1 only, because\n\nargmax_\u03d1 log \u03c1(y|\u03d1) = argmax_\u03d1 (-log S_f(f(y)) + log \u03f1(f(y)|\u03d1)) = argmax_\u03d1 log \u03f1(f(y)|\u03d1),\n\nwhere the first equality uses equation (2). This means that parameter estimation can be done efficiently and robustly on the values of the Lp-nested function.\nSince, for a given f, an Lp-nested distribution is fully specified by a radial distribution, changing the radial distribution also changes the Lp-nested distribution. This suggests an image decomposition constructed from a cascade of nonlinear, gain-control-like mappings reducing the dependence between the filter coefficients. Similar to Radial Gaussianization or Lp-Radial Factorization algorithms [12; 21], the radial distribution \u03f1_\u2205 of the root node is mapped into the radial distribution of a p-generalized Normal via histogram equalization, thereby making its children exponential power distributed and statistically independent [22]. 
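The surface-area formula can be checked numerically. The following sketch (our notation; `S_f` and `surface_area` are hypothetical helper names) computes S_f(1) for a tree in the representation used above and reproduces the classical L2 results:

```python
# Sketch: S_f(1), the surface area of the Lp-nested unit sphere, via the
# product of beta functions over all inner nodes given in the text.
from math import gamma

def beta(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

def surface_area(node):
    """Return (product of beta-function factors, number of leaves)."""
    if isinstance(node, int):          # leaf
        return 1.0, 1
    p, children = node
    s, counts = 1.0, []
    for child in children:
        sc, nc = surface_area(child)
        s *= sc
        counts.append(nc)
    for k in range(1, len(children)):  # k = 1, ..., l_I - 1
        s *= beta(sum(counts[:k]) / p, counts[k] / p)
    s /= p ** (len(children) - 1)
    return s, sum(counts)

def S_f(tree):
    s, n = surface_area(tree)
    return 2 ** n * s
```

For a flat L2 tree this recovers the circumference 2*pi of the unit circle (n = 2) and the area 4*pi of the unit 2-sphere (n = 3); an L2-within-L2 tree on three leaves, which equals the plain L2-norm, gives 4*pi as well, a useful consistency check.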
This procedure is then repeated recursively for each of the children until the leaves of the tree are reached.\nBelow, we estimate the multi-information (MI) between the filters or subtrees at different levels of the hierarchy. In order to do that robustly, we need to know the joint distribution of their values. In particular, we are interested in the joint distribution of the children f_{I,1:\u2113_I} of a node I (e.g. layer 2 in Figure 2). Just from the form of an Lp-nested function one might guess that those children are Lp-spherically symmetric distributed. However, this is not the case. For example, the children f_{1:\u2113_\u2205} of the root node (assuming that none of them is a leaf) follow the distribution\n\n\u03c1(f_{1:\u2113_\u2205}) = \u03f1_\u2205(||f_{1:\u2113_\u2205}||_{p_\u2205}) / S_{||\u00b7||_{p_\u2205}}(||f_{1:\u2113_\u2205}||_{p_\u2205}) \u00b7 prod_{i=1}^{\u2113_\u2205} f_i^{n_i-1}.    (3)\n\nThis implies that f_{1:\u2113_\u2205} can be represented as a product of two independent random variables u = f_{1:\u2113_\u2205}/||f_{1:\u2113_\u2205}||_{p_\u2205} \u2208 R_+^{\u2113_\u2205} and r = ||f_{1:\u2113_\u2205}||_{p_\u2205} \u2208 R_+ with r \u223c \u03f1_\u2205 and (u_1^{p_\u2205}, ..., u_{\u2113_\u2205}^{p_\u2205}) \u223c Dir[n_1/p_\u2205, ..., n_{\u2113_\u2205}/p_\u2205] following a Dirichlet distribution (see Additional Material). We call this distribution a Dirichlet Scale Mixture (DSM). A similar form can be shown for the joint distribution of leaves and inner nodes (summarizing the whole subtree below them). Unfortunately, only the children f_{1:\u2113_\u2205} of the root node are really DSM distributed. We were not able to analytically calculate the marginal distribution of an arbitrary node's children f_{I,1:\u2113_I}, but we suspect it to have a similar form. 
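The stochastic representation of a DSM suggests a simple sampler. In the sketch below the gamma radial distribution is purely an illustrative assumption (the paper estimates the radial part from data), and the Dirichlet variate is drawn via normalized gamma variates:

```python
# Sketch: sampling f = r * u from a Dirichlet Scale Mixture with
# u^p ~ Dir[n_1/p, ..., n_l/p]. The gamma radial distribution is an assumed
# placeholder for the fitted radial distribution.
import random

def sample_dsm(n, p, k=2.0, theta=1.0):
    g = [random.gammavariate(ni / p, 1.0) for ni in n]
    z = sum(g)
    u = [(gi / z) ** (1.0 / p) for gi in g]   # ensures sum_i u_i^p = 1
    r = random.gammavariate(k, theta)         # assumed radial distribution
    return [r * ui for ui in u]
```

By construction the normalized sample lies on the positive part of the Lp-sphere, so the radius and direction can be fitted independently, as described next.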
For that reason we fit DSMs to those children f_{I,1:\u2113_I} in the experiments below and use the estimated model to assess the dependencies between them. We also use it for measuring the dependencies between the subspaces of ISA.\nFitting DSMs via maximum likelihood can be carried out similarly to estimating Lp-nested distributions: Since the variables u and r are independent, the Dirichlet and the radial distribution can be estimated independently on the normalized data points {u_i}_{i=1}^{m} and their respective norms {r_i}_{i=1}^{m}.\n\nLp-Spherically Symmetric Distributions and Independent Subspace Analysis The family of Lp-spherically symmetric distributions is a special case of Lp-nested distributions for which f(y) = ||y||_p [9]. We use the ISA model by [10] where the filter responses y are modelled by a factorial combination of Lp-spherically symmetric distributions sitting on each subspace\n\n\u03c1(y) = prod_{k=1}^{K} \u03c1_k(||y_{I_k}||_{p_k}).\n\n3 Experiments\n\nGiven an image patch x, all models used in this paper define densities over filter responses y = W x of linear filters. This means that all models have the form \u03c1(x) = |det W| \u00b7 \u03c1(W x). The (n-1) \u00d7 n matrix W has the form W = QSP where P \u2208 R^{(n-1)\u00d7n} has mutually orthogonal rows and projects onto the orthogonal complement of the DC-filter (filter with equal coefficients), S \u2208 R^{(n-1)\u00d7(n-1)} is a whitening matrix and Q \u2208 SO_{n-1} is an orthogonal matrix determining the final filter shapes of W. When we speak of optimizing the filters according to a model, we mean optimizing Q over SO_{n-1}. The reason for projecting out the DC component is that it can behave quite differently depending on the dataset. Therefore, it is usually removed and modelled separately. 
Since the DC component is the same for all models and would only add a constant offset to the measures we use in our experiments, we ignore it in the experiments below.\nData We use ten pairs of independently sampled training and test sets of 8 \u00d7 8 (16 \u00d7 16) patches from the van Hateren dataset, each containing 100,000 (500,000) examples. Hyv\u00e4rinen and K\u00f6ster [10] report that ISA already finds several subspaces for 8 \u00d7 8 image patches. We perform all experiments with two different types of preprocessing: either we only whiten the data (WO-data), or we whiten it and apply an additional contrast gain control step (CGC-data), for which we use the radial factorization method described in [12; 21] with p = 2 in the symmetric whitening basis.\nWe use the same whitening procedure as in [21; 6]: Each dataset is centered on the mean over examples and dimensions and rescaled such that whitening becomes volume conserving. Similarly, we use the same orthogonal matrix to project out the DC-component of each patch (matrix P above). On the remaining n-1 dimensions, we perform symmetric whitening (SYM) with S = C^{-1/2}, where C denotes the covariance matrix of the DC-corrected data C = cov[P X].\nEvaluation Measures We use the Average Log Loss per component (ALL) for assessing the quality of the different models, which we estimate by taking the empirical average over a large ensemble of test points: ALL = -(1/(n-1)) \u27e8log \u03c1(y)\u27e9 \u2248 -(1/(m(n-1))) sum_{i=1}^{m} log \u03c1(y_i). The ALL equals the entropy if the model distribution equals the true distribution and is larger otherwise. For the CGC-data, we adjust the ALL by the log-determinant of the CGC transformation [11]. 
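The symmetric whitening step S = C^{-1/2} can be sketched as follows; to keep the example self-contained we restrict it to 2 x 2 covariance matrices (in practice one would use an eigendecomposition of the full covariance matrix):

```python
# Sketch of symmetric whitening S = C^{-1/2} for a symmetric positive
# definite 2x2 covariance C, via an explicit eigendecomposition.
from math import sqrt, atan2, cos, sin

def sym_whiten_2d(C):
    a, b, c = C[0][0], C[0][1], C[1][1]
    theta = 0.5 * atan2(2 * b, a - c)                  # rotation onto the eigenbasis
    ct, st = cos(theta), sin(theta)
    l1 = a * ct * ct + 2 * b * ct * st + c * st * st   # eigenvalues of C
    l2 = a * st * st - 2 * b * ct * st + c * ct * ct
    d1, d2 = 1 / sqrt(l1), 1 / sqrt(l2)
    # S = R diag(l1^{-1/2}, l2^{-1/2}) R^T
    return [[ct * ct * d1 + st * st * d2, ct * st * (d1 - d2)],
            [ct * st * (d1 - d2), st * st * d1 + ct * ct * d2]]
```

Applying S to zero-mean data with covariance C yields unit covariance; since S is itself symmetric, this is the symmetric (SYM) whitening referred to in the text.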
In contrast to [10], this allows us to quantitatively compare models across the two different types of preprocessing (WO and CGC).\nIn order to measure the dependence between different random variables, we use the multi-information per component (MI), MI = (1/(n-1)) (sum_{i=1}^{d} H[Y_i] - H[Y]), which is the difference between the sum of the marginal entropies and the joint entropy. The MI is a positive quantity which is zero if and only if the joint distribution is factorial. We estimate the marginal entropies by a jackknifed MLE entropy estimator [17] (corrected for the log of the bin width in order to estimate the differential entropy) where we adjust the bin width of the histograms as suggested by Scott [19]. Instead of the joint entropy, we use the ALL of an appropriate model distribution. Since the ALL is theoretically always larger than the true joint entropy (ignoring estimation errors), using the ALL instead of the joint entropy should underestimate the true MI, which is still sufficient for our purpose.\nParameter Estimation For all models (ISA, DSM, Lp-spherical and Lp-nested), we estimate the parameters \u03d1 of the radial distribution as described above in Section 2. For a given filter matrix W, the values of the exponents p are estimated by minimizing the ALL, evaluated at the ML estimates of \u03d1, over p = (p_1, ..., p_q)^T. For the Lp-nested distributions, we use the Nelder-Mead method [15] for the optimization over p and for the Lp-spherically symmetric distributions we use Golden Search over the single p. For the ISA model, we carry out a Golden Search over p for each subspace independently. For the Lp-spherical and the single models on the ISA subspaces, we use a search range of p \u2208 [0.1, 2.1]. 
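The histogram-based marginal entropy estimate used for the MI above can be sketched as below (the jackknife bias correction of [17] is omitted here for brevity; Scott's rule sets the bin width):

```python
# Sketch: plug-in (MLE) histogram entropy plus log bin width, which estimates
# the differential entropy in nats; bin width chosen by Scott's rule.
from math import log, sqrt

def differential_entropy(samples):
    m = len(samples)
    mean = sum(samples) / m
    std = sqrt(sum((x - mean) ** 2 for x in samples) / (m - 1))
    width = 3.49 * std * m ** (-1.0 / 3.0)   # Scott's rule
    lo = min(samples)
    counts = {}
    for x in samples:
        idx = int((x - lo) / width)
        counts[idx] = counts.get(idx, 0) + 1
    h_discrete = -sum(c / m * log(c / m) for c in counts.values())
    return h_discrete + log(width)           # correct by log of the bin width
```

For a uniform distribution on [0, 1], whose differential entropy is 0, the estimate is close to zero for large samples.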
For estimating the Dirichlet Scale Mixtures, we use the fastfit package by Tom Minka to estimate the parameters of the Dirichlet distribution. The radial distribution is estimated independently as described above.\nWhen fitting the filters W to the different models (ISA, Lp-spherical and Lp-nested), we use a gradient ascent on the log-likelihood over the orthogonal group by alternating between optimizing the parameters p and \u03d1 and optimizing for W. For the gradient ascent, we compute the standard Euclidean gradient with respect to W \u2208 R^{(n-1)\u00d7(n-1)} and project it back onto the tangent space of SO_{n-1}. Using the gradient \u2207W obtained in that manner, we perform a line search with respect to t using the backprojections of W + t \u00b7 \u2207W onto SO_{n-1}. This method is a simplified version of the one proposed by [14].\nExperiments with Independent Subspace Analysis and Lp-Spherically Symmetric Distributions We optimized filters for ISA models with K = 2, 4, 8, 16 subspaces comprising 32, 16, 8, 4 components (one subspace always had one dimension less due to the removal of the DC component), and for an Lp-spherically symmetric model. When optimizing for W, we use a radial \u0393-distribution for the Lp-spherically symmetric models and a radial \u03c7_p distribution (i.e. ||y_{I_k}||_{p_k}^{p_k} is \u0393-distributed) for the models on the single subspaces of ISA, which is closer to the one used by [10]. After optimization, we make a final optimization for p and \u03d1 using a mixture of log-normal distributions (log N) with K = 6 mixture components on the radial distribution(s).\nLp-Nested Symmetric Distributions As for the Lp-spherically symmetric models, we use a radial \u0393-distribution for the optimization of W and a mixture of log N distributions for the final fit. We use two different kinds of tree structures for our experiments with Lp-nested symmetric distributions. 
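The manifold step in this gradient ascent can be sketched as follows (pure Python, and our own simplification: the tangent-space projection keeps only the skew-symmetric part of W^T G, and the backprojection onto the orthogonal group is done by Gram-Schmidt rather than the method of [14]):

```python
# Sketch: two ingredients of gradient ascent over SO(n-1): projecting a
# Euclidean gradient G onto the tangent space at an orthogonal W, and
# re-orthogonalizing after a step (Gram-Schmidt stands in for the
# backprojection of [14]).

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def tangent_project(W, G):
    """Keep only the skew-symmetric part of W^T G (a tangent direction at W)."""
    A = matmul(transpose(W), G)
    n = len(A)
    skew = [[(A[i][j] - A[j][i]) / 2 for j in range(n)] for i in range(n)]
    return matmul(W, skew)

def backproject(M):
    """Re-orthogonalize the rows of M by Gram-Schmidt."""
    Q = []
    for row in M:
        v = list(row)
        for q in Q:
            d = sum(x * y for x, y in zip(v, q))
            v = [x - d * y for x, y in zip(v, q)]
        nrm = sum(x * x for x in v) ** 0.5
        Q.append([x / nrm for x in v])
    return Q
```

A line search over t then evaluates the log-likelihood at backproject(W + t * tangent_project(W, G)), mirroring the procedure described in the text.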
In the deep tree (DT) structure we first group 2\u00d72 blocks of four neighboring SYM filters. Afterwards, we group those blocks again in a quadtree manner until we reach the root node (see Figure 2A). The second tree structure (PNDk) was motivated by ISA. Here, we simply group the filters within each subspace and join them at the root node afterwards (see Figure 2B). In order to speed up parameter estimation, each layer of the tree shares the same value of p.\nMulti-Information Measurements For the ISA models, we estimated the MI between the filter responses within each subspace and between the Lp-radii ||y_{I_k}||_{p_k}, 1 \u2264 k \u2264 K. In the former case we used the ALL of an Lp-spherically symmetric distribution with especially optimized p and \u03d1, in the latter a DSM with optimized radial and Dirichlet distribution as a surrogate for the joint entropy. For the Lp-nested distribution, we estimate the MI between the children f_{I,1:\u2113_I} of all inner nodes I. In case the children are leaves, we use the ALL of an Lp-spherically symmetric distribution as surrogate for the joint entropy; in case the children are inner nodes themselves, we use the ALL of a DSM. The red arrows in Figure 2A depict examples of the entities between which the MI was estimated.\n\n4 Results and Discussion\n\nFigure 2 shows the optimized filters for the DT and the PND16 tree structure (we included the filters optimized on the first of ten datasets for all tree structures in the Additional Material). For both tree structures, the filters on the lowest level are grouped according to spatial frequency and orientation, whereas the variation in orientation is larger for the PND16 tree structure and some filters are unoriented. 
The next layer of inner nodes, which is only present in the DT tree structure, roughly joins spatial location, although each of those inner nodes has one child whose leaves are global filters.\nWhen looking at the various values of p at the inner nodes, we can observe that nodes which are higher up in the tree usually exhibit a smaller value of p. Surprisingly, as can be seen in Figure 3B and C, a smaller value of p does not correspond to a larger independence between the subtrees, which are even more correlated because almost every subtree contains global filters. The small value of p is caused by the fact that the DSM (the distribution of the subtree values) has to account for this correlation, which it can only do by decreasing the value of p (see Figure 3 and the DSM in the Additional Material). Note that this finding is exactly opposite to the assumptions in the ISA model, which can usually not generate such a behavior (Figure 3A) as it models the two subtrees to be independent. This is likely to be one reason for the higher ALL of the ISA models (see Table 1).\n\nFigure 2: Examples for the tree structures of Lp-nested distributions used in the experiments: (A) shows the DT structure with the corresponding optimized values. The red arrows display examples of groups of filters or inner nodes, respectively, for which we estimated the MI. (B) shows the PND16 tree structure with the corresponding values of p at the inner nodes and the optimized filters.\n\nFigure 3: Independence of subspaces for WO-data not justified: (A) subspace radii sampled from ISA2, (B) subspace radii of natural image patches in the ISA2 basis, (C) subtree values of the PND2 in the PND2 basis, and (D) samples from the PND2 model. While the ISA2 model spreads out the radii almost over the whole positive quadrant due to the independence assumption, the samples from the Lp-nested subtrees are more concentrated around the diagonal like the true data. 
The Lp-nested model can achieve this behavior since (i) it does not assume a radial distribution that leads to independent radii on the subtrees and (ii) the subtree values f_1 and f_2 are DSM[n_1/p_\u2205, n_2/p_\u2205] distributed. By changing the value of p_\u2205, the DSM model can put more mass towards the diagonal, which produces the \u201cbeam-like\u201d behavior shown in the plot.\n\nTable 1 shows the ALL and the MI measurements for all models. Except for the ISA models on WO-data, all performances are similar, whereas the Lp-nested models usually achieve the lowest ALL independent of the particular tree structure used. For the WO-data, the Lp-spherical and the ISA2 model come close to the performance of the Lp-nested models. For the other ISA models on WO-data, the ALL gets worse with increasing number of subspaces (bold font numbers in Table 1). This reflects the effect described above: Contrary to the assumptions of the ISA model, the responses of the different subspaces become in fact more correlated than the single filter responses. This can also be seen in the MI measurements discussed below.\nWhen looking at the ALL for CGC data, on the other hand, ISA suddenly becomes competitive. This importance of CGC for ISA has already been noted in [10]. 
The small differences between all the models in the CGC case show that the contour change of the joint density for 8 \u00d7 8 patches is too small to allow for a large advantage of the Lp-nested model, because contrast gain control (CGC) directly corresponds to modeling the distribution with an Lp-spherically symmetric distribution [21].\n\n[Figure 2 graphics; annotated values: p1 = 0.77071, p2 = 0.8438, p3 = 2.276 (DT) and p1 = 0.8413, p2 = 1.693 (PND16)]\n\nPreliminary results on 16 \u00d7 16 data (1.39 \u00b1 0.003 for the Lp-nested and 1.45 \u00b1 0.003 for the Lp-spherical model on WO-data), however, show a more pronounced improvement for the Lp-nested model, indicating that a single p does not suffice anymore to capture all dependencies when going to larger patch sizes.\nWhen looking at the MI measurements between the filters/subtrees at different levels of the hierarchy in the Lp-nested, Lp-spherically symmetric and ISA models, we can observe that for the WO-data, the MI actually increases when going from lower to higher layers. This means that the MI between the direct filter responses (layer 3 for DT and layer 2 for all others) is in fact lower than the MI between the subspace radii or the inner nodes of the Lp-nested tree (layer 1-2 for DT, layer 1 for all others). The highest MI is achieved between the children of the root node for the DT tree structure (DT layer 1). As explained above, this observation contradicts the assumptions of the ISA model and probably causes its worse performance on the WO-data.\nFor the CGC-data, on the other hand, the MI has been substantially decreased by CGC over all levels of the hierarchy. 
Furthermore, the single filter responses inside a particular subspace or subtree are now more dependent than the subtrees or subspaces themselves. This suggests that the competitive performance of ISA is not due to the model itself but to the fact that CGC already made the data largely independent. To double-check this result, we fitted an ICA model to the CGC-data [21] and found an ALL of 1.41 ± 0.004, which is very close to the performance of ISA and the Lp-nested distributions (this would not be the case for WO-data [21]).

Taken together, the ALL and MI measurements suggest that ISA is not the best way to join multiple local models into a single joint model. The basic assumption of the ISA model for natural images is that filter coefficients can either be dependent within a subspace or must be independent between different subspaces. However, the increasing ALL for an increasing number of subspaces and the fact that the MI between subspaces is actually higher than within the subspaces demonstrate that this hard partition is not justified when the data is only whitened.

Lp-nested models:

Model           | Deep Tree     | PND2          | PND4          | PND8          | PND16
ALL             | 1.39 ± 0.004  | 1.39 ± 0.004  | 1.39 ± 0.004  | 1.40 ± 0.004  | 1.39 ± 0.004
ALL CGC         | 1.39 ± 0.005  | 1.40 ± 0.004  | 1.40 ± 0.005  | 1.40 ± 0.004  | 1.39 ± 0.004
MI Layer 1      | 0.84 ± 0.019  | 0.48 ± 0.008  | 0.70 ± 0.002  | 0.75 ± 0.003  | 0.61 ± 0.0036
MI Layer 1 CGC  | 0.00 ± 0.004  | 0.10 ± 0.002  | 0.02 ± 0.003  | 0.00 ± 0.009  | 0.00 ± 0.01
MI Layer 2      | 0.42 ± 0.021  | 0.35 ± 0.017  | 0.33 ± 0.017  | 0.28 ± 0.019  | 0.25 ± 0.025
MI Layer 2 CGC  | 0.002 ± 0.005 | 0.01 ± 0.0008 | 0.01 ± 0.004  | 0.01 ± 0.006  | 0.02 ± 0.008
MI Layer 3      | 0.28 ± 0.036  | -             | -             | -             | -
MI Layer 3 CGC  | 0.04 ± 0.005  | -             | -             | -             | -

Lp-spherical and ISA models:

Model           | Lp-spherical  | ISA2          | ISA4          | ISA8          | ISA16
ALL             | 1.41 ± 0.004  | 1.40 ± 0.005  | 1.43 ± 0.006  | 1.46 ± 0.006  | 1.55 ± 0.006
ALL CGC         | 1.41 ± 0.004  | 1.41 ± 0.008  | 1.39 ± 0.007  | 1.40 ± 0.005  | 1.41 ± 0.007
MI Layer 1      | 0.34 ± 0.004  | 0.47 ± 0.01   | 0.69 ± 0.012  | 0.70 ± 0.018  | 0.63 ± 0.0039
MI Layer 1 CGC  | 0.00 ± 0.005  | 0.00 ± 0.09   | 0.00 ± 0.06   | 0.00 ± 0.04   | 0.00 ± 0.02
MI Layer 2      | -             | 0.36 ± 0.017  | 0.33 ± 0.019  | 0.31 ± 0.032  | 0.24 ± 0.024
MI Layer 2 CGC  | -             | 0.004 ± 0.003 | 0.03 ± 0.012  | 0.02 ± 0.018  | 0.0006 ± 0.013

Table 1: ALL and MI for all models. The upper part shows the results for the Lp-nested models; the lower part shows the results for the Lp-spherical and ISA models. The ALL for the Lp-nested models is almost equal for all tree structures and slightly lower than for the Lp-spherical and ISA models. For the whitened-only data, the ALL increases significantly with the number of subspaces. For the CGC data, most models perform similarly well. Looking at the MI, higher layers are in fact more dependent than lower ones for whitened-only data. For CGC data, the MI has dropped substantially over all layers due to CGC; in that case, the lower layers are more independent.

In summary, our results show that Lp-nested symmetric distributions yield good performance on natural image patches, although the advantage over Lp-spherically symmetric distributions is fairly small, suggesting that the distribution within these small patches (8×8) is captured reasonably well by a single Lp-norm. Furthermore, our results demonstrate that, at least for 8×8 patches, the assumptions of ISA are too rigid for WO-data and are trivially fulfilled for the CGC-data, since CGC already removed most of the dependencies.
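For reference, the ALL figures compared above are average log-losses per coefficient. As a hedged sketch (not the authors' code, and assuming base-2 logs per component), the following computes the ALL of samples under a factorial p-generalized normal model (the factorial special case of the Lp-spherically symmetric family, cf. [8, 22]) via scipy's gennorm distribution:

```python
import numpy as np
from scipy.stats import gennorm

def avg_log_loss(x, p, scale=1.0):
    """Average log-loss (ALL) in bits per component under a factorial
    p-generalized normal density with shape p and the given scale."""
    return float(-gennorm.logpdf(x, p, scale=scale).mean() / np.log(2))

# Sanity check: for samples drawn from the model itself, the ALL
# approaches the model's differential entropy per component.
x = gennorm.rvs(2.0, size=200_000, random_state=0)
print(avg_log_loss(x, 2.0))   # ≈ gennorm.entropy(2.0) / ln 2
```

Evaluating a mismatched model on the same samples (e.g., a wrong shape p) yields a strictly larger ALL, which is the sense in which the lower ALL values in Table 1 indicate a better fit.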
We are currently working to extend this study to larger patches, which we expect will reveal a more significant advantage for Lp-nested models.

References

[1] F. Attneave. Informational aspects of visual perception. Psychological Review, 61:183–193, 1954.
[2] R. Baddeley. Searching for filters with "interesting" output distributions: an uninteresting direction to explore? Network: Computation in Neural Systems, 7(2):409–421, 1996.
[3] H. B. Barlow. Sensory mechanisms, the reduction of redundancy, and intelligence. 1959.
[4] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, November 1995.
[5] M. Bethge. Factorial coding of natural images: how effective are linear models in removing higher-order dependencies? Journal of the Optical Society of America A, 23(6):1253–1268, June 2006.
[6] J. Eichhorn, F. Sinz, and M. Bethge. Natural image coding in V1: How much use is orientation selectivity? PLoS Computational Biology, 5(4):e1000336, April 2009.
[7] C. Fernandez, J. Osiewalski, and M. F. J. Steel. Modeling and inference with ν-spherical distributions. Journal of the American Statistical Association, 90(432):1331–1340, December 1995.
[8] I. R. Goodman and S. Kotz. Multivariate θ-generalized normal distributions. Journal of Multivariate Analysis, 3(2):204–219, June 1973.
[9] A. K. Gupta and D. Song. Lp-norm spherical distribution. Journal of Statistical Planning and Inference, 60:241–260, 1997.
[10] A. Hyvärinen and U. Köster. Complex cell pooling and the statistics of natural images. Network: Computation in Neural Systems, 18(2):81–100, 2007.
[11] S. Lyu and E. P. Simoncelli. Nonlinear extraction of 'independent components' of natural images using radial Gaussianization. Neural Computation, 21(6):1485–1519, June 2009.
[12] S. Lyu and E. P. Simoncelli. Reducing statistical dependencies in natural signals using radial Gaussianization. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1009–1016. MIT Press, 2009.
[13] S. Lyu and E. P. Simoncelli. Modeling multiscale subbands of photographic images with fields of Gaussian scale mixtures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):693–706, 2009.
[14] J. H. Manton. Optimization algorithms exploiting unitary constraints. IEEE Transactions on Signal Processing, 50:635–650, 2002.
[15] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.
[16] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, June 1996.
[17] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, June 2003.
[18] S. Roth and M. J. Black. Fields of experts: a framework for learning image priors. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 860–867, 2005.
[19] D. W. Scott. On optimal and data-based histograms. Biometrika, 66(3):605–610, December 1979.
[20] E. P. Simoncelli. Statistical models for images: compression, restoration and synthesis. In Conference Record of the Thirty-First Asilomar Conference on Signals, Systems & Computers, volume 1, pages 673–678, 1997.
[21] F. Sinz and M. Bethge. The conjoint effect of divisive normalization and orientation selectivity on redundancy reduction. In Neural Information Processing Systems 2008, 2009.
[22] F. H. Sinz, S. Gerwinn, and M. Bethge. Characterization of the p-generalized normal distribution. Journal of Multivariate Analysis, 100(5):817–820, May 2009.
[23] M. J. Wainwright and E. P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In Advances in Neural Information Processing Systems, volume 12, pages 855–861, 2000.
[24] C. Zetzsche, G. Krieger, and B. Wegmann. The atoms of vision: Cartesian or polar? Journal of the Optical Society of America A, 16(7):1554–1565, July 1999.