{"title": "What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 243, "page_last": 251, "abstract": "We study optimal image encoding based on a generative approach with non-linear feature combinations and explicit position encoding. By far most approaches to unsupervised learning learning of visual features, such as sparse coding or ICA, account for translations by representing the same features at different positions. Some earlier models used a separate encoding of features and their positions to facilitate invariant data encoding and recognition. All probabilistic generative models with explicit position encoding have so far assumed a linear superposition of components to encode image patches. Here, we for the first time apply a model with non-linear feature superposition and explicit position encoding. By avoiding linear superpositions, the studied model represents a closer match to component occlusions which are ubiquitous in natural images. In order to account for occlusions, the non-linear model encodes patches qualitatively very different from linear models by using component representations separated into mask and feature parameters. We first investigated encodings learned by the model using artificial data with mutually occluding components. We find that the model extracts the components, and that it can correctly identify the occlusive components with the hidden variables of the model. On natural image patches, the model learns component masks and features for typical image components. By using reverse correlation, we estimate the receptive fields associated with the model's hidden units. We find many Gabor-like or globular receptive fields as well as fields sensitive to more complex structures. 
Our results show that probabilistic models that capture occlusions and invariances can be trained efficiently on image patches, and that the resulting encoding represents an alternative model for the neural encoding of images in the primary visual cortex.", "full_text": "What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach\n\nZhenwen Dai\nUniversity of Sheffield, UK, and FIAS, Goethe-University Frankfurt, Germany\nz.dai@sheffield.ac.uk\n\nGeorgios Exarchakis\nRedwood Center for Theoretical Neuroscience, The University of California, Berkeley, US\nexarchakis@berkeley.edu\n\nJörg Lücke\nCluster of Excellence Hearing4all, University of Oldenburg, Germany, and BCCN Berlin, Technical University Berlin, Germany\njoerg.luecke@uni-oldenburg.de\n\nAbstract\n\nWe study optimal image encoding based on a generative approach with non-linear feature combinations and explicit position encoding. By far most approaches to unsupervised learning of visual features, such as sparse coding or ICA, account for translations by representing the same features at different positions. Some earlier models used a separate encoding of features and their positions to facilitate invariant data encoding and recognition. All probabilistic generative models with explicit position encoding have so far assumed a linear superposition of components to encode image patches. Here, we apply, for the first time, a model with non-linear feature superposition and explicit position encoding for patches. By avoiding linear superpositions, the studied model represents a closer match to component occlusions, which are ubiquitous in natural images. In order to account for occlusions, the non-linear model encodes patches in a way qualitatively very different from linear models, using component representations separated into mask and feature parameters. 
We first investigated encodings learned by the model using artificial data with mutually occluding components. We find that the model extracts the components, and that it can correctly identify the occlusive components with the hidden variables of the model. On natural image patches, the model learns component masks and features for typical image components. By using reverse correlation, we estimate the receptive fields associated with the model's hidden units. We find many Gabor-like or globular receptive fields as well as fields sensitive to more complex structures. Our results show that probabilistic models that capture occlusions and invariances can be trained efficiently on image patches, and that the resulting encoding represents an alternative model for the neural encoding of images in the primary visual cortex.\n\n1 Introduction\n\nProbabilistic generative models are used to mathematically formulate the generation process of observed data. Based on a good probabilistic model of the data, we can infer the processes that have generated a given data point, i.e., we can estimate the hidden causes of the generation. These hidden causes are usually the objects we want to infer knowledge about, be it for medical data, biological processes, or sensory data such as acoustic or visual data. However, real data are usually very complex, which makes the formulation of an exact data model infeasible. Image data are a typical example of such complex data. The true generation process of images involves, for instance, different objects with different features at different positions, mutual occlusions, object shades, lighting conditions and reflections due to self-structure and nearby objects.\n\nFigure 1: An illustration of the generation process of our model.\n\n
Even if a generative model can capture some of these features, an inversion of the model using Bayes' rule very rapidly becomes analytically and computationally intractable. As a consequence, generative modelers make compromises to allow for the trainability and applicability of their generative approaches.\n\nTwo properties that have long been identified as crucial for models of images are object occlusions [1-5] and the invariance of object identity to translations [6-13]. However, models incorporating both occlusions and invariances suffer from a very pronounced combinatorial complexity. They could, so far, only be trained with very low dimensional hidden spaces [2, 14, 15]. At first glance, occlusion modeling is, furthermore, mathematically more inconvenient. For these reasons, many studies, including style and content models [16], other bi-linear models [17, 18], invariant sparse coding [19, 20], and invariant NMF [21], do not model occlusions. Analytical and computational reasons are often explicitly stated as the main motivation for the use of the linear superposition of components (see, e.g., [16, 17]).\n\nIn this work, we study, for the first time, the encoding of natural image patches using a model with both non-linear feature combinations and translation invariances.\n\n2 A Generative Model with Non-linear and Invariant Components\n\nThe model used to study image patch encoding assumes an exclusive component combination, i.e., exactly one cause is made responsible for each pixel. It thus shares the property of exclusiveness with visual occlusions. The model will later be shown to capture occluding components. We will, however, not model occlusion explicitly using a depth variable (compare [2]) but will focus on the exclusiveness property. The applied model is a novel version of the invariant occlusive components model studied earlier for mid-level vision [22]. 
We first briefly reiterate the basic model in the following and discuss the main differences of the new version afterwards.\n\nWe consider image patches \vec{y} with D^2 observed scalar variables, \vec{y} = (y_1, ..., y_{D^2}). An image patch is assumed to contain a subset from a set of H components. Each component h can be located at a different position, denoted by an index variable x_h \in \{1, ..., D^2\}, which is associated with a set of permutation matrices covering all possible planar translations, \{T_1, ..., T_{D^2}\} (similar formulations have also been used in sprite models [14, 15]). Each component h is modeled to appear in an image patch with probability \pi_h \in (0, 1). Following [22], we do not model component presence and absence explicitly but, for mathematical convenience, assign the special 'position' -1 to all components that are not chosen to generate the patch. Assuming a uniform distribution over positions, the prior distribution for components and their positions is thus given by:\n\np(\vec{x} | \vec{\pi}) = \prod_h p(x_h | \pi_h), \qquad p(x_h | \pi_h) = \begin{cases} 1 - \pi_h, & x_h = -1 \\ \frac{\pi_h}{D^2}, & \text{otherwise} \end{cases} \qquad (1)\n\nwhere the hidden variable \vec{x} = (x_1, ..., x_H) contains the information on the presence/absence and position of all image components.\n\nIn contrast to linear models, the studied approach requires two sets of parameters for the encoding of image components: component masks and component features. Component masks describe where an image component is located, and component features describe what a component encodes (compare [2, 3, 14, 15]). High values of the mask parameters \vec{\alpha}_h encode the pixels most associated with a component h, but the encoding has to be understood relative to a global component position. The feature parameters \vec{w}_h encode the values of a component's features. Fig.
1 shows an example of the mask and feature parameters for two typical low-level visual features. Given a particular position, the mask and feature parameters of the component are transformed to the target position by multiplication with the corresponding translation matrix, i.e., T_{x_h} \vec{\alpha}_h and T_{x_h} \vec{w}_h. When generating an image patch, two or more components may occupy the same pixel, but in accordance with occlusion the pixel value is exclusively determined by only one of them. This exclusiveness is formulated by defining a mask variable \vec{m} = (m_1, ..., m_{D^2}). For a pixel at position d, m_d determines which component is responsible for the pixel value. Therefore, m_d takes a value from the set of present components \Gamma = \{h \,|\, x_h \neq -1\} plus a special value "0" indicating the background, and the prior distribution of \vec{m} is defined as:\n\np(\vec{m} | \vec{x}, A) = \prod_{d=1}^{D^2} p(m_d | \vec{x}, A), \qquad p(m_d = h | \vec{x}, A) = \begin{cases} \frac{\alpha_0}{\alpha_0 + \sum_{h' \in \Gamma} (T_{x_{h'}} \vec{\alpha}_{h'})_d}, & h = 0 \\ \frac{(T_{x_h} \vec{\alpha}_h)_d}{\alpha_0 + \sum_{h' \in \Gamma} (T_{x_{h'}} \vec{\alpha}_{h'})_d}, & h \in \Gamma \end{cases} \qquad (2)\n\nwhere A = (\vec{\alpha}_1, ..., \vec{\alpha}_H) contains the mask parameters of all components, and \alpha_0 defines the mask parameter of the background. The mask variable m_d chooses a component h with high likelihood if the translated mask parameter of the corresponding component is high at the position d. 
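A minimal sketch of this generative process, covering the prior (1), the mask distribution (2), and the Gaussian pixel model of Eq. (3) below. This is a hypothetical illustration, not the paper's implementation: patches are one-dimensional with D pixels, the translations T_x are reduced to cyclic shifts, and all default parameter values are invented for the example.

```python
import random

def sample_patch(pi, alpha, w, alpha0=1.0, B=0.0, sigma=0.05, sigma_B=0.01, rng=random):
    """Sample hidden variables and one patch from the generative model (Eqs. 1-3).

    Simplification (not from the paper): 1-D patches with cyclic shifts as the
    translations T_x; the model in the text uses D x D patches with planar
    translations.
    """
    H, D = len(pi), len(w[0])
    # Eq. (1): component h is present with prob. pi_h; its position is
    # uniform over the D pixels, and x_h = -1 marks absence.
    x = [rng.randrange(D) if rng.random() < pi[h] else -1 for h in range(H)]
    present = [h for h in range(H) if x[h] != -1]
    y = []
    for d in range(D):
        # Translated mask values (T_{x_h} alpha_h)_d of the present components.
        mass = {h: alpha[h][(d - x[h]) % D] for h in present}
        Z = alpha0 + sum(mass.values())
        # Eq. (2): draw the responsible component m_d (None = background "0").
        u = rng.random() * Z - alpha0
        m_d = None
        for h in present:
            if u < 0:
                break
            m_d = h
            u -= mass[h]
        # Eq. (3): Gaussian pixel noise around the feature value or background.
        mean = B if m_d is None else w[m_d][(d - x[m_d]) % D]
        y.append(rng.gauss(mean, sigma_B if m_d is None else sigma))
    return x, y
```

The categorical draw of m_d directly mirrors the normalization over alpha_0 and the translated masks in Eq. (2): components with a high shifted mask value at pixel d are the most likely to claim that pixel.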
Note that m_d follows a mixture model given the presence/absence and positions of all components \vec{x}. This can be thought of as an approximation to the distribution of mask variables obtained by marginalizing over the depth orderings and pixel transparencies in the exact occlusive model (see Supplement A for a comparison). After drawing the values of the hidden variables \vec{x} and \vec{m}, an image patch can be generated with a Gaussian noise model:\n\np(\vec{y} | \vec{m}, \vec{x}, \Theta) = \prod_{d=1}^{D^2} p(y_d | m_d, \vec{x}, \Theta), \qquad p(y_d | m_d = h, \vec{x}, \Theta) = \begin{cases} \mathcal{N}(y_d; B, \sigma_B^2), & h = 0 \\ \mathcal{N}(y_d; (T_{x_h} \vec{w}_h)_d, \sigma^2), & h \in \Gamma \end{cases} \qquad (3)\n\nwhere \sigma^2 is the variance of the components, and \Theta = (\vec{\pi}, W, A, \sigma^2, \alpha_0, B, \sigma_B^2) are all the model parameters. The background distribution is a Gaussian with mean B and variance \sigma_B^2. Compared to an occlusive model with exact EM (see Supplement A), our approach uses the exclusiveness approximation and a truncated posterior approximation in order to make learning tractable.\n\nThe model described in (1) to (3) has been optimized for the encoding of image patches. First, feature variables are scalar in order to encode light intensities or input from the lateral geniculate nucleus (LGN) rather than the color features used for mid-level vision. Second, to capture the frequency of presence of individual components, we implement learning of the prior parameter of presence \vec{\pi}. Third, the pre-selection function in the variational approximation (see below) has been adapted to the use of scalar-valued features. As a scalar value is much less distinctive than the sophisticated image features used in [22], the pre-selection of components has been changed to operate on the complete component instead of only salient features.\n\n3 Efficient Likelihood Optimization\n\nGiven a set of image patches Y = (\vec{y}^{(1)}, . .
. , \vec{y}^{(N)}), learning is formulated as inferring the best model parameters w.r.t. the log-likelihood L = log p(Y | \Theta). Following the Expectation Maximization (EM) approach, the parameter update equations are derived. The update equations for the mask parameters \vec{\alpha}_h and the feature parameters \vec{w}_h are the same as in [22]. Additionally, we derived the update equation for the prior parameter of presence:\n\n\pi_h = \frac{1}{N} \sum_{n=1}^{N} \sum_{\vec{x}: x_h \neq -1} p(\vec{x} | \vec{y}^{(n)}, \Theta). \qquad (4)\n\nBy learning the prior parameters \pi_h, the probabilities of individual components' presence can be estimated. This allows us to gain more insight into the statistics of image components. In the update equations, a posterior distribution has to be estimated for each data point, which corresponds to the E-step of the EM algorithm. The posterior distribution of our model can be decomposed as:\n\np(\vec{m}, \vec{x} | \vec{y}, \Theta) = p(\vec{x} | \vec{y}, \Theta) \prod_{d=1}^{D^2} p(m_d | \vec{x}, \vec{y}, \Theta), \qquad (5)\n\nin which p(\vec{x} | \vec{y}, \Theta) and p(m_d | \vec{x}, \vec{y}, \Theta) are estimated separately. Computing the exact distribution p(\vec{x} | \vec{y}, \Theta) is intractable, as it involves the combinatorics of the presence/absence of components and their positions. An efficient posterior approximation, Expectation Truncation (ET), has been successfully employed. ET approximates the posterior distribution by a truncated distribution [23]:\n\np(\vec{x} | \vec{y}, \Theta) \approx \frac{p(\vec{y}, \vec{x} | \Theta)}{\sum_{\vec{x}' \in K_n} p(\vec{y}, \vec{x}' | \Theta)}, \quad \text{if } \vec{x} \in K_n, \qquad (6)\n\nand zero otherwise. 
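The truncated E-step of Eq. (6) and the prior update of Eq. (4) can be sketched as follows. This is a toy rendering under assumptions of our own: hidden states \vec{x} are tuples of positions with -1 marking absence, each candidate set K_n is given explicitly, and the function names are invented for the example.

```python
import math

def truncated_posterior(log_joint, K):
    """Eq. (6): renormalize the joint p(y, x | Theta) over the candidate set
    K_n; states outside K_n receive zero mass. Uses a log-sum-exp shift for
    numerical stability."""
    logs = {x: log_joint(x) for x in K}
    m = max(logs.values())
    Z = sum(math.exp(v - m) for v in logs.values())
    return {x: math.exp(v - m) / Z for x, v in logs.items()}

def update_pi(posteriors, H):
    """Eq. (4): pi_h = (1/N) sum_n sum_{x in K_n: x_h != -1} q_n(x)."""
    N = len(posteriors)
    pi = [0.0] * H
    for q in posteriors:                 # one truncated posterior per patch
        for x, p in q.items():
            for h in range(H):
                if x[h] != -1:           # component h is present in state x
                    pi[h] += p / N
    return pi
```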
If K_n is chosen to be small but to contain the states with most of the posterior probability mass, the computation of the posterior distribution becomes tractable while a high accuracy of the approximation can be maintained [23]. To select a proper subspace K_n, \tau features (pixel intensities) are chosen according to their mask parameters. Based on the chosen features, a score value S(x_h) is computed for each component at each position (see [22]). We select H' components, denoted by \mathcal{H}, as candidates that may appear in the given image, according to the probability p(\vec{y}, \check{x}_h | \Theta). Here, \check{x}_h corresponds to the vector \vec{x} with x_h = x^*_h and all other components absent (x_{h'} = -1, h' \neq h), where x^*_h is the best position of the component h w.r.t. S(x_h). This is different from the earlier work [22], where K_n is constructed directly according to S(x_h). For each component, we select the set of its candidate positions X_h, x_h \in X_h, which contains the p best positions w.r.t. S(x_h). The truncated subspace K_n is then defined as:\n\nK_n = \{ \vec{x} \,|\, (\sum_j s_j \le \gamma \text{ and } s_i = 0 \;\forall i \notin \mathcal{H}) \text{ or } \sum_{j'} s_{j'} \le 1 \}, \qquad (7)\n\nwhere s_h represents the presence/absence state of the component h (s_h = 0 if x_h = -1 or x_h \notin X_h, and s_h = 1 if x_h \in X_h). To avoid convergence to local optima, we used a deterministic annealing scheme [22] for our learning algorithm.\n\nFigure 2: Numerical experiments on artificial data. (a) Eight samples of the generated data sets. (b) The parameters of the eight components used to generate the data set: the 1st row contains the binary transparency parameters and the 2nd row shows the feature parameters. (c) The learned model parameters (H = 9). The top plot shows the learned prior probabilities \vec{\pi}; the 1st row shows the mask parameters A; the 2nd row shows the feature parameters W; the 3rd row gives a visualization of only the frequently used elements/pixels (setting the feature parameters w_{hd} of elements/pixels with \alpha_{hd} < 0.5 to zero). (d) The result of inference given an image patch (shown on the left). The right side shows the four components inferred to be present (one per column): the 1st and 2nd rows show the mask and feature parameters shifted according to the MAP inference \vec{x}^{MAP}, and the 3rd row shows the inferred posterior p(m_d | \vec{x}^{MAP}, \vec{y}, \Theta). All plots are heat map (Jet color map) visualizations of scalar values.\n\n4 Numerical Experiments on Artificial Data\n\nThe goal of the experiments on artificial data is to verify that the model and inference method can recover the correct parameters, and to investigate inference on data generated according to occlusions with an explicit depth variable. We generated 4 \times 4 gray-scale image patches. The data set uses eight different components, four vertical 'bars' and four horizontal 'bars'; each bar has a different intensity and a binary vector indicating its 'transparency' (1 for non-transparent and 0 for transparent, see Fig. 2b). When generating an image patch, a subset of components is selected according to their prior probabilities \pi_h = 0.25, and the selected components are combined according to a random depth ordering (flat priors on the ordering). A component with smaller depth occludes components with larger depth, and for each image patch we sample a new depth ordering. For pixels in which all selected components are transparent, the value is determined by the background with zero intensity (B = 0). 
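The data generation just described can be sketched as below. This is a noise-free sketch (the Gaussian pixel noise described next is omitted); the bar intensities are invented illustrative values, and each bar is taken to be fully opaque along its row or column.

```python
import random

def generate_bar_patch(D=4, pi=0.25, B=0.0, rng=random):
    """Generate one D x D patch from D vertical and D horizontal bars.

    Each bar is present with probability pi; present bars are combined under
    a random depth ordering, nearer bars occluding farther ones (Sec. 4).
    """
    bars = []
    for i in range(D):
        bars.append(('v', i, 0.2 + 0.1 * i))   # vertical bar in column i
        bars.append(('h', i, 0.6 + 0.1 * i))   # horizontal bar in row i
    chosen = [b for b in bars if rng.random() < pi]
    rng.shuffle(chosen)                        # chosen[0] is the nearest bar
    patch = [[B] * D for _ in range(D)]        # background intensity B = 0
    for orient, idx, val in reversed(chosen):  # paint far-to-near
        for k in range(D):
            r, c = (k, idx) if orient == 'v' else (idx, k)
            patch[r][c] = val
    return patch, chosen
```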
All pixels generated by components are subject to Gaussian noise with \sigma = 0.02, and pixels belonging to the background have Gaussian noise with \sigma_B = 0.001. In total, we generated N = 1,000 image patches. Fig. 2a shows eight samples. The artificial data is similar to data generated by the occlusive components analysis model (OCA; [2]), except for the use of scalar features and the assumption of shift invariance.\n\nFig. 2c shows the learned model parameters on the generated data set. We learned nine components (H = 9). The initial feature values W were set to randomly selected data points. The initial mask parameters A were independently and uniformly drawn from the interval (0, 1). The initial annealing temperature was set to T = 5. After being kept constant for 20 iterations, the temperature linearly decreased to 1 over 100 iterations. For robustness of learning, \sigma decreased together with the temperature from 0.2 to 0.02, and additive Gaussian noise with zero mean and \sigma_w = 0.04 was injected into W, with \sigma_w gradually decreased to zero. The algorithm terminated when the temperature was equal to 1 and the difference of the pseudo data log-likelihood between two consecutive iterations was sufficiently small (less than 0.1%). The approximation parameters used in learning were H' = 8, \gamma = 4, p = 2 and \tau = 3. In this result, all eight generative components have been successfully learned. The 2nd-to-last component (see Fig. 2c) is a dummy component (low \pi_h, i.e., very rarely used); its single-pixel structure is therefore an artifact. With the learned parameters, the model could infer the present components, their positions and the pixel-to-component assignment. Fig. 2d shows a typical example. Given the image patch on the left, the present components and their positions are correctly inferred. 
Furthermore, as shown in the 3rd row, the posterior probabilities of the mask variable p(m_d | \vec{x}, \vec{y}, \Theta) give a clear assignment of the contributing component for each pixel. This information is potentially very valuable for tasks like parts-based object segmentation or for inferring the depth ordering among the components. We assess the reliability of our learning algorithm by repeating the learning procedure with the same configuration but different random parameter initializations. The algorithm recovers all the generative components in 11 out of 20 runs. The 9 runs not recovering all bars still recovered reasonable solutions, usually representing 7 out of the 8 bars. In general, bar stimuli seem to induce much more pronounced local optima than, e.g., image patches.\n\n5 Numerical Experiments on Image Patches\n\nAfter verifying the inference and learning algorithm on artificial data, we applied it to patches of natural images. As training set we used N = 100,000 patches of size 16 \times 16 pixels extracted at random positions from random images of the van Hateren natural image database [24]. We modeled the sensitivity of neurons in the LGN using a difference-of-Gaussians (DoG) filter at different positions, i.e., we processed all patches by convolving them with a DoG kernel. Following earlier studies (see [5] for references), the ratio between the standard deviations of the positive and the negative Gaussian was chosen to be 1/3, and the amplitudes were chosen to obtain a mean-free center-surround filter. Fig. 3a shows some samples of the image patches after preprocessing.\n\nOur algorithm learned H = 100 components from the natural image data set. The model parameters were initialized in the same way as for the artificial data. The annealing temperature was initialized with T = 10, kept constant for 10 iterations, and then linearly decreased to 1 over 100 iterations. 
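The DoG preprocessing described above can be sketched as follows. The 1/3 ratio of center to surround standard deviation and the mean-free constraint follow the text; the kernel size and the center width are assumptions chosen for the example.

```python
import math

def dog_kernel(size=7, sigma_c=1.0):
    """Difference-of-Gaussians kernel for LGN-like center-surround filtering.

    Center/surround std ratio is 1/3 (sigma_s = 3 * sigma_c); the mean is
    subtracted at the end so that the truncated kernel is exactly mean-free.
    """
    r = size // 2
    def g(d2, s):  # isotropic 2-D Gaussian density at squared distance d2
        return math.exp(-d2 / (2.0 * s * s)) / (2.0 * math.pi * s * s)
    k = [[g(xx * xx + yy * yy, sigma_c) - g(xx * xx + yy * yy, 3.0 * sigma_c)
          for xx in range(-r, r + 1)] for yy in range(-r, r + 1)]
    cells = (2 * r + 1) ** 2
    mean = sum(map(sum, k)) / cells
    return [[v - mean for v in row] for row in k]
```

Convolving each patch with such a kernel yields center-surround filtered patches like those shown in Fig. 3a.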
\sigma decreased together with the temperature from 0.5 to 0.2, and additive Gaussian noise with zero mean and \sigma_w = 0.2 was injected into W, with \sigma_w gradually decreased to zero. The approximation parameters used for learning were H' = 6, \gamma = 4, p = 2 and \tau = 50. After 134 iterations, the model parameters had essentially converged.\n\nFigs. 3b,c show the learned mask parameters and the learned feature values for all 100 components. Mask parameters define the frequently used areas within a component, and feature parameters reveal the appearance of a component in image patches. As can be observed, image components are represented very differently than in linear models. See the component in Fig. 3d as an example: the mask parameters are localized and all positive; the feature parameters have positive and negative values across the whole patch. Masks and features can be combined, via point-wise multiplication, to resemble a familiar Gabor function (see Fig. 3d). All shown component representations are sorted in descending order according to the learned prior probabilities of occurrence \vec{\pi} (see Fig. 3e).\n\n6 Estimation of Receptive Fields\n\nFor visualization, mask and feature parameters can be combined via point-wise multiplication. To interpret the learned components more systematically and quantitatively, and to compare them to biological experimental findings, we estimated the predicted receptive fields (RFs). RF estimates were computed with reverse correlation based on the model inference results. Reverse correlation can be defined as a procedure to find the best linear approximation of the components' presence given an image patch \vec{y}^{(n)}. More formally, we search for a set of predicted receptive fields \vec{R}_h, h \in {1, . .
. , H} that minimize the following cost function:\n\nf = \frac{1}{N} \sum_n \sum_{\vec{x} \in K_n} p(\vec{x} | \vec{y}^{(n)}, \Theta) \sum_h (\vec{R}_h^T \bar{T}_{x_h} \vec{y}^{(n)} - s_h)^2 + \lambda \sum_h \vec{R}_h^T \vec{R}_h, \qquad (8)\n\nwhere \vec{y}^{(n)} is the nth stimulus and \lambda is the coefficient of the L2 regularization. s_h is a binary variable representing the presence/absence state of the component h, where s_h = 0 if x_h = -1 and s_h = 1 otherwise.\n\nFigure 3: The invariant occlusive components from natural image patches. (a) shows 20 samples of the pre-processed image patches. (b) shows the mask parameters and (c) shows the feature parameters. (d) shows an example of the relation between the learned model parameters and the estimated RFs. (e) shows the learned prior probabilities \vec{\pi}. (f) shows the estimated receptive fields (RFs). The RFs were fitted with 2-dimensional Gabor and DoG functions. The dashed lines mark the RFs that have a more globular structure, the solid lines mark the RFs that were fitted accurately by a Gabor function, and the dotted lines mark the RFs that were not approximated very well by the fitted function. All shown model parameters in (b-c) and receptive fields in (f) are sorted in descending order according to \vec{\pi}. The plots (a-d) and (f) are heat map visualizations with local scaling on individual fields (Jet color map), and (a), (c) and (f) fix light green to be zero.\n\nAs our model allows the components to be at different locations, the reverse correlation is computed by shifting the stimuli according to the inferred location of each component. \bar{T}_{x_h} represents the transformation matrix applied to the stimulus for the component h, which is the inverse of the inferred transformation T_{x_h} (\bar{T}_{x_h} T_{x_h} = \mathbf{1}). 
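A sketch of this ridge-regularized reverse correlation for a single component h (its closed-form solution appears as Eq. (9) below). Two simplifying assumptions of our own: the stimuli are taken to be already shifted back by the inferred positions (i.e., they play the role of \bar{T}_{x_h} \vec{y}^{(n)}), and the posterior expectations of s_h are replaced by hard 0/1 presence values; the small Gaussian-elimination solver is likewise just for self-containedness, not the paper's implementation.

```python
def solve(A, b):
    """Solve A r = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def receptive_field(shifted_stimuli, s, lam):
    """Ridge solution of the reverse-correlation cost (Eqs. 8-9) for one h:
    R_h = (lam * N * I + sum_n y y^T)^(-1) (sum_n s_n y)."""
    N, D = len(shifted_stimuli), len(shifted_stimuli[0])
    C = [[lam * N * (i == j) for j in range(D)] for i in range(D)]
    b = [0.0] * D
    for n in range(N):
        y = shifted_stimuli[n]
        for i in range(D):
            b[i] += s[n] * y[i]
            for j in range(D):
                C[i][j] += y[i] * y[j]
    return solve(C, b)
```

The lam * N term on the diagonal plays exactly the stabilizing role discussed below: it lifts the near-zero eigenvalues of the data covariance matrix so that the linear system stays well-conditioned.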
For the absent components, the stimulus is used without any transformation (T_{-1} = \mathbf{1}). Due to the intractability of computing the exact posterior distribution, the cost function only sums, for each data point, over the truncated subspace K_n of the variational approximation (see Sec. 3). By setting the derivative of the cost function to zero, \vec{R}_h can be estimated as:\n\n\vec{R}_h = \Big( \lambda N \mathbf{1} + \sum_n \big\langle \bar{T}_{x_h} \vec{y}^{(n)} (\bar{T}_{x_h} \vec{y}^{(n)})^T \big\rangle_{q_n} \Big)^{-1} \Big( \sum_n \big\langle s_h \, \bar{T}_{x_h} \vec{y}^{(n)} \big\rangle_{q_n} \Big), \qquad (9)\n\nwhere \langle \cdot \rangle_{q_n} denotes the expectation value w.r.t. the posterior distribution p(\vec{x} | \vec{y}^{(n)}, \Theta) and \mathbf{1} is the identity matrix. When solving for \vec{R}_h, we often observe that many of the eigenvalues of the data covariance matrix \sum_{n=1}^{N} \langle \bar{T}_{x_h} \vec{y}^{(n)} (\bar{T}_{x_h} \vec{y}^{(n)})^T \rangle_{q_n} are close to zero, which makes the solution for \vec{R}_h very unstable. Therefore, we introduce the L2 regularization in the cost function. The regularization coefficient \lambda is chosen between the minimum and maximum element of the data covariance matrix. The estimated receptive fields are not sensitive to the value of the regularization coefficient \lambda as long as \lambda is large enough to resolve the numerical instability (see Supplement for a comparison of the receptive fields estimated with different \lambda values). From the experiments with artificial data and natural image patches, we observed that the L2 regularization successfully eliminated the numerical instability problem.\n\nFig. 3f shows the RFs estimated according to our model. For further analysis, we matched the RFs with Gabor functions and DoG functions as suggested in [5]. 
If we factored in the occurrence probabilities, we found that the model considered about 17% of all components of the patches to be globular, 56% to be Gabor-like and 27% to have another structure (see Supplement for details). The prevalence of 'center-on' globular fields may be a consequence of the prevalence of convex object shapes.\n\n7 Discussion\n\nThe encoding of image patches investigated in this study separates the feature and position information of visual components. Functionally, such an encoding has been found very useful, e.g., for the construction of object recognition systems. Many state-of-the-art systems for visual object classification make use of convolutional neural networks [12, 25, 26]. Such networks compute the responses of a set of filters at all positions in a predefined area and use the maximal response for further processing (see [12] for a review). If we identify the predefined area with one image patch as processed by our approach, then the encoding studied here is to some extent similar to that of convolutional networks: (A) like convolutional networks, it uses one set of component parameters for all positions; and (B) a hidden component variable of the generative model integrates or 'pools' the information across all positions. As the approach studied here is based on a generative data model, the integration across positions can directly be interpreted as an inversion of the generation process. Crucially, the inversion can take occlusions of visual features into account, while convolutional networks do not model occlusions. Furthermore, the generative model uses a probabilistic encoding, i.e., it assigns probabilities to positions and features of a joint feature and position space. Ambiguous visual input can therefore be represented appropriately. In contrast, convolutional networks use one position for each feature as their representation. 
In this sense, a convolutional encoding could be regarded as a MAP estimate of the feature position, while the generative integration could be interpreted as probabilistic pooling. Many bilinear models have also been applied to image patches, e.g., [17, 18]. Such studies do report that neurally plausible receptive fields (RFs) in the form of Gabor functions emerge [17, 18]. Likewise, invariant versions of NMF [21] or ICA (in the form of ISA [9]) have been applied to image patches. In addition to Gabors, we observed in our study a large variety of further types of RFs: Gabor filters with different orientations, phases and frequencies, as well as globular fields and fields with more complex structures (Fig. 3f). While Gabors have been studied for several decades, globular and more complex fields have attracted attention only in the last couple of years. In particular, globular fields have attracted attention [5, 27, 28] as they have been reported together with Gabors in macaques and other species ([29]; see [5] for further references). Such fields have been associated with occlusions before [5, 28, 30]; our study now reports, for the first time, globular fields for an occlusive and translation invariant approach. The results may be taken as further evidence of the connection between occlusions and globular fields. However, linear convolutional approaches have also recently reported such fields [19, 31]. Linear approaches seem to require a high degree of overcompleteness or specific priors, while globular fields naturally emerge for occlusion-like non-linearities. More concretely: for non-invariant linear sparse coding, globular fields only emerged from a sufficiently high degree of overcompleteness onwards [32, 33] or with specific prior settings and overcompleteness [27]; for non-invariant occlusive models [5, 30], globular fields always emerge alongside Gabors for any overcompleteness. 
The results reported here can be taken as confirming this observation for position invariant encoding. The invariant non-linear model assigns high degrees of occurrence (high πh) to Gabor-like and to globular fields (first rows in Fig. 3f). Components with more complex structures are assigned lower occurrence frequencies. In total, the model assumes a fraction between 10 and 20% of all data components to be globular. Such high percentages may be related to the high percentages of globular fields (∼16–23%) measured in vivo ([29] and [5] for references). In contrast, the highest degrees of occurrence, e.g., for convolutional matching pursuit [31], seem to be assigned exclusively to Gabor features. Globular fields only emerge (alongside other non-Gabor fields) for higher degrees of overcompleteness. A direct comparison in terms of occurrence frequencies is difficult because the linear models do not infer occurrence frequencies from data. The closest match to such frequencies would be an (inverse) sparsity, which is set by hand for almost all linear approaches. The reason is the use of MAP-based point estimates, while our approach uses a more probabilistic posterior estimate.

Because of their separate encoding of features and positions, all models with separate position encoding can represent high degrees of over-completeness. Convolutional matching pursuit [31] shows results for up to 64 filters of size 8 × 8. With 8 horizontal and 8 vertical shifts, the number of non-invariant components would amount to 8 × 8 × 64 = 3136. Convolutional sparse coding [19] reports results assuming 128 components for 9 × 9 patches. The number of non-invariant components would therefore amount to 10,368. For our network we obtained results for up to 100 components of size 16 × 16. With 16 horizontal and 16 vertical shifts, this amounts to 25,600 non-invariant components.
In terms of components per observed variable, invariant models are therefore now computationally feasible in a regime the visual cortex is estimated to operate in [33].

The hidden units associated with component features are fully translation invariant. In terms of neural encoding, their insensitivity to stimulus shifts would therefore place them into the category of V1 complex cells. Also globular fields, or fields that seem sensitive to structures such as corners, would warrant such units the label ‘complex cell’. No hidden variable in the model can directly be associated with simple cell responses. However, a possible neural network implementation of the model is an explicit representation of component features at different positions. The weight sharing of the model would be lost, but units with an explicit non-invariant representation could correspond to simple cells. While such a correspondence can connect our predictions to experimental studies of simple cells, recently developed approaches for the estimation of translation invariant cell responses [34, 35] can represent a more direct connection. To approximately implement the non-linear generative model neurally, the integration of information would have to be a very active process. In contrast to passive pooling mechanisms across units representing linear filters (such as simple cells), it would involve neural units with explicit position encoding. Such units would control or ‘gate’ the information transfer from simple cells to downstream complex cells. As such, our probabilistic model can be related to ideas of active control units for individual components [6, 7, 10, 11, 36] (also compare [37]). A notable difference to all these models is that the approach studied here allows active control to be interpreted as optimal inference w.r.t. a generative model of translations and occlusions.

Future work can go in different directions.
Different transformations could be considered or learned [37], explicit modeling in time could be incorporated (compare [17]), and/or further hierarchical stages could be considered. The crucial challenge all such developments face is computational intractability due to large combinatorial hidden spaces. Based on the presented results, we believe, however, that advances in analytical and computational training technology will enable an increasingly sophisticated modeling of image patches in the future.

Acknowledgement.
We thank Richard E. Turner for helpful discussions and acknowledge funding by DFG grant LU 1196/4-2.

References
[1] D. Mumford and B. Gidas. Stochastic models for generic images. Q. Appl. Math., 59:85–111, 2001.
[2] J. Lücke, R. Turner, M. Sahani, and M. Henniges. Occlusive Components Analysis. NIPS, 22:1069–77, 2009.
[3] N. LeRoux, N. Heess, J. Shotton, and J. Winn. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23:593–650, 2011.
[4] D. Zoran and Y. Weiss. Natural images, Gaussian mixtures and dead leaves. NIPS, 25:1745–1753, 2012.
[5] J. Bornschein, M. Henniges, and J. Lücke. Are V1 receptive fields shaped by low-level visual occlusions? A comparative study. PLoS Computational Biology, 9(6):e1003062, 2013.
[6] G. E. Hinton. A parallel computation that assigns canonical object-based frames of reference. In Proc. IJCAI, pages 683–685, 1981.
[7] C. H. Anderson and D. C. Van Essen. Shifter circuits: a computational strategy for dynamic aspects of visual processing. PNAS, 84(17):6297–6301, 1987.
[8] M. Lades, J. Vorbrüggen, J. Buhmann, J. Lange, C. v. d. Malsburg, R. Würtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3):300–311, 1993.
[9] A. Hyvärinen and P. Hoyer.
Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–20, 2000.
[10] D. W. Arathorn. Map-Seeking Circuits in Visual Cognition — A Computational Mechanism for Biological and Machine Vision. Stanford Univ. Press, Stanford, California, 2002.
[11] J. Lücke, C. Keck, and C. von der Malsburg. Rapid convergence to feature layer correspondences. Neural Computation, 20(10):2441–2463, 2008.
[12] Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pages 253–6, 2010.
[13] Y. Hu, K. Zhai, S. Williamson, and J. Boyd-Graber. Modeling Images using Transformed Indian Buffet Processes. In ICML, 2012.
[14] N. Jojic and B. Frey. Learning flexible sprites in video layers. In CVPR, 2001.
[15] C. K. I. Williams and M. K. Titsias. Greedy learning of multiple objects in images using robust statistics and factorial learning. Neural Computation, 16:1039–62, 2004.
[16] J. B. Tenenbaum and W. T. Freeman. Separating Style and Content with Bilinear Models. Neural Computation, 12(6):1247–83, 2000.
[17] P. Berkes, R. E. Turner, and M. Sahani. A structured model of video reproduces primary visual cortical organisation. PLoS Computational Biology, 5(9):e1000495, 2009.
[18] C. F. Cadieu and B. A. Olshausen. Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24(4):827–866, 2012.
[19] K. Kavukcuoglu, P. Sermanet, Y. L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. NIPS, 23:14, 2010.
[20] K. Gregor and Y. LeCun. Efficient learning of sparse invariant representations. CoRR, abs/1105.5307, 2011.
[21] J. Eggert, H. Wersing, and E. Körner.
Transformation-invariant representation and NMF. In 2004 IEEE International Joint Conference on Neural Networks, pages 2535–39, 2004.
[22] Z. Dai and J. Lücke. Unsupervised learning of translation invariant occlusive components. In CVPR, pages 2400–2407, 2012.
[23] J. Lücke and J. Eggert. Expectation truncation and the benefits of preselection in training generative models. Journal of Machine Learning Research, 11:2855–900, 2010.
[24] J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265:359–66, 1998.
[25] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, volume 25, pages 1106–1114, 2012.
[27] M. Rehn and F. T. Sommer. A network that uses few active neurones to code visual input predicts the diverse shapes of cortical receptive fields. Journal of Computational Neuroscience, 22(2):135–46, 2007.
[28] J. Lücke. Receptive field self-organization in a model of the fine-structure in V1 cortical columns. Neural Computation, 21(10):2805–45, 2009.
[29] D. L. Ringach. Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. Journal of Neurophysiology, 88:455–63, 2002.
[30] G. Puertas, J. Bornschein, and J. Lücke. The maximal causes of natural scenes are edge filters. In NIPS, volume 23, pages 1939–1947, 2010.
[31] A. Szlam, K. Kavukcuoglu, and Y. LeCun. Convolutional matching pursuit and dictionary training. arXiv preprint arXiv:1010.0422, 2010.
[32] B. A. Olshausen, C. F. Cadieu, and D. K. Warland.
Learning real and complex overcomplete representations from the statistics of natural images. In Proc. SPIE, volume 7446, page 74460S, 2009.
[33] B. A. Olshausen. Highly overcomplete sparse coding. In Proc. of HVEI, page 86510S, 2013.
[34] M. Eickenberg, R. J. Rowekamp, M. Kouh, and T. O. Sharpee. Characterizing responses of translation-invariant neurons to natural stimuli: maximally informative invariant dimensions. Neural Computation, 24(9):2384–421, 2012.
[35] B. Vintch, A. Zaharia, J. A. Movshon, and E. P. Simoncelli. Efficient and direct estimation of a neural subunit model for sensory coding. In Proc. of NIPS, pages 3113–3121, 2012.
[36] B. Olshausen, C. Anderson, and D. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J Neuroscience, 13(11):4700–4719, 1993.
[37] R. Memisevic and G. E. Hinton. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492, 2010.
[38] M. J. D. Powell. An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, 7(2):155–162, 1964.