{"title": "Learning Horizontal Connections in a Sparse Coding Model of Natural Images", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 512, "abstract": "It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1, and we discuss the implications of our findings for physiological experiments.", "full_text": "Learning Horizontal Connections in a Sparse Coding Model of Natural Images\n\nPierre J. Garrigues\nDepartment of EECS\nRedwood Center for Theoretical Neuroscience\nUniv. of California, Berkeley\nBerkeley, CA 94720\ngarrigue@eecs.berkeley.edu\n\nBruno A. Olshausen\nHelen Wills Neuroscience Inst.\nSchool of Optometry\nRedwood Center for Theoretical Neuroscience\nUniv. of California, Berkeley\nBerkeley, CA 94720\nbaolshausen@berkeley.edu\n\nAbstract\n\nIt has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields.
However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1 in terms of a prior over natural images.\n\n1 Introduction\n\nOver the last decade, mathematical explorations into the statistics of natural scenes have led to the observation that these scenes, as complex and varied as they appear, have an underlying structure that is sparse [1]. That is, one can learn a possibly overcomplete basis set such that only a small fraction of the basis functions is necessary to describe a given image, where the operation to infer this sparse representation is non-linear. This approach is known as sparse coding. Exploiting this structure has led to advances in our understanding of how information is represented in the visual cortex, since the learned basis set is a collection of oriented, Gabor-like filters that resemble the receptive fields in primary visual cortex (V1). The approach of using sparse coding to infer sparse representations of unlabeled data is useful for classification, as shown in the framework of self-taught learning [2]. Note that classification performance relies on finding “hard-sparse” representations where a few coefficients are nonzero while all the others are exactly zero.\n\nAn assumption of the sparse coding model is that the coefficients of the representation are independent.
However, in the case of natural images this assumption does not hold. For example, the coefficients corresponding to quadrature-pair or colinear Gabor filters are not independent. This has been shown and modeled in the early work of [3], as well as for the responses of model complex cells [4], feedforward responses of wavelet coefficients [5, 6, 7], and basis functions learned using independent component analysis [8, 9]. These dependencies are informative, and exploiting them leads to improvements in denoising performance [5, 7].\n\nWe develop here a generative model of image patches that does not make the independence assumption. The prior over the coefficients is a mixture of a Gaussian when the corresponding basis function is active, and a delta function centered at zero when it is silent, as in [10]. We model the binary variables or “spins” that control the activation of the basis functions with an Ising model, whose coupling weights model the dependencies among the coefficients. The representations inferred by this model are also “hard-sparse”, which is a desirable feature [2].\n\nOur model is motivated in part by the architecture of the visual cortex, namely the extensive network of horizontal connections among neurons in V1 [11]. It has been hypothesized that these connections facilitate contour integration [12] and are involved in computing border ownership [13]. In both of these models the connections are set a priori based on geometrical properties of the receptive fields. We propose here to learn the connection weights in an unsupervised fashion. We hope with our model to gain insight into the computations performed by this extensive collateral system, and we compare our findings to known physiological properties of these horizontal connections.
Furthermore, a recent trend in neuroscience is to model networks of neurons using Ising models, and this approach has been shown to predict remarkably well the statistics of groups of neurons in the retina [14]. Our model gives a prediction for what is expected if one fits an Ising model to future multi-unit recordings in V1.\n\n2 A non-factorial sparse coding model\n\nLet x ∈ R^n be an image patch, where the x_i's are the pixel values. We propose the following generative model:\n\nx = Φa + ν = ∑_{i=1}^m a_i φ_i + ν,\n\nwhere Φ = [φ_1 . . . φ_m] ∈ R^{n×m} is an overcomplete transform or basis set, and the columns φ_i are its basis functions. ν ∼ N(0, ε² I_n) is small Gaussian noise. Each coefficient a_i = ((s_i + 1)/2) u_i is a Gaussian scale mixture (GSM). We model the multiplier s with an Ising model, i.e. s ∈ {−1, 1}^m has a Boltzmann-Gibbs distribution p(s) = (1/Z) exp((1/2) s^T W s + b^T s), where Z is the normalization constant. If the spin s_i is down (s_i = −1), then a_i = 0 and the basis function φ_i is silent. If the spin s_i is up (s_i = 1), then the basis function is active and the analog value of the coefficient a_i is drawn from a Gaussian distribution with u_i ∼ N(0, σ_i²). The prior on a can thus be described as a “hard-sparse” prior, as it is a mixture of a point mass at zero and a Gaussian.\n\nThe corresponding graphical model is shown in Figure 1. It is a chain graph since it contains both undirected and directed edges. It bears similarities to [15], which however does not have the intermediate layer a and is not a sparse coding model.
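As a concrete illustration of the generative model just defined, here is a minimal sampling sketch (not the authors' code; all dimensions and parameter values are toy choices). For a small m the Ising prior can be sampled exactly by enumerating all 2^m spin states; the paper uses Gibbs sampling instead.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and parameters (hypothetical; the paper uses 16x16 patches,
# m = 256 basis functions, and learned W, b, sigma_i^2).
n, m = 16, 8
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)              # unit-norm basis functions
sigma2 = np.ones(m)                             # variances sigma_i^2
W = 0.1 * rng.standard_normal((m, m))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = -1.2 * np.ones(m)                           # negative biases favor silent units
eps = 0.01                                      # pixel noise std

# Step 1: sample s from the Ising prior p(s) ∝ exp((1/2) s^T W s + b^T s)
# by exact enumeration (feasible only for small m).
states = np.array(list(itertools.product([-1, 1], repeat=m)))
logp = 0.5 * np.einsum('ki,ij,kj->k', states, W, states) + states @ b
p = np.exp(logp - logp.max())
p /= p.sum()
s = states[rng.choice(len(states), p=p)]

# Step 2: a_i = ((s_i + 1)/2) u_i with u_i ~ N(0, sigma_i^2); silent units are 0.
a = (s + 1) / 2 * rng.normal(0.0, np.sqrt(sigma2))

# Step 3: x = Phi a + nu with nu ~ N(0, eps^2 I_n).
x = Phi @ a + eps * rng.standard_normal(n)
```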
To sample from this generative model, one first obtains a sample s from the Ising model, then samples coefficients a according to p(a | s), and then x according to p(x | a) ∼ N(Φa, ε² I_n).\n\nFigure 1: Proposed graphical model\n\nThe parameters of the model to be learned from data are θ = (Φ, (σ_i²)_{i=1..m}, W, b). This model does not make any assumption about which linear code Φ should be used, or about which units should exhibit dependencies. The matrix W of the interaction weights in the Ising model describes these dependencies. W_ij > 0 favors positive correlations and thus corresponds to an excitatory connection, whereas W_ij < 0 corresponds to an inhibitory connection. A local magnetic field b_i < 0 favors the spin s_i to be down, which in turn makes the basis function φ_i mostly silent.\n\n3 Inference and learning\n\n3.1 Coefficient estimation\n\nWe describe here how to infer the representation a of an image patch x in our model. To do so, we first compute the maximum a posteriori (MAP) multiplier s (see Section 3.2). Indeed, a GSM model reduces to a linear-Gaussian model conditioned on the multiplier s, and therefore the estimation of a is easy once s is known.\n\nGiven s = ŝ, let Γ = {i : ŝ_i = 1} be the set of active basis functions. We know that ∀i ∉ Γ, a_i = 0. Hence, we have x = Φ_Γ a_Γ + ν, where a_Γ = (a_i)_{i∈Γ} and Φ_Γ = [(φ_i)_{i∈Γ}]. The model thus reduces to linear-Gaussian, where a_Γ ∼ N(0, H = diag((σ_i²)_{i∈Γ})). We have a_Γ | x, ŝ ∼ N(μ, K), where K = (ε^{−2} Φ_Γ^T Φ_Γ + H^{−1})^{−1} and μ = ε^{−2} K Φ_Γ^T x. Hence, conditioned on x and ŝ, the Bayes Least-Squares (BLS) and maximum a posteriori (MAP) estimators of a_Γ are the same and given by μ.\n\n3.2 Multiplier estimation\n\nThe MAP estimate of s given x is given by ŝ = arg max_s p(s | x). Given s, x has a Gaussian distribution N(0, Σ), where Σ = ε² I_n + ∑_{i : s_i = 1} σ_i² φ_i φ_i^T. Using Bayes' rule, we can write p(s | x) ∝ p(x | s) p(s) ∝ e^{−E_x(s)}, where\n\nE_x(s) = (1/2) x^T Σ^{−1} x + (1/2) log det Σ − (1/2) s^T W s − b^T s.\n\nWe can thus compute the MAP estimate using Gibbs sampling and simulated annealing. In the Gibbs sampling procedure, the probability that node i changes its value from s_i to s̄_i given x, all the other nodes s_¬i, and at temperature T is given by\n\np(s_i → s̄_i | s_¬i, x) = (1 + exp(−ΔE_x / T))^{−1},\n\nwhere ΔE_x = E_x(s_i, s_¬i) − E_x(s̄_i, s_¬i). Note that computing E_x requires the inverse and the determinant of Σ, which is expensive. Let Σ̄ and Σ be the covariance matrices corresponding to the proposed state (s̄_i, s_¬i) and current state (s_i, s_¬i) respectively. They differ only by a rank-1 matrix, i.e. Σ̄ = Σ + α φ_i φ_i^T, where α = (1/2)(s̄_i − s_i) σ_i². Therefore, to compute ΔE_x we can take advantage of the Sherman-Morrison formula\n\nΣ̄^{−1} = Σ^{−1} − α Σ^{−1} φ_i (1 + α φ_i^T Σ^{−1} φ_i)^{−1} φ_i^T Σ^{−1} (1)\n\nand of a similar formula for the log det term\n\nlog det Σ̄ = log det Σ + log(1 + α φ_i^T Σ^{−1} φ_i). (2)\n\nUsing (1) and (2), ΔE_x can be written as\n\nΔE_x = (1/2) α (x^T Σ^{−1} φ_i)² / (1 + α φ_i^T Σ^{−1} φ_i) − (1/2) log(1 + α φ_i^T Σ^{−1} φ_i) + (s̄_i − s_i)(∑_{j≠i} W_ij s_j + b_i).\n\nThe transition probabilities can thus be computed efficiently, and if a new state is accepted we update Σ and Σ^{−1} using (1).\n\n3.3 Model estimation\n\nGiven a dataset D = {x^(1), . . . , x^(N)} of image patches, we want to learn the parameters θ = (Φ, (σ_i²)_{i=1..m}, W, b) that offer the best explanation of the data. Let p*(x) = (1/N) ∑_{i=1}^N δ(x − x^(i)) be the empirical distribution. Since in our model the variables a and s are latent, we use a variational expectation-maximization algorithm [16] to optimize θ, which amounts to maximizing a lower bound on the log-likelihood derived using Jensen's inequality:\n\nlog p(x | θ) ≥ ∑_s ∫_a q(a, s | x) log [p(x, a, s | θ) / q(a, s | x)] da,\n\nwhere q(a, s | x) is a probability distribution.
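Returning to the Gibbs sampler of Section 3.2, the rank-1 shortcut for ΔE_x can be checked numerically. The sketch below (hypothetical toy sizes and parameters, not the authors' code) compares the closed-form ΔE_x against a direct, expensive evaluation of E_x before and after each spin flip.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem (hypothetical sizes and parameters, for checking the algebra only).
n, m = 8, 6
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)
sigma2 = rng.uniform(0.5, 1.5, m)
W = 0.05 * rng.standard_normal((m, m)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = -0.5 * np.ones(m)
eps2 = 0.01
x = rng.standard_normal(n)

def covariance(s):
    """Sigma = eps^2 I_n + sum over active units of sigma_i^2 phi_i phi_i^T."""
    return eps2 * np.eye(n) + Phi @ np.diag(sigma2 * (s == 1)) @ Phi.T

def energy(s):
    """E_x(s) = (1/2) x^T Sigma^{-1} x + (1/2) log det Sigma - (1/2) s^T W s - b^T s."""
    Sigma = covariance(s)
    return (0.5 * x @ np.linalg.solve(Sigma, x)
            + 0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * s @ W @ s - b @ s)

def delta_E(s, i, Sigma_inv):
    """Closed-form Delta E_x for flipping spin i, via the rank-1 identities."""
    alpha = 0.5 * (-s[i] - s[i]) * sigma2[i]    # (1/2)(s_bar_i - s_i) sigma_i^2
    v = Sigma_inv @ Phi[:, i]
    c = 1.0 + alpha * Phi[:, i] @ v             # 1 + alpha phi_i^T Sigma^{-1} phi_i
    return (0.5 * alpha * (x @ v) ** 2 / c
            - 0.5 * np.log(c)
            + (-2 * s[i]) * (W[i] @ s + b[i]))  # diag(W) = 0, so W[i] @ s skips j = i

# Check against the direct evaluation for every possible spin flip.
s = np.where(rng.random(m) < 0.5, 1, -1)
Sigma_inv = np.linalg.inv(covariance(s))
for i in range(m):
    s_bar = s.copy()
    s_bar[i] = -s[i]
    assert np.isclose(energy(s) - energy(s_bar), delta_E(s, i, Sigma_inv))
```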
We restrict ourselves to the family of point-mass distributions Q = {q(a, s | x) = δ(a − â) δ(s − ŝ)}, and with this choice the lower bound on the log-likelihood of D can be written as\n\nL(θ, q) = E_{p*}[log p(x, â, ŝ | θ)] = E_{p*}[log p(x | â, Φ)] + E_{p*}[log p(â | ŝ, (σ_i²)_{i=1..m})] + E_{p*}[log p(ŝ | W, b)] = L_Φ + L_σ + L_{W,b}. (3)\n\nWe perform coordinate ascent in the objective function L(θ, q).\n\n3.3.1 Maximization with respect to q\n\nWe want to solve max_{q∈Q} L(θ, q), which amounts to finding arg max_{a,s} log p(x, a, s) for every x ∈ D. This is computationally expensive since s is discrete. Hence, we introduce two phases in the algorithm.\n\nIn the first phase, we infer the coefficients in the usual sparse coding model where the prior over a is factorial, i.e. p(a) = ∏_i p(a_i) ∝ ∏_i exp{−λ S(a_i)}. In this setting, we have\n\nâ = arg max_a p(x | a) ∏_i e^{−λ S(a_i)} = arg min_a (1/(2ε²)) ‖x − Φa‖₂² + λ ∑_i S(a_i). (4)\n\nWith S(a_i) = |a_i|, (4) is known as basis pursuit denoising (BPDN), whose solution has been shown to be such that many coefficients of â are exactly zero [17]. This allows us to recover the sparsity pattern ŝ, where ŝ_i = 2·1[â_i ≠ 0] − 1 ∀i. BPDN can be solved efficiently using a competitive algorithm [18].
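To illustrate how the sparsity pattern ŝ is read off from a BPDN solution, here is a minimal sketch using iterative soft-thresholding (ISTA), a standard BPDN solver, rather than the competitive algorithm of [18]; the problem sizes, λ, and the zero threshold are hypothetical toy choices.

```python
import numpy as np

def bpdn_ista(Phi, x, lam, eps2, n_iter=2000):
    """Minimize (1/(2 eps^2)) ||x - Phi a||_2^2 + lam * sum_i |a_i| by ISTA."""
    a = np.zeros(Phi.shape[1])
    L = np.linalg.norm(Phi, 2) ** 2 / eps2      # Lipschitz constant of the smooth term
    for _ in range(n_iter):
        a = a - (Phi.T @ (Phi @ a - x) / eps2) / L           # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0)  # soft threshold
    return a

# Toy demonstration (hypothetical sizes): a 3-sparse signal in a random dictionary.
rng = np.random.default_rng(0)
n, m = 32, 64
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)
a_true = np.zeros(m)
a_true[[3, 17, 40]] = [1.5, -2.0, 1.0]
x = Phi @ a_true

a_hat = bpdn_ista(Phi, x, lam=0.01, eps2=1.0)
s_hat = 2 * (np.abs(a_hat) > 1e-3) - 1          # sparsity pattern: s_i = 2*1[a_i != 0] - 1
```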
Another possible choice is S(a_i) = 1[a_i ≠ 0] (although p(a_i) is then not a proper prior), in which case (4) is combinatorial and can be solved approximately using orthogonal matching pursuit (OMP) [19].\n\nAfter several iterations of coordinate ascent and convergence of θ using the above approximation, we enter the second phase of the algorithm and refine θ by using the GSM inference described in Section 3.1, where ŝ = arg max p(s | x) and â = E[a | ŝ, x].\n\n3.3.2 Maximization with respect to θ\n\nWe want to solve max_θ L(θ, q). Our choice of variational posterior allowed us to write the objective function as the sum of the three terms L_Φ, L_σ and L_{W,b} (3), and hence to decouple the variables Φ, (σ_i²)_{i=1..m} and (W, b) of our optimization problem.\n\nMaximization of L_Φ. Note that L_Φ is the same objective function as in the standard sparse coding problem when the coefficients a are fixed. Let {â^(i), ŝ^(i)} be the coefficients and multipliers corresponding to x^(i). We have\n\nL_Φ = −(1/(2ε²)) ∑_{i=1}^N ‖x^(i) − Φ â^(i)‖₂² − (Nn/2) log 2πε².\n\nWe add the constraint that ‖φ_i‖₂ ≤ 1 to avoid the spurious solution where the norm of the basis functions grows and the coefficients tend to 0. We solve this ℓ₂-constrained least-squares problem using the Lagrange dual as in [20].\n\nMaximization of L_σ. The problem of estimating σ_i² is a standard variance estimation problem for a 0-mean Gaussian random variable, where we only consider the samples â_i such that the spin ŝ_i is equal to 1, i.e.\n\nσ_i² = (1 / card{k : ŝ_i^(k) = 1}) ∑_{k : ŝ_i^(k) = 1} (â_i^(k))².\n\nMaximization of L_{W,b}. This problem is tantamount to estimating the parameters of a fully visible Boltzmann machine [21], which is a convex optimization problem.
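Stepping back to the maximization of L_σ above: the variance update is a one-liner in practice. The sketch below uses hypothetical arrays A_hat and S_hat standing in for the inferred coefficients and multipliers over N patches (here generated synthetically with true variance 4).

```python
import numpy as np

# Synthetic stand-ins (hypothetical): S_hat holds inferred multipliers in {-1, +1}
# and A_hat the inferred coefficients for N patches and m units; active
# coefficients are drawn with true variance 4.
rng = np.random.default_rng(0)
N, m = 1000, 4
S_hat = np.where(rng.random((N, m)) < 0.3, 1, -1)
A_hat = np.where(S_hat == 1, rng.normal(0.0, 2.0, (N, m)), 0.0)

# sigma_i^2 = mean of (a_hat_i)^2 over the patches whose spin s_hat_i is up.
active = S_hat == 1
sigma2 = np.array([np.mean(A_hat[active[:, i], i] ** 2) for i in range(m)])
# Each entry of sigma2 should be close to the generating variance of 4.
```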
We do gradient ascent in L_{W,b}, where the gradients are given by ∂L_{W,b}/∂W_ij = −E_{p*}[s_i s_j] + E_p[s_i s_j] and ∂L_{W,b}/∂b_i = −E_{p*}[s_i] + E_p[s_i]. We use Gibbs sampling to obtain estimates of E_p[s_i s_j] and E_p[s_i].\n\nNote that since computing the parameters (â, ŝ) of the variational posterior in phase 1 only depends on Φ, we first perform several steps of coordinate ascent in (Φ, q) until Φ has converged, which is the same as in the usual sparse coding algorithm. We then maximize L_σ and L_{W,b}, and after that we enter the second phase of the algorithm.\n\n4 Recovery of the model parameters\n\nAlthough the learning algorithm relies on a method where the family of variational posteriors q(a, s | x) is quite limited, we argue here that if data D = {x^(1), . . . , x^(N)} is sampled according to parameters θ_0 that obey certain conditions that we describe now, then our proposed learning algorithm is able to recover θ_0 with good accuracy using phase 1 only.\n\nLet η be the coherence parameter of the basis set, which equals the maximum absolute inner product between two distinct basis functions. It has been shown that given a signal that is a sparse linear combination of p basis functions, BP and OMP will identify the optimal basis functions and their coefficients provided that p < (1/2)(η^{−1} + 1), and the sparsest representation of the signal is unique [19]. Similar results can be derived when noise is present (ε > 0) [22], but we restrict ourselves to the noiseless case for simplicity. Let ‖s‖↑ be the number of spins that are up. We require (W_0, b_0) to be such that Pr(‖s‖↑ < (1/2)(η^{−1} + 1)) ≈ 1, which can be enforced by imposing strong negative biases. A data point x^(i) ∈ D thus has a high probability of yielding a unique sparse representation in the basis set Φ.
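The gradient ascent on L_{W,b} can be illustrated on a model small enough that the model expectations E_p[s_i s_j] and E_p[s_i] are computed exactly by enumeration instead of Gibbs sampling (a simplification for this sketch; all parameter values are hypothetical).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
m = 5                                           # toy size so moments are exact

# Hypothetical ground-truth Ising parameters playing the role of the data.
W0 = 0.3 * rng.standard_normal((m, m)); W0 = (W0 + W0.T) / 2; np.fill_diagonal(W0, 0.0)
b0 = -0.4 * np.ones(m)

states = np.array(list(itertools.product([-1, 1], repeat=m)), dtype=float)

def moments(W, b):
    """Exact E[s s^T] and E[s] under p(s) ∝ exp((1/2) s^T W s + b^T s)."""
    logp = 0.5 * np.einsum('ki,ij,kj->k', states, W, states) + states @ b
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return (states.T * p) @ states, p @ states

# "Data" moments: here the exact moments of the true model (in the paper they
# are empirical averages over the inferred spins).
Ess0, Es0 = moments(W0, b0)

# Gradient ascent: dL/dW_ij = E_data[s_i s_j] - E_model[s_i s_j], similarly for b.
W = np.zeros((m, m)); b = np.zeros(m); lr = 0.1
for _ in range(5000):
    Ess, Es = moments(W, b)
    W += lr * (Ess0 - Ess)
    np.fill_diagonal(W, 0.0)
    b += lr * (Es0 - Es)
# The log-likelihood is concave and the moments identify the parameters,
# so (W, b) converges to (W0, b0).
```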
Provided that we have a good estimate of Φ, we can recover its sparse representation using OMP or BP, and therefore identify the s^(i) that was used to originally sample x^(i). That is, we recover with high probability all the samples from the Ising model used to generate D, which allows us to recover (W_0, b_0).\n\nWe provide for illustration a simple example of model recovery where n = 7 and m = 8. Let (e_1, . . . , e_7) be an orthonormal basis in R^7. We let Φ_0 = [e_1, . . . , e_7, (1/√7) ∑_i e_i]. We fix the biases b_0 at −1.2 such that the model is sufficiently sparse, as shown by the histogram of ‖s‖↑ in Figure 2, and the weights W_0 are sampled according to a Gaussian distribution. The variance parameters σ_0 are fixed to 1. We then generate synthetic data by sampling 100000 data points from this model using θ_0. We then estimate θ from this synthetic data using the variational method described in Section 3, using OMP and phase 1 only. We found that the basis functions are recovered exactly (not shown), and that the parameters of the Ising model are recovered with high accuracy, as shown in Figure 2.\n\nFigure 2: Recovery of the model. The histogram of ‖s‖↑ is such that the model is sparse. The parameters (W, b) learned from synthetic data are close to the parameters (W_0, b_0) from which this data was generated.\n\n5 Results for natural images\n\nWe build our training set by randomly selecting 16 × 16 image patches from a standard set of 10 512 × 512 whitened images as in [1].
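A minimal sketch of such a training-set construction, including the contrast normalization described next (synthetic stand-ins for the whitened images; the variance threshold and patch count are hypothetical choices):

```python
import numpy as np

def extract_patches(images, patch=16, n_patches=1000, var_thresh=1e-3, seed=0):
    """Randomly crop patches and contrast-normalize each one by subtracting its
    mean and dividing by its standard deviation; near-flat patches are skipped.
    var_thresh is a hypothetical choice for the variance threshold."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n_patches:
        img = images[rng.integers(len(images))]
        r = rng.integers(img.shape[0] - patch)
        c = rng.integers(img.shape[1] - patch)
        p = img[r:r + patch, c:c + patch].astype(float).ravel()
        if p.var() < var_thresh:
            continue                             # discard low-variance patches
        out.append((p - p.mean()) / p.std())
    return np.array(out)

# Synthetic stand-ins for the 512x512 whitened images (hypothetical data).
imgs = [np.random.default_rng(k).standard_normal((512, 512)) for k in range(2)]
X = extract_patches(imgs, n_patches=100)         # 100 contrast-normalized patches
```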
It has been shown that changes in luminance or contrast have little influence on the structure of natural scenes [23]. As our goal is to uncover this structure, we subtract from each patch its own mean and divide it by its standard deviation so that our dataset is contrast-normalized (we do not consider the patches whose variance is below a small threshold). We fix the number of basis functions to 256. In the second phase of the algorithm we only update Φ, and we have found that the basis functions do not change dramatically after the first phase. Figure 3 shows the learned parameters Φ, σ and b. The basis functions resemble Gabor filters at a variety of orientations, positions and scales. We show the weights W in Figure 4 according to the spatial properties (position, orientation, length) of the basis functions that are linked together by them. Each basis function is denoted by a bar that indicates its position, orientation, and length within the 16 × 16 patch.\n\nFigure 3: On the left is shown the entire set of basis functions Φ learned on natural images. On the right are the learned variances (σ_i²)_{i=1..m} (top) and the biases b in the Ising model (bottom).\n\nFigure 4 (panels: (a) 10 most positive weights, (b) 10 most negative weights, (c) weights visualization, (d) association fields): (a) (resp. (b)) shows the basis function pairs that share the strongest positive (resp. negative) weights, ordered from left to right. Each subplot in (d) shows the association field for a basis function φ_i whose position and orientation are denoted by the black bar.
The horizontal connections (W_ij)_{j≠i} are displayed by a set of colored bars whose orientation and position denote those of the basis functions φ_j to which they correspond, and the color denotes the connection strength (see (c)). We show a random selection of 36 association fields; see www.eecs.berkeley.edu/~garrigue/nips07.html for the whole set.\n\nWe observe that the connections are mainly local and connect basis functions at a variety of orientations. The histogram of the weights (see Figure 5) shows a long positive tail corresponding to a bias toward facilitatory connections. We can see in Figure 4a,b that the 10 most “positive” pairs have similar orientations, whereas the majority of the 10 most “negative” pairs have dissimilar orientations. We compute for each basis function the average number of basis functions sharing with it a weight larger than 0.01 as a function of their orientation difference in four bins, which we refer to as the “orientation profile” in Figure 5. The error bars are one standard deviation. The resulting orientation profile is consistent with what has been observed in physiological experiments [24, 25].\n\nWe also show in Figure 5 the tradeoff between the signal-to-noise ratio (SNR) of an image patch x and its reconstruction Φâ, and the ℓ₀ norm of the representation ‖â‖₀. We consider â inferred using both the Laplacian prior and our proposed prior. We vary λ (see Equation (4)) and ε respectively, and average over 1000 patches to obtain the two tradeoff curves. We see that at similar SNR the representations inferred by our model are sparser by about a factor of 2, which bodes well for compression. We have also compared our prior on tasks such as denoising and filling-in, and have found its performance to be similar to that of the factorial Laplacian prior even though it does not exploit the dependencies of the code.
One possible explanation is that the greater sparsity of our inferred representations makes them less robust to noise. Thus we are currently investigating whether this property may instead have advantages in the self-taught learning setting in improving classification performance.\n\nFigure 5: Properties of the weight matrix W (coupling weights histogram, correlation between W and Φ^T Φ, and orientation profile) and comparison of the tradeoff curve SNR - ℓ₀ norm between a Laplacian prior over the coefficients and our proposed prior.\n\nTo assess how much information is captured by the second-order statistics, we isolate a group (φ_i)_{i∈Λ} of 10 basis functions sharing strong weights. Given a collection of image patches that we sparsify using (4), we obtain a number of spins (ŝ_i)_{i∈Λ} from which we can estimate the empirical distribution p_emp, the Boltzmann-Gibbs distribution p_Ising consistent with first- and second-order correlations, and the factorial distribution p_fact (i.e. no horizontal connections) consistent with first-order correlations.
We can see in Figure 6 that the Ising model produces better estimates of the empirical distribution, and results in better coding efficiency since KL(p_emp || p_Ising) = .02 whereas KL(p_emp || p_fact) = .1.\n\nFigure 6: Model validation for a group of 10 basis functions (right). The empirical probabilities of the 2^10 patterns of activation are plotted against the probabilities predicted by the Ising model (red), the factorial model (blue), and their own values (black). The patterns having exactly three spins up are circled. The prediction of the Ising model is noticeably better than that of the factorial model.\n\n6 Discussion\n\nIn this paper, we proposed a new sparse coding model in which we include pairwise coupling terms among the coefficients to capture their dependencies. We derived a new learning algorithm to adapt the parameters of the model given a dataset of natural images, and we were able to discover the dependencies among the basis function coefficients. We showed that the learned connection weights are consistent with physiological data. Furthermore, the representations inferred in our model have greater sparsity than when they are inferred using the Laplacian prior as in the standard sparse coding model. Note, however, that we have not found evidence that these horizontal connections facilitate contour integration, as they do not primarily connect colinear basis functions. Previous models in the literature simply assume these weights according to prior intuitions about the function of horizontal connections [12, 13].
It is of great interest to develop new models and unsupervised learning schemes, possibly involving attention, that will help us understand the computational principles underlying contour integration in the visual cortex.\n\nReferences\n\n[1] B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607-609, June 1996.\n\n[2] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning: Transfer learning from unlabeled data. Proceedings of the Twenty-fourth International Conference on Machine Learning, 2007.\n\n[3] G. Zetzsche and B. Wegmann. The atoms of vision: Cartesian or polar? J. Opt. Soc. Am., 16(7):1554-1565, 1999.\n\n[4] P. Hoyer and A. Hyvärinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42:1593-1605, 2002.\n\n[5] M.J. Wainwright, E.P. Simoncelli, and A.S. Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89-123, July 2001.\n\n[6] O. Schwartz, T.J. Sejnowski, and P. Dayan. Soft mixer assignment in a hierarchical generative model of natural scene statistics. Neural Computation, 18(11):2680-2718, November 2006.\n\n[7] S. Lyu and E.P. Simoncelli. Statistical modeling of images with fields of Gaussian scale mixtures. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, 2006.\n\n[8] A. Hyvärinen, P.O. Hoyer, J. Hurri, and M. Gutmann. Statistical models of images and early vision. Proceedings of the Int. Symposium on Adaptive Knowledge Representation and Reasoning (AKRR2005), Espoo, Finland, 2005.\n\n[9] Y. Karklin and M.S. Lewicki. A hierarchical Bayesian model for learning non-linear statistical regularities in non-stationary natural signals. Neural Computation, 17(2):397-423, 2005.\n\n[10] B.A. Olshausen and K.J. Millman. Learning sparse codes with a mixture-of-Gaussians prior. Advances in Neural Information Processing Systems, 12, 2000.\n\n[11] D. Fitzpatrick. The functional organization of local circuits in visual cortex: insights from the study of tree shrew striate cortex. Cerebral Cortex, 6:329-341, 1996.\n\n[12] O. Ben-Shahar and S. Zucker. Geometrical computations explain projection patterns of long-range horizontal connections in visual cortex. Neural Computation, 16(3):445-476, March 2004.\n\n[13] L. Zhaoping. Border ownership from intracortical interactions in visual area V2. Neuron, 47:143-153, 2005.\n\n[14] E. Schneidman, M.J. Berry, R. Segev, and W. Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, April 2006.\n\n[15] G. Hinton, S. Osindero, and K. Bao. Learning causally linked Markov random fields. Artificial Intelligence and Statistics, Barbados, 2005.\n\n[16] M.I. Jordan, Z. Ghahramani, T. Jaakkola, and L.K. Saul. An introduction to variational methods for graphical models. Learning in Graphical Models, Cambridge, MA: MIT Press, 1999.\n\n[17] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129-159, 2001.\n\n[18] C.J. Rozell, D.H. Johnson, R.G. Baraniuk, and B.A. Olshausen. Neurally plausible sparse coding via competitive algorithms. In Proceedings of the Computational and Systems Neuroscience (Cosyne) meeting, Salt Lake City, UT, February 2007.\n\n[19] J.A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231-2242, 2004.\n\n[20] H. Lee, A. Battle, R. Raina, and A.Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems 19, pages 801-808. MIT Press, Cambridge, MA, 2007.\n\n[21] D.H. Ackley, G.E. Hinton, and T.J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147-169, 1985.\n\n[22] J.A. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030-1051, 2006.\n\n[23] Z. Wang, A.C. Bovik, and E.P. Simoncelli. Structural approaches to image quality assessment. In Alan Bovik, editor, Handbook of Image and Video Processing, 2nd edition, chapter 8.3, pages 961-974. Academic Press, May 2005.\n\n[24] R. Malach, Y. Amir, M. Harel, and A. Grinvald. Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocytin injections in primate striate cortex. Proc. Natl. Acad. Sci. U.S.A., 82:935-939, 1993.\n\n[25] W. Bosking, Y. Zhang, B. Schofield, and D. Fitzpatrick. Orientation selectivity and the arrangement of horizontal connections in the tree shrew striate cortex. J. Neuroscience, 17(6):2112-2127, 1997.", "award": [], "sourceid": 375, "authors": [{"given_name": "Pierre", "family_name": "Garrigues", "institution": null}, {"given_name": "Bruno", "family_name": "Olshausen", "institution": null}]}