{"title": "Whence Sparseness?", "book": "Advances in Neural Information Processing Systems", "page_first": 180, "page_last": 186, "abstract": null, "full_text": "Whence Sparseness? \n\nC. van Vreeswijk \n\nGatsby Computational Neuroscience Unit \n\nUniversity College London \n\n17 Queen Square, London WC1N 3AR, United Kingdom \n\nAbstract \n\nIt has been shown that the receptive fields of simple cells in V1 can be explained by assuming optimal encoding, provided that an extra constraint of sparseness is added. This finding suggests that there is a reason, independent of optimal representation, for sparseness. However, this work used an ad hoc model for the noise. Here I show that, if a biologically more plausible noise model, describing neurons as Poisson processes, is used, sparseness does not have to be added as a constraint. Thus I conclude that sparseness is not a feature that evolution has striven for, but is simply the result of the evolutionary pressure towards an optimal representation. \n\n1 Introduction \n\nRecently there has been a resurgence of interest in using optimal coding strategies to 'explain' the response properties of neurons in the primary sensory areas [1]. Notably, this approach was used by Olshausen and Field [2] to infer the receptive fields of simple cells in the primary visual cortex. To arrive at the correct results, however, they had to add sparseness of activity as an extra constraint. Others have shown that similar results are obtained if one assumes that the neurons represent independent components of natural stimuli [3]. The fact that these studies need to impose an extra constraint suggests strongly that the subsequent processing of the stimuli uses either sparseness or independence of the neuronal activity. It is therefore highly important to determine whether these constraints are really necessary. 
\nHere it will be argued that the necessity of the sparseness constraint in these models is due to modeling the noise in the system incorrectly. Modeling the noise in a biologically more plausible way leads to a representation of the input in which the sparseness of the activity follows naturally from the optimality of the representation. \n\n2 Gaussian Noise \n\nSeveral approaches have been used to find an output that represents the input optimally, for example, minimizing the square difference between the input and its reconstruction. In this paper I will concentrate on a different definition of optimality: I require that the mutual information between the input and output is maximized. If the number of output units is at least equal to the dimensionality of the input space, a perfect reconstruction of the input is possible, unless there is noise in the system. So for an (over)complete representation, optimal encoding only makes sense in the presence of noise. It is important to note that the optimal solution depends on the noise model that is used, even if one takes the limit where the noise goes to zero. Thus it is important to have an adequate noise model. \n\nMost optimal coding schemes describe the neuronal output by an input-dependent mean to which Gaussian noise is added. This is, roughly speaking, also the implicit assumption in an optimization procedure in which the mean square reconstruction error is minimized, but it is also often used explicitly when the mutual information is maximized. It is instructive to see, in the latter case, why one needs to impose extra constraints to obtain unambiguous results. Assume the input $s$ has dimension $N_i$ and is drawn from a distribution $p(s)$. 
There are $N_o \\geq N_i$ output neurons whose rates $r$ satisfy \n\n$$ r = W s + \\sigma \\epsilon, $$ \n\n(1) \n\nwhere $\\epsilon$ is an $N_o$-dimensional unit-variance Gaussian with zero mean, $p_\\epsilon(\\epsilon) = (2\\pi)^{-N_o/2} \\exp(-\\epsilon^T \\epsilon / 2)$ (the superscript $T$ denotes the transpose). The task is to find the $N_o \\times N_i$ matrix $W_m$ that maximizes the mutual information $I_M$ between $r$ and $s$, defined by [4] \n\n$$ I_M(r, s) = \\int dr \\int ds \\, p(r, s) \\left\\{ \\log[p(r|s)] - \\log\\left[ \\int ds' \\, p(r, s') \\right] \\right\\}. $$ \n\n(2) \n\nHere $p(r, s)$ is the joint probability distribution of $r$ and $s$, and $p(r|s)$ is the conditional probability of $r$ given $s$. It is immediately clear that replacing $W$ by $cW$ with $c > 1$ increases the mutual information by effectively reducing the noise by a factor $1/c$. Thus maximal mutual information is obtained as the rates become infinite. To get sensible results, a constraint therefore has to be placed on the rates. A natural constraint in this framework is a constraint on the average square rates, $\\langle r^T r \\rangle = N_o R_0^2$. Here I have used $\\langle \\ldots \\rangle$ to denote the average over the noise and inputs, and $R_0^2 > \\sigma^2$ is the mean square rate. \n\nUnder this constraint, however, the optimal solution is still vastly degenerate. Namely, if $W_m$ is a matrix that gives the maximum mutual information, then for any unitary matrix $U$ ($U^T U = 1$), $U W_m$ will also maximize $I_M$. This is straightforward to show. For $r = W_m s + \\sigma \\epsilon$ the mutual information is given by \n\n$$ I_M(r, s; W_m) = \\int dr \\int ds \\, p(s) \\, p_\\epsilon\\left( \\frac{r - W_m s}{\\sigma} \\right) \\left\\{ \\log\\left[ p_\\epsilon\\left( \\frac{r - W_m s}{\\sigma} \\right) \\right] - \\log\\left[ \\int ds' \\, p(s') \\, p_\\epsilon\\left( \\frac{r - W_m s'}{\\sigma} \\right) \\right] \\right\\}, $$ \n\n(3) \n\nwhere I have used $I_M(r, s; W)$ to denote the mutual information when the matrix $W$ is used. In the case where $r$ satisfies $r = U W_m s + \\sigma \\epsilon$, the mutual information is given by equation 3 with $W_m$ replaced by $U W_m$. Changing variables from $r$ to $r' = U^T r$ and using $|\\det(U)| = 1$, this can be rewritten as \n\n$$ \\int dr' \\int ds \\, p(s) \\, p_\\epsilon\\left( U \\frac{r' - W_m s}{\\sigma} \\right) \\left\\{ \\log\\left[ p_\\epsilon\\left( U \\frac{r' - W_m s}{\\sigma} \\right) \\right] - \\log\\left[ \\int ds' \\, p(s') \\, p_\\epsilon\\left( U \\frac{r' - W_m s'}{\\sigma} \\right) \\right] \\right\\}. $$ 
\n\n(4) \n\nBecause $p_\\epsilon(\\epsilon)$ is a function of $\\epsilon^T \\epsilon$ only, $p_\\epsilon(U\\epsilon) = p_\\epsilon(\\epsilon)$, and therefore $I_M(r, s; U W_m) = I_M(r, s; W_m)$. In other words, because we have assumed a model in which the noise is described by independent Gaussians, or more generally one in which the distribution of the noise $\\epsilon$ is a function of $\\epsilon^T \\epsilon$ only, the mutual information is invariant to unitary transformations of the output. Clearly, this degeneracy is a result of the particular choice of the noise statistics and is unlikely to survive when we try to account for biologically observed noise more accurately. In the latter case it may well happen that the degeneracy is broken in such a way that maximizing the mutual information with a constraint on the average rates is itself sufficient to obtain a sparse representation. \n\n3 Poisson Noise \n\nTo obtain a robust insight into this issue, it is important that the system can be treated analytically. The desire for biological plausibility of the system should therefore be balanced by the desire to keep it simple. Ubiquitous features found in electrophysiological experiments that are likely to be of importance are (see for example [5]): i) Neurons transmit information through spikes. ii) Consecutive inter-spike intervals are at best weakly correlated. iii) With repeated presentation of the stimulus, the variance in the number of spikes a neuron emits over a given period varies nearly linearly with the mean number of emitted spikes. \n\nA simple model that captures all these features of the biological system is the Poisson process [6]. I will thus consider a system in which the neurons are described by such a process. The general model is as follows: The inputs are given by an $N_i$-dimensional vector $s$ drawn from a distribution $p(s)$. These give rise to $N_o$ inputs $u$ into the cells, which satisfy $u = W s$, where $W$ is the coupling matrix. The inputs $u$ are transformed into rates through a transfer function $g$, $r_i = g(u_i)$. 
The output of the network is observed for a time $T$. Optimal encoding of the input is defined by the maximal mutual information between the spikes the neurons emit and the stimulus. Let $n_i$ be the total number of spikes for neuron $i$ and $n$ the $N_o$-dimensional array of spike counts; then $p(n|r) = \\prod_i (r_i T)^{n_i} \\exp(-r_i T) / n_i!$. Optimal coding is achieved by determining $W_m$ such that \n\n$$ W_m = \\mathrm{argmax}_W \\left( I_M(n, s; W) \\right). $$ \n\n(5) \n\nAs before, there is need for a constraint on $W$ so that solutions with infinite rates are excluded. Whereas with Gaussian channels fixing the mean square rates is the most natural choice for the constraint, for Poissonian neurons it is more natural to fix the mean number of emitted spikes, $\\sum_i \\langle r_i \\rangle = N_o R_0$. By rescaling time we can, without loss of generality, assume that $R_0 = 1$. \n\n4 A Simple Example \n\nThe simplest example in which we can consider whether such systems will lead to a sparse representation is a system with a single neuron and a one-dimensional input, which is uniformly distributed between 0 and 1. Assume that the unit has output rate $r = 0$ when the input satisfies $s < 1 - p$ and rate $1/p$ if $s > 1 - p$. Because the neuron is either 'on' or 'off', maximal information about its state can be obtained by checking whether it fired one or more spikes or did not fire at all in the time-window over which the neuron was observed. If the neuron is 'on', the probability that it does not spike in a time $T$ is $\\exp(-T/p)$; otherwise it is 1. Thus the probability distribution is \n\n$$ p(0, s) = 1 - (1 - e^{-T/p}) \\Theta(s - 1 + p), \\qquad p(1^+, s) = (1 - e^{-T/p}) \\Theta(s - 1 + p), $$ \n\n(6) \n\nwhere I have used $p(1^+, s)$ to denote the probability of 1 or more spikes and an input $s$. \n\nFigure 1: $p_m$, the value of $p$ that maximizes the mutual information as function of the measuring time-window $T$. \n\nThe mutual information satisfies \n\n$$ I_M(n, s; p) = p e^{-T/p} \\log(e^{-T/p}) - p (1 - e^{-T/p}) \\log(p) - \\left( 1 - p (1 - e^{-T/p}) \\right) \\log\\left( 1 - p (1 - e^{-T/p}) \\right). $$ \n\n(7) \n\nFigure 1 shows $p_m$, the value of $p$ that maximizes the mutual information, as a function of the time $T$ over which the neuron is observed. For small $T$, $p_m$ is small; this reflects the fact that the reliability of the neuron is increased if the rate in the 'on' state ($1/p$) is made maximal. For large $T$, $p_m$ approaches 1/2, the value for which the entropy of the output rate $r$ is maximized. We thus see a trade-off between the reliability, which wants to make $p$ as small as possible, and the capacity, which pushes $p$ to 1/2. For time intervals that are smaller than or on the order of the mean inter-spike interval, the former dominates and leads to an optimal solution in which the neuron is, with a high probability, quiet, or, with a low probability, fires vigorously. Thus in this system the neurons fire sparsely if the measuring time is sufficiently short. \n\n5 A More Interesting Example \n\nSomewhat closer to the problem of optimal encoding in V1, but still tractable, is the following example. A two-dimensional input $s$ is drawn from a distribution $p(s)$ given by \n\n$$ p(s_1, s_2) \\propto \\delta(s_1) e^{-|s_2|/2} + e^{-|s_1|/2} \\delta(s_2). $$ \n\n(8) \n\nThis input is encoded by four neurons; the inputs into these neurons are given by \n\n$$ \\begin{pmatrix} u_1 \\\\ u_2 \\end{pmatrix} = \\frac{1}{|\\cos(\\phi)| + |\\sin(\\phi)|} \\begin{pmatrix} \\cos(\\phi) & \\sin(\\phi) \\\\ -\\sin(\\phi) & \\cos(\\phi) \\end{pmatrix} \\begin{pmatrix} s_1 \\\\ s_2 \\end{pmatrix}, $$ \n\n(9) \n\n$u_3 = -u_1$, and $u_4 = -u_2$. The rates $r_i$ satisfy $r_i = (u_i)_+ \\equiv (u_i + |u_i|)/2$, the threshold linear function. Due to the symmetry of the problem, rotation by a multiple of $\\pi/2$ leads to the same rates, up to a permutation. Thus we can restrict ourselves to $0 \\leq \\phi < \\pi/2$. 
\nIt is straightforward to show that $\\sum_i \\langle n_i \\rangle = 4$, and that sparseness of the activity, here defined by $\\sum_i (\\langle n_i^2 \\rangle - \\langle n_i \\rangle^2) / \\langle n_i \\rangle^2$, has its minimum for $\\phi = \\pi/4$, and \n\nFigure 2: Mutual information $I_M$ as function of $\\phi$. \n\nThe time-window $T$ was chosen to be $T = 0.5$, and $\\beta$ was gradually increased from its initial value of $\\beta = 1$ to $\\beta = 10$. The coupling matrix was updated using $\\epsilon = 10^{-4}$, and $L = 10$. \n\nFigure 3 shows some of the receptive fields that were obtained after the system had approached the fixed point, i.e. the running average of the mutual information no longer increased. These receptive fields look rather similar to those obtained from simple cells in the striate cortex. However, a more thorough analysis of these receptive fields and the sparseness of the rate distribution still has to be undertaken. \n\nFigure 3: Forty examples of receptive fields that show clear Gabor-like structure. \n\n7 Discussion \n\nI have shown why optimal encoding using Gaussian channels naturally leads to highly degenerate solutions, necessitating extra constraints. Using the biologically more plausible noise model, which describes the neurons by Poisson processes, naturally leads to a sparse representation when optimal encoding is enforced, for two analytically tractable models. For a model of the striate cortex, Poisson statistics also lead to a network in which the receptive fields of the neurons mimic those of V1 simple cells, without the need to impose sparseness. 
This leads to the conclusion that sparseness is not an independent constraint that is imposed by evolutionary pressure, but rather is a consequence of optimal encoding. \n\nReferences \n\n[1] Baddeley, R., Hancock, P., and Foldiak, P. (eds.) (2000) Information Theory and the Brain. (Cambridge University Press, Cambridge). \n\n[2] Olshausen, B.A. and Field, D.J. (1996) Nature 381:607; (1998) Vision Research 37:3311. \n\n[3] Bell, A.J. and Sejnowski, T.J. (1997) Vision Res. 37:3327; van Hateren, J.H. and van der Schaaf, A. (1998) Proc. R. Soc. Lond. 265:359. \n\n[4] Cover, T.M. and Thomas, J.A. (1991) Information Theory (Wiley and Sons, New York). \n\n[5] Richmond, B.J., Optican, L.M., and Spitzer, H. (1990) J. Neurophysiol. 64:351; Rolls, E.T., Critchley, H.D., and Treves, A. (1996) J. Neurophysiol. 75:1982; Dean, A.F. (1981) Exp. Brain Res. 44:437. \n\n[6] Smith, W.L. (1951) Biometrika 46:1. \n\n[7] van Hateren, J.H. and van der Schaaf, A. (1998) Proc. R. Soc. Lond. B 265:359-366. \n\n", "award": [], "sourceid": 1884, "authors": [{"given_name": "Carl", "family_name": "van Vreeswijk", "institution": null}]}