{"title": "Statistically Efficient Estimations Using Cortical Lateral Connections", "book": "Advances in Neural Information Processing Systems", "page_first": 97, "page_last": 103, "abstract": null, "full_text": "Statistically Efficient Estimation Using Cortical Lateral Connections \n\nAlexandre Pouget \nalex@salk.edu \n\nKechen Zhang \nzhang@salk.edu \n\nAbstract \n\nCoarse codes are widely used throughout the brain to encode sensory and motor variables. Methods designed to interpret these codes, such as population vector analysis, are either inefficient, i.e., the variance of the estimate is much larger than the smallest possible variance, or biologically implausible, like maximum likelihood. Moreover, these methods attempt to compute a scalar or vector estimate of the encoded variable. Neurons are faced with a similar estimation problem. They must read out the responses of the presynaptic neurons, but, by contrast, they typically encode the variable with a further population code rather than as a scalar. We show how a non-linear recurrent network can be used to perform this estimation in an optimal way while keeping the estimate in a coarse code format. This work suggests that lateral connections in the cortex may be involved in cleaning up uncorrelated noise among neurons representing similar variables. \n\n1 Introduction \n\nMost sensory and motor variables in the brain are encoded with coarse codes, i.e., through the activity of large populations of neurons with broad tuning to the variables. For instance, direction of visual motion is believed to be encoded in visual area MT by the responses of a large number of cells with bell-shaped tuning, as illustrated in figure 1-A.
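Such a coarse code is easy to simulate. Below is a minimal sketch of a population of 64 direction-tuned units responding to a 180° stimulus; the circular normal tuning curves and gaussian noise follow the forms used later in the Methods (equation (4)), while numpy and the random seed are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

n_units = 64
# Preferred directions uniformly spread over [0, 2*pi)
theta_pref = np.linspace(0.0, 2 * np.pi, n_units, endpoint=False)

def tuning(theta):
    """Circular normal mean responses (constants as in equation (4))."""
    return 3.0 * np.exp(7.0 * (np.cos(theta - theta_pref) - 1.0)) + 0.3

theta_true = np.pi                                      # stimulus at 180 degrees
mean_rates = tuning(theta_true)                         # smooth hill of activity
activity = mean_rates + rng.normal(0.0, 1.0, n_units)   # noisy hill, sigma = 1
```

Plotting `activity` against `theta_pref` reproduces the kind of noisy hill shown in figure 1-B.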
\n\nNeurophysiological recordings have shown that, in response to an object moving along a particular direction, the pattern of activity across such a population would look like a noisy hill of activity (figure 1-B). On the basis of this activity, A, the best that can be done is to recover the conditional probability of the direction of motion, θ, given the activity, p(θ|A). A slightly less ambitious goal is to come up with a good \"guess\", or estimate, θ̂, of the direction, θ, given the activity. Because of the stochastic nature of the noise, the estimator is a random variable, i.e., for the same image, θ̂ will vary from trial to trial. A good estimator should have the smallest possible variance across those trials because the variance determines how well two similar directions can be discriminated using this estimator. The Cramér-Rao bound provides an analytical lower bound for this variance given the noise in the system and the unit tuning curves [5]. Typically, computationally simple estimators, such as the optimum linear estimator (OLE), are very inefficient; their variances are several times the bound. \n\nAP is at the Institute for Computational and Cognitive Sciences, Georgetown University, Washington, DC 20007 and KZ is at The Salk Institute, La Jolla, CA 92037. This work was funded by McDonnell-Pew and the Howard Hughes Medical Institute. \n\nFigure 1: A- Tuning curves for 16 direction tuned neurons. B- Noisy pattern of activity (o) from 64 neurons when presented with a direction of 180°. The ML estimate is found by moving an \"expected\" hill of activity (dotted line) until the squared distance with the data is minimized (solid line).
By contrast, Bayesian or maximum likelihood (ML) estimators (which are equivalent for the case under consideration in this paper) can reach this bound but require more complex calculations [5]. \n\nThese decoding techniques are valuable for a neurophysiologist interested in reading out the population code, but they are not directly relevant for understanding how neural circuits perform estimation. In particular, they all provide the estimate in a format which is incompatible with what we know of sensory representations in the cortex. For example, cells in V4 are estimating orientation from the noisy responses of orientation tuned V1 cells, but, unlike ML or OLE which provide a scalar estimate, V4 neurons retain orientation in a coarse code format, as demonstrated by the fact that V4 cells are just as broadly tuned to orientation as V1 neurons. \n\nTherefore, it seems that a theory of estimation in biological networks should have two critical characteristics: 1- it should preserve the estimate in a coarse code and 2- it should be efficient, i.e., the variance should be close to the Cramér-Rao bound. We explore in this paper various network architectures for performing estimation with coarse codes using lateral connections. We start by briefly describing several classical estimators such as OLE or ML. Then, we consider linear and non-linear recurrent networks and compare their performances with the classical estimators. \n\n2 Classical Methods \n\nThe simplest estimators are linear, of the form θ̂_OLE = W^T A. Better performance can be obtained with a center of mass estimator (COM), θ̂_COM = Σ_i θ_i a_i / Σ_i a_i; however, in the case of a periodic variable, such as direction of motion, the best one-shot method known is the complex estimator (COMP), θ̂_COMP = phase(z), where z = Σ_{k=1}^{N} a_k e^{iθ_k} [5].
This estimator consists in fitting a cosine through the pattern of activity, like the one shown in figure 1-B, and using the phase of the best cosine fit as the estimate of direction. This method is suboptimal if the data were not generated by cosine tuning functions, as in the case illustrated in figure 1-A. It is possible to obtain optimum performance by fitting the curve that was actually used to generate the data, i.e., the actual tuning curves of the units. A maximum likelihood estimate, defined as being the direction maximizing p(A|θ), involves exactly this type of curve fitting, a process illustrated in figure 1-B [5]. The estimate is computed by first finding the \"expected\" hill (the hill that would be obtained in a noise-free system) minimizing the distance with the data. In the case of gaussian noise, the appropriate distance measure to minimize is the Euclidean squared distance. The final position of the peak of the hill corresponds to the maximum likelihood estimate, θ̂_ML. \n\n3 Recurrent Networks \n\nFigure 2: A- Circular network of 64 units. Only the connections originating from one unit are shown. B- Activity over time in the non-linear network when initialized with a random pattern at t = 0. The activities of the units are plotted as a function of their position along the circle, which is equivalent to their preferred direction of motion with an appropriate choice of weights. \n\nConsider a circular network of 64 units fully connected like the one depicted in figure 2-A. With an appropriate choice of weights and activation function, this network will develop a hill-shaped pattern of activity in response to a transient input as illustrated in figure 2-B.
If we initialize this network with activity patterns A = {a_i} corresponding to the responses of 64 direction tuned units (figure 1), we can use the final position of the hill across the neuronal array after relaxation as an estimate of the direction, θ̂. The variance of this estimator will depend on the exact choice of activation function and weights. \n\n3.1 Linear Network \n\nWe first consider a network of 64 units whose dynamics is governed by the following difference equation: \n\na_i(t+1) = Σ_j w_ij a_j(t)    (1) \n\nThe dynamics of such networks is well understood [3]. If each unit receives the same weight vector w, then the weight matrix W is symmetric. In this case, the network dynamics amplifies or suppresses each Fourier component of the initial input pattern, {a_i}, independently by a factor equal to the corresponding component of the Fourier transform, w̃, of w. For example, if the first component of w̃ is more than one (resp. less than one), the first Fourier component of the initial pattern of activity will be amplified (resp. suppressed). \n\nThus, we can choose W such that the network selectively amplifies the first Fourier component of the data while suppressing the others. The network would be unstable, but if we stop after a large, yet fixed, number of iterations, the activity pattern would look like a cosine function of direction with a phase corresponding to the phase of the first Fourier component of the data. In other words, the network would end up fitting a cosine function to the data, which is equivalent to the COMP method described above. A network for orientation selectivity proposed by Ben-Yishai et al. [1] is closely related to this linear network. \n\nAlthough this method keeps the estimate in a coarse code format, it suffers from two problems: it is unclear how it could be extended to non-periodic variables, such as disparity, and it is suboptimal since it is equivalent to the COMP estimator.
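The linear scheme can be made concrete with a small sketch. Here a symmetric circulant weight matrix is built from a cosine kernel whose Fourier transform is 1.1 at frequency one and 0 at all other frequencies, so iterating the map amplifies only the first Fourier component of the input; the gain of 1.1, the seed, and the noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

n = 64
theta = np.linspace(0.0, 2 * np.pi, n, endpoint=False)

# Symmetric circulant weights: the kernel's Fourier transform is 1.1 at
# frequency 1 and 0 elsewhere, so iteration amplifies only that component.
w = (2.0 * 1.1 / n) * np.cos(theta)
W = np.array([np.roll(w, i) for i in range(n)])

rng = np.random.default_rng(1)
# Noisy hill of activity centered on 180 degrees (illustrative input)
a = np.exp(7.0 * (np.cos(theta - np.pi) - 1.0)) + 0.1 * rng.normal(size=n)

x = a.copy()
for _ in range(50):          # large, fixed number of iterations
    x = W @ x

# After many iterations x is essentially a pure cosine; its phase is the
# estimate, and it coincides with the COMP (first-Fourier-component) read-out.
est = np.angle(np.sum(x * np.exp(1j * theta))) % (2 * np.pi)
comp = np.angle(np.sum(a * np.exp(1j * theta))) % (2 * np.pi)
```

Because the iteration only rescales the first Fourier component, `est` matches the one-shot COMP estimate of the same data, which is the equivalence claimed above.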
\n\n3.2 Non-Linear Network \n\nWe consider next a fully connected network of 64 units whose dynamics is governed by the following difference equations: \n\no_i(t) = g(u_i(t)) = 6.3 [log(1 + e^(5 + 10 u_i(t)))]^0.8    (2) \n\nu_i(t + δt) = u_i(t) + δt (−u_i(t) + Σ_j w_ij o_j(t))    (3) \n\nZhang (1996) has demonstrated that with appropriate symmetric weights, {w_ij}, this network develops a stable hill of activity in response to an arbitrary transient input pattern {I_i} (figure 2-B). The shape of the hill is fully specified by the weights and activation function whereas, by contrast, the final position of the hill on the neuronal array depends only on the initial input. Therefore, like ML, the network fits an \"expected\" function through the data. We first present a set of simulations in which we investigated whether ML and the network place the hill at the same location. \n\nMethods: The simulations consisted in estimating the value of the direction of a moving bar based on the activity, A = {a_i}, of 64 input units with hill-shaped tuning to direction, corrupted by noise. We used circular normal functions like the ones shown in figure 1-A to model the mean activities, f_i(θ): \n\nf_i(θ) = 3 exp(7(cos(θ − θ_i) − 1)) + 0.3    (4) \n\nThe value 0.3 corresponds to the mean spontaneous activity of each unit. The peaks, θ_i, of the circular normal functions were uniformly spread over the interval [0°, 360°]. The activities, {a_i}, depended on the noise distribution. We used two types of noise: normally distributed with fixed variance, σ_n² = 1, and Poisson distributed: \n\nP(a_i = a|θ) = (1/√(2πσ_n²)) exp(−(a − f_i(θ))²/(2σ_n²)),    P(a_i = k|θ) = f_i(θ)^k e^(−f_i(θ)) / k!    (5) \n\nOur results compare the standard deviations of four estimators, OLE, COM, COMP and ML, to the non-linear recurrent network (RN) with transient inputs (the input patterns are shown on the first iteration only).
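This simulation setup, together with the one-shot estimators of section 2, can be sketched as follows for a single trial under gaussian noise; the 3600-point search grid, the seed, and the non-periodic form of COM are illustrative assumptions.

```python
import numpy as np

n = 64
theta_pref = np.linspace(0.0, 2 * np.pi, n, endpoint=False)

def f(theta):
    """Circular normal tuning curves of equation (4)."""
    return 3.0 * np.exp(7.0 * (np.cos(theta - theta_pref) - 1.0)) + 0.3

rng = np.random.default_rng(3)
theta0 = np.pi                                  # true direction, 180 degrees
a = f(theta0) + rng.normal(0.0, 1.0, n)         # gaussian noise, sigma_n = 1

# ML by template matching: slide the noise-free "expected" hill over a fine
# grid and keep the position minimizing the squared distance to the data
# (the right distance measure for gaussian noise, as in section 2).
grid = np.linspace(0.0, 2 * np.pi, 3600, endpoint=False)
sq_err = [np.sum((a - f(t)) ** 2) for t in grid]
theta_ml = grid[int(np.argmin(sq_err))]

# One-shot estimators for comparison
theta_comp = np.angle(np.sum(a * np.exp(1j * theta_pref))) % (2 * np.pi)  # COMP
theta_com = (np.sum(theta_pref * a) / np.sum(a)) % (2 * np.pi)            # COM
```

Repeating this over many trials and taking the standard deviation of each estimate reproduces the kind of comparison reported in figure 3.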
In the case of ML, we used the Cramér-Rao bound to compute the standard deviation as described in Seung and Sompolinsky (1993) [5]. The weights in the recurrent network were chosen such that the final pattern of activity in the network has a profile very similar to the tuning function f_i(θ). \n\nFigure 3: Histogram of the standard deviations of the estimate for all five methods (OLE, COM, COMP, ML and RN), for noise with normal distribution and noise with Poisson distribution. \n\nResults: Since the preferred directions of two consecutive units in the network are more than 5° apart, we first wondered whether RN estimates exhibit a bias (a difference between the mean estimate and the true direction), in particular for directions between the peaks of two consecutive units. Our simulations showed no significant bias for any of the directions tested (not shown). Next, we compared the standard deviations of the estimates for all five methods and for the two types of noise. The RN method was found to outperform the OLE, COM and COMP estimators in both cases and to match the Cramér-Rao bound for gaussian noise (figure 3), as suggested by our analysis. For noise with Poisson distribution, the standard deviation for RN was only 0.344° above ML (figure 3). \n\nWe also estimated numerically −∂θ̂_RN/∂a_i |θ=170°, the derivative of the RN estimate with respect to the initial activity of each of the 64 units, for a direction of 170°. This derivative in the case of ML matches closely the derivative of the cell tuning curve, f_i'(θ). In other words, in ML, units contribute to the estimate according to the amplitude of the derivative of the tuning curve. As shown in figure 4-A, the same is true for RN: −∂θ̂_RN/∂a_i |θ=170° matches closely the derivative of the units' tuning curves.
\nIn contrast, the same derivatives for the COMP estimate (dotted line) or the COM estimate (dash-dotted line) do not match the profile of f_i'(θ). In particular, units with preferred directions far away from 170°, i.e., units whose activity is just noise, end up contributing to the final estimate, hindering the performance of the estimator. \n\nWe also looked at the standard deviation of the RN as a function of time, i.e., the number of iterations. Reaching a stable state can take up to several hundred iterations, which could make the RN method too slow for any practical purpose. We found however that the standard deviation decreases very rapidly over the first 5-6 iterations and reaches asymptotic values after around 20 iterations (figure 4-B). Therefore, there is no need to wait for a perfectly stable pattern of activity to obtain minimum standard deviation. \n\nFigure 4: A- Comparison of f'(θ) (solid line) with −∂θ̂/∂a_i |θ=170° for RN, COMP and COM. All functions have been normalized to one. B- Standard deviation as a function of the number of iterations for RN. \n\nAnalysis: One way to determine which factors control the final position of the hill is to find a function, called a Lyapunov function, which is minimized over time by the network dynamics. Cohen and Grossberg (1983) [2] have shown that networks characterized by the dynamical equation above and in which the input pattern {sI_i}
\n\nis clamped, minimizes a Lyapunov function of the form : \n\n(6) \n\nThe last term is the dot product between the input pattern, {sIi }, and the current \nactivity pattern, {g( Ui)}, on the neuronal array. Here s is simply a scaling factor \nfor the input pattern. The dynamics of the network will therefore tend to minimize \n- Li Iig( ud, or equivalently, to maximize the overlap between the stable pattern \nand the input pattern . The other terms however are also dependent on Ii because \nthe shape of the final stable activity profile depends on the input pattern. Therefore \nthe network will settle into a compromise between maximizing overlap and getting \nthe right profile given the clamped input. \n\nWe can show however that, for small input (i.e., as the scaling factor s -+ 0), \nthe dominant term in the Lyapunov function is the dot product. To see this, we \nconsider the Taylor expansion of Lyapunov function L with respect to s. First, let \n{Ui } denote the profile of the stable activity {ud in the limit of zero input (s -+ 0), \nand then write the corresponding value of the Lyapunov function at zero input as \nLa. Now keeping only the first-order terms of s in the Taylor expansion, we obtain: \n\n(7) \n\nThis means that the dot product is the only first order term of s, and disturbances \nto the shape of the final activity profile contribute only to higher order terms of \ns, which are negligible when s is small. Notice that in the limit of zero input, the \nshape of the activity profile {Ui} is fixed, and the only thing unknown is its peak \nposition. Because La is a constant, the global minimum of the Lyapunov function \nhere should correspond to a peak position which maximizes the dot product. The \ndifference between Ui and Ui is negligible for sufficiently small input because, by \ndefinition, Ui -+ Ui as s -+ O. 
Consequently, for small input, the network will converge to a solution maximizing primarily Σ_i I_i g(u_i), which is mathematically equivalent to minimizing the squared distance between the input and the output pattern. \n\nTherefore, if we use an activity pattern, A = {a_i}, as the input to this network, the stable hill should have its peak at a position very close to the direction corresponding to the maximum likelihood estimate (under the assumption of gaussian noise), provided the network is not attracted into a local minimum of the Lyapunov function. This result is valid when using a small clamped input, but our simulations show that a transient input is sufficient to reach the Cramér-Rao bound. \n\n4 Discussion \n\nOur results demonstrate that it is possible to perform efficient unbiased estimation with coarse coding using a neurally plausible architecture. Our model relies on lateral connections to implement a prior expectation on the profile of the activity patterns. As a consequence, units determine their activation according to their own input and the activity of their neighbors. This approach shows that one of the advantages of coarse codes is to provide a representation which simplifies the problem of cleaning up uncorrelated noise within a neuronal population. \n\nUnlike OLE, COM and COMP, the RN estimate is not the result of a voting process in which units vote for their preferred direction, θ_i. Instead, units turn out to contribute according to the derivatives of their tuning curves, f_i'(θ), as in the case of ML. This feature allows the network to ignore background noise, that is to say, responses due to factors other than the variable of interest. This property also predicts that discrimination of directions around the vertical (90°) would be most affected by shutting off the units tuned to 60° and 120°.
This prediction is consistent with psychophysical experiments showing that discrimination around the vertical in humans is affected by prior adaptation to orientations displaced from the vertical by ±30° [4]. \n\nOur approach can be readily extended to any other periodic sensory or motor variable. For non-periodic variables, such as the disparity of a line in an image, our network needs to be adapted since it currently relies on circularly symmetric weights. Simply unfolding the network will be sufficient to deal with values around the center of the interval under consideration, but more work is needed to deal with boundary values. We can also generalize this approach to arbitrary mappings between two coarse codes for variables x and y where y is a function of x. Indeed, a coarse code for x provides a set of radial basis functions of x which can subsequently be used to approximate arbitrary functions. It is even conceivable to use a similar approach for one-to-many mappings, a common situation in vision or robotics, by adapting our network such that several hills can coexist simultaneously. \n\nReferences \n\n[1] R. Ben-Yishai, R. L. Bar-Or, and H. Sompolinsky. Proc. Natl. Acad. Sci. USA, 92:3844-3848, 1995. \n\n[2] M. Cohen and S. Grossberg. IEEE Trans. SMC, 13:815-826, 1983. \n\n[3] M. Hirsch and S. Smale. Differential equations, dynamical systems and linear algebra. Academic Press, New York, 1974. \n\n[4] D. M. Regan and K. I. Beverley. J. Opt. Soc. Am., 2:147-155, 1985. \n\n[5] H. S. Seung and H. Sompolinsky. Proc. Natl. Acad. Sci. USA, 90:10749-10753, 1993. \n", "award": [], "sourceid": 1312, "authors": [{"given_name": "Alexandre", "family_name": "Pouget", "institution": null}, {"given_name": "Kechen", "family_name": "Zhang", "institution": null}]}