{"title": "Position Variance, Recurrence and Perceptual Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 31, "page_last": 37, "abstract": null, "full_text": "Position Variance, Recurrence and Perceptual Learning \n\nZhaoping Li \nPeter Dayan \nGatsby Computational Neuroscience Unit \n17 Queen Square, London, England, WC1N 3AR. \nzhaoping@gatsby.ucl.ac.uk \ndayan@gatsby.ucl.ac.uk \n\nAbstract \n\nStimulus arrays are inevitably presented at different positions on the retina in visual tasks, even those that nominally require fixation. In particular, this applies to many perceptual learning tasks. We show that perceptual inference or discrimination in the face of positional variance has a structurally different quality from inference about fixed-position stimuli, involving a particular, quadratic, non-linearity rather than a purely linear discrimination. We show the advantage that taking this non-linearity into account has for discrimination, and suggest it as a role for recurrent connections in area V1, by demonstrating the superior discrimination performance of a recurrent network. We propose that learning the feedforward and recurrent neural connections for these tasks corresponds to the fast and slow components of learning observed in perceptual learning tasks. \n\n1 Introduction \n\nThe field of perceptual learning in simple, but high precision, visual tasks (such as vernier acuity tasks) has produced many surprising results whose import for models has yet to be fully felt. A core result is that there are two stages of learning: one fast, which happens over the first few trials, and another slow, which happens over multiple sessions, may involve REM sleep, and can last for months or even years (Fahle, 1994; Karni & Sagi, 1993; Fahle, Edelman, & Poggio 1995). 
Learning is surprisingly specific, in some cases being tied to the eye of origin of the input, and rarely admitting generalisation across wide areas of space or between tasks that appear extremely similar, even involving the same early-stage detectors (eg Fahle, Edelman, & Poggio 1995; Fahle, 1994). For instance, improvement through learning on an orientation discrimination task does not lead to improvement on a vernier acuity task (Fahle 1997), even though both tasks presumably use the same orientation selective striate cortical cells to process inputs. \n\nOf course, learning in human psychophysics is likely to involve plasticity in a large number of different parts of the brain over various timescales. Previous studies (Poggio, Fahle, & Edelman 1992; Weiss, Edelman, & Fahle 1993) proposed phenomenological models of learning in a feedforward network architecture. In these models, the first stage units in the network receive the sensory inputs through the medium of basis functions relevant for the perceptual task. Over learning, a set of feedforward weights is acquired such that the weighted sum of the activities from the input units can be used to make an appropriate binary decision, eg using a threshold. These models can account for some, but not all, observations on perceptual learning (Fahle et al 1995). Since the activity of V1 units seems not to relate directly to behavioral decisions on these visual tasks, the feedforward connections \n\nFigure 1: Mid-point discrimination. A) Three bars are presented at x−, x0 and x+. The task is to report which of the outer bars is closer to the central bar. y represents the variable placement of the stimulus array. B) Population activities in cortical cells evoked by the stimulus bars: the activities ai are plotted against the preferred locations xi of the cells. 
This comes from Gaussian tuning curves (k = 20; τ = 0.1) and Poisson noise. There are 81 units whose preferred values are placed at regular intervals of Δx = 0.05 between x = −2 and x = 2. \n\nmust model processing beyond V1. The lack of generalisation between tasks that involve the same visual feature samplers suggests that the basis functions, eg the orientation selective primary cortical cells that sample the inputs, do not change their sensitivity and shapes, eg their orientation selectivity or tuning widths. However, evidence such as the specificity of learning to the eye of origin and spatial location strongly suggests that lower visual areas such as V1 are directly involved in learning. Indeed, V1 is a visual processor of quite some computational power (performing tasks such as segmentation, contour-integration, pop-out and noise removal) rather than being just a feedforward, linear, processing stage (eg Li, 1999; Pouget et al 1998). \n\nHere, we study a paradigmatic perceptual task from a statistical perspective. Rather than suggest particular learning rules, we seek to understand what it is about the structure of the task that might lead to two phases of learning (fast and slow), and thus what computational job might be ascribed to V1 processing, in particular, the role of lateral recurrent connections. We agree with the general consensus that fast learning involves the feedforward connections. However, by considering positional invariance for discrimination, we show that there is an inherently non-linear component to the overall task, which defeats feedforward algorithms. \n\n2 The bisection task \n\nFigure 1A shows the bisection task. Three bars are presented at horizontal positions x0 = y + ε, x− = −1 + y and x+ = 1 + y, where −1 ≪ ε ≪ 1. 
Here y is a nuisance random number with zero mean, reflecting the variability in the position of the stimulus array due to eye movements or other uncontrolled factors. The task for the subject is to report which of the outer bars is closer to the central bar, ie to report whether ε is greater than or less than 0. The bars create a population-coded representation in V1 cells preferring vertical orientation. In figure 1B, we show the activity of cells ai as a function of the preferred topographic location xi of the cell; for simplicity, we ignore activities from other V1 cells which prefer orientations other than vertical. \n\nWe assume that the cortical response to the bars is additive, with mean \n\nāi(ε, y) = f(xi − x0) + f(xi − x−) + f(xi − x+)   (1) \n\n(we often drop the dependence on ε, y and write āi, or, for all the components, ā) where f is, say, a Gaussian tuning curve with height k and tuning width τ, f(x) = k e^{−x²/2τ²}, usually with τ ≪ 1. The net activity is ai = āi + ni, where ni is a noise term. We assume that ni comes from a Poisson distribution and is independent across the units, and that ε and y have mean zero and are uniformly distributed in their respective ranges. \n\nThe subject must report whether ε is greater or less than 0 on the basis of the activities a. A normative way to do this is to calculate the probability P[ε|a] of ε given a, and report by maximum likelihood (ML) that ε > 0 if ∫_{ε>0} dε P[ε|a] > 0.5. Without prior information about ε, y, and with Poisson noise ni = ai − āi, we have \n\nP[a|ε, y] = ∏i e^{−āi(ε,y)} āi(ε,y)^{ai} / ai!   (2) \n\n3 Fixed position stimulus array \n\nWhen the stimulus array is in a fixed position y = 0, analysis is easy, and is very similar to that carried out by Seung & Sompolinsky (1993). Dropping y, we calculate log P[a|ε] and approximate it by Taylor expansion about ε = 0 to second order in ε: \n\nlog P[a|ε] ≈ constant + ε ∂/∂ε log P[a|ε]|_{ε=0} + (ε²/2) ∂²/∂ε² log P[a|ε]|_{ε=0}   (3) \n\nignoring higher order terms. 
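The response model of equations (1) and (2) can be sketched numerically; a minimal illustration, assuming the parameters of the figure 1 caption (k = 20, τ = 0.1, 81 units on a Δx = 0.05 grid):

```python
import numpy as np

# Sketch of equations (1)-(2): Gaussian tuning curves plus Poisson noise.
# Parameters k = 20, tau = 0.1 and the 81-unit grid follow the figure 1 caption.
k, tau = 20.0, 0.1
xs = np.arange(-2.0, 2.0 + 1e-9, 0.05)      # preferred locations x_i

def mean_response(eps, y):
    # Mean activity abar_i(eps, y) = f(x_i - x0) + f(x_i - x_minus) + f(x_i - x_plus)
    bars = [y + eps, -1.0 + y, 1.0 + y]     # bar positions x0, x-, x+
    return sum(k * np.exp(-(xs - b) ** 2 / (2 * tau ** 2)) for b in bars)

rng = np.random.default_rng(0)
a_bar = mean_response(eps=0.02, y=0.0)      # mean activities of equation (1)
a = rng.poisson(a_bar)                      # noisy activities a_i = abar_i + n_i
```

The resulting a is one sample of the population activity pattern of figure 1B.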
Provided that the last term is negative (which it indeed is, almost surely), we derive an approximately Gaussian distribution with variance σε² = [−∂²/∂ε² log P[a|ε]|_{ε=0}]^{−1} and mean ε̂ = σε² ∂/∂ε log P[a|ε]|_{ε=0}. Thus the subject should report that ε > 0 or ε < 0 if the test t(a) = ∂/∂ε log P[a|ε]|_{ε=0} is greater or less than zero respectively. For the Poisson noise case we consider, log P[a|ε] = constant + Σi ai log āi(ε), since Σi āi(ε) is a constant, independent of ε. Thus, \n\nt(a) = Σi ai ∂/∂ε log āi(ε)|_{ε=0}   (4) \n\nwi = ∂/∂ε log āi(ε)|_{ε=0}   (5) \n\nTherefore, maximum likelihood discrimination can be implemented by a linear feedforward network mapping inputs ai through the feedforward weights wi of equation (5) to calculate as the output t(a) = Σi wi ai. A threshold of 0 on t(a) provides the discrimination: ε > 0 if t(a) > 0 and ε < 0 for t(a) < 0. The task therefore has an essentially linear character. Note that if the noise corrupting the activities is Gaussian, the weights should instead be wi = ∂āi/∂ε. \n\nFigure 2A shows the optimal discrimination weights for the case of independent Poisson noise. The lower solid line in figure 2C shows optimal performance as a function of ε. The error rate drops precipitously from 50% for very small (and thus difficult) ε to almost 0, long before ε approaches the tuning width τ. \n\nIt is also possible to learn weights in a variety of ways (eg Poggio, Fahle & Edelman, 1992; Weiss, Edelman & Fahle, 1993; Fahle, Edelman & Poggio 1995). Figure 2B shows discrimination weights learned using a simple error-correcting learning procedure, which are almost the same as the optimal weights and lead to performance that is essentially optimal (the lower dashed line in figure 2C). We use error-correcting learning as a comparison technique below. \n\n4 Moveable stimulus array \n\nIf the stimulus array can move around, ie if y is not necessarily 0, then the discrimination task gets considerably harder. 
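Both effects (near-optimal linear discrimination at y = 0, and its degradation once y varies) can be sketched in a minimal simulation, assuming the illustrative parameters k = 20, τ = 0.1, with the weights of equation (5) computed by finite differences; the error rates obtained are only indicative, not those of figure 2C:

```python
import numpy as np

k, tau = 20.0, 0.1
xs = np.arange(-2.0, 2.0 + 1e-9, 0.05)

def mean_response(eps, y):
    # Mean activities of equation (1): three Gaussian bumps at the bar positions
    bars = [y + eps, -1.0 + y, 1.0 + y]
    return sum(k * np.exp(-(xs - b) ** 2 / (2 * tau ** 2)) for b in bars)

# ML weights for y = 0, equation (5): w_i = d(log abar_i)/d(eps) at eps = 0
d = 1e-4
w = (np.log(mean_response(d, 0.0)) - np.log(mean_response(-d, 0.0))) / (2 * d)

def error_rate(y_range, trials=2000, seed=1):
    # Fraction of trials on which the linear test sum_i w_i a_i misreports sign(eps)
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(trials):
        eps = rng.uniform(-0.1, 0.1)
        y = rng.uniform(-y_range, y_range)
        a = rng.poisson(mean_response(eps, y))
        if (w @ a > 0) != (eps > 0):
            errors += 1
    return errors / trials

fixed = error_rate(0.0)    # stimulus array at a fixed position, y = 0
moving = error_rate(0.2)   # y drawn uniformly from [-0.2, 0.2]
```

In this sketch, fixed is small while moving is substantially larger: the same linear weights that are near-optimal for y = 0 fail once the array can move.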
The upper dotted line in figure 2C shows the (rather unfair) test of using the learned weights in figure 2B when y ∈ [−0.2, 0.2] varies uniformly. Clearly this has a highly detrimental effect on the quality of discrimination. Looking at the weight structure in figure 2A,B suggests an obvious reason for this: the weights associated with the outer bars are zero, since they provide no information about ε when y = 0, and the \n\nFigure 2: A) The ML optimal discrimination weights w = ∂/∂ε log ā (plotted as wi vs. xi) for deciding if ε > 0 when y = 0. B) The learned discrimination weights w for the same decision. During online learning, random examples were selected with ε ∈ [−r, r] uniformly, r = 0.1, and the weights were adjusted online to maximise the log probability of generating the correct discrimination under a model in which the probability of declaring that ε > 0 is σ(Σi wi ai) = 1/(1 + exp(−Σi wi ai)). C) Performance of the networks with ML (lower solid line) and learned (lower dashed line) weights as a function of ε. Performance is measured by drawing a randomly given ε and y, and assessing the %'age of trials on which the answer is incorrect. The upper dotted line shows the effect of drawing y ∈ [−0.2, 0.2] uniformly, yet using the weights in (B) that assume y = 0. \n\nweights are finely balanced about 0, the mid-point of the outer bars, giving an unbiased or balanced discrimination on ε. If the whole array can move, this balance will be destroyed, and all the above conclusions change. The equivalent of equation (3) when y ≠ 0 is \n\nlog P[a|ε, y] ≈ constant + ε ∂/∂ε log P|_{ε,y=0} + y ∂/∂y log P|_{ε,y=0} + (ε²/2) ∂²/∂ε² log P|_{ε,y=0} + εy ∂²/∂ε∂y log P|_{ε,y=0} + (y²/2) ∂²/∂y² log P|_{ε,y=0} \n\nThus, to second-order, a Gaussian distribution can approximate P[ε, y|a]. Figure 3A shows the high quality of this approximation. 
Here, ε and y are anti-correlated given activities a, because the information from the center stimulus bar only constrains their sum ε + y. Of interest is the probability P[ε|a] = ∫ dy P[ε, y|a], which is approximately Gaussian with mean βρε² and variance ρε², where, under Poisson noise ni = ai − āi, \n\nβ = [a · ∂/∂ε log ā − (a · ∂²/∂y∂ε log ā)(a · ∂/∂y log ā)/(a · ∂²/∂y² log ā)]|_{ε,y=0} \n\nρε^{−2} = [(a · ∂²/∂y∂ε log ā)²/(a · ∂²/∂y² log ā) − a · ∂²/∂ε² log ā]|_{ε,y=0} \n\nSince −a · ∂²/∂y² log ā (which is the inverse variance of the Gaussian distribution of y that we integrated out) is positive, the appropriate test for the sign of ε is \n\nt(a) = [(a · ∂²/∂y∂ε log ā)(a · ∂/∂y log ā) − (a · ∂/∂ε log ā)(a · ∂²/∂y² log ā)]|_{ε,y=0}   (6) \n\nIf t(a) > 0 then we should report ε > 0, and conversely. Interestingly, t(a) is a very simple quadratic form \n\nt(a) = a · Q′ · a ≡ Σij ai aj [(∂²/∂y∂ε log āi)(∂/∂y log āj) − (∂/∂ε log āi)(∂²/∂y² log āj)]|_{ε,y=0}   (7) \n\nTherefore, the discrimination problem in the face of positional variance has a precisely quantifiable non-linear character. The quadratic test t(a) cannot be implemented by a linear feedforward architecture only, since the optimal boundary t(a) = 0 separating the state space a for a decision is now curved. Writing t(a) = a · Q · a, where the symmetric \n\nFigure 3: Varying y. A) Posterior distribution P[ε, y|a]. Exact (left) P[ε, y|a] for a particular a with true values ε = 0.2τ, y = 1.5τ (with τ = 0.1) and its bivariate Gaussian approximation (right). Only the relevant region of (ε, y) space is shown; outside this, the probability mass is essentially 0 (and the contour values are the same). 
B) The quadratic form Q, Qij vs. xi and xj. C) The four eigenvectors of Q with non-zero eigenvalues (shown above). The eigenvalues come in ± pairs; the associated eigenvectors come in antisymmetric pairs. The absolute scale of Q and its eigenvalues is arbitrary. \n\nFigure 4: y ≠ 0. A) Performance of the approximate inference based on the quadratic form of figure 3B, in terms of %'age error as a function of |y| and |ε| (τ = 0.1). B) Feedforward weights, wi vs. xi, learned using the same procedure as in figure 2B, but with y ∈ [−0.2, 0.2] chosen uniformly at random. C) Ratio of error rates for the linear (weights from B) to the quadratic discrimination. Values that would be infinite are pegged at 20. \n\nform Qij = (Q′ij + Q′ji)/2, we find Q only has four non-zero eigenvalues, for the 4-dimensional sub-space spanned by the vectors ∂/∂y log ā|_{ε,y=0}, ∂/∂ε log ā|_{ε,y=0}, ∂²/∂y∂ε log ā|_{ε,y=0} and ∂²/∂y² log ā|_{ε,y=0}. Q and its eigenvectors and eigenvalues are shown in Figure 3B,C. Note that if Gaussian rather than Poisson noise is used for ni = ai − āi, the test t(a) is still quadratic. Using t(a) to infer ε is sound for y up to two standard deviations (τ) of the tuning curve f(x) away from 0, as shown in Figure 4A. By comparison, a feedforward network, with weights shown in figure 4B and learned using the same error-correcting learning procedure as above, gives substantially worse performance, even though it is better than the feedforward net of Figure 2A,B. Figure 4C shows the ratio of the error rates for the linear to the quadratic decisions. The linear network is often dramatically worse, because it fails to take proper account of y. 
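The quadratic test of equations (6) and (7) can likewise be sketched numerically; a minimal construction of Q′ from finite-difference derivatives of log ā (illustrative parameters as before), which also exhibits the rank-4 structure just described:

```python
import numpy as np

k, tau = 20.0, 0.1
xs = np.arange(-2.0, 2.0 + 1e-9, 0.05)

def log_abar(eps, y):
    # log of the mean activities of equation (1)
    bars = [y + eps, -1.0 + y, 1.0 + y]
    return np.log(sum(k * np.exp(-(xs - b) ** 2 / (2 * tau ** 2)) for b in bars))

# Central-difference derivatives of log abar at (eps, y) = (0, 0)
d = 1e-3
dL_de = (log_abar(d, 0) - log_abar(-d, 0)) / (2 * d)
dL_dy = (log_abar(0, d) - log_abar(0, -d)) / (2 * d)
d2L_dyde = (log_abar(d, d) - log_abar(d, -d) - log_abar(-d, d) + log_abar(-d, -d)) / (4 * d * d)
d2L_dy2 = (log_abar(0, d) - 2 * log_abar(0, 0) + log_abar(0, -d)) / (d * d)

# Equation (7): Q'_ij, then the symmetrised Q used in the decision t(a) = a.Q.a
Q_prime = np.outer(d2L_dyde, dL_dy) - np.outer(dL_de, d2L_dy2)
Q = 0.5 * (Q_prime + Q_prime.T)

def t(a):
    # Quadratic test: report eps > 0 iff t(a) > 0
    return a @ Q @ a

# Q is built from outer products of 4 vectors, so at most 4 non-zero eigenvalues
eigs = np.linalg.eigvalsh(Q)
significant = np.abs(eigs) > 1e-6 * np.abs(eigs).max()
```

Because Q is a sum of two symmetrised outer products, its rank is at most 4, matching the four-eigenvector structure of figure 3C.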
\n\nWe originally suggested that recurrent interactions in the form of horizontal intra-cortical \nconnections within VI might be the site of the longer term improvement in behavior. Fig(cid:173)\nure 5 demonstrates the plausibility of this idea. Input activity (as in figure IB) is used \nto initialise the state u at time t = 0 of a recurrent network. The recurrent weights are \n\n\fA \n\nB \n\nrecurrent weights \n\nC recu rrent error \n\no lin/rec error \n\nDecision \n\nInput a \n\ny \n\nFigure 5: Threshold linear recurrent network, its weights, and performance. See text. \n\nsymmetric and shown in figure 5B. The network activities evolve according to \n\ndui/dt = -Ui + Lj Jijg(Uj) + ai \n\n(8) \nwhere Jij are the recurrent weight from unit j to i, g(u) = U if U > 0 and g(u) = 0 \n:::; O. The network activities finally settle to an equilibrium u(t -+ 00) (note that \nfor U \nUi (t -+ 00) = ai when J = 0). The activity values u( t -+ 00) of this equilibrium are fed \nthrough feed forward weights w, that are trained for this recurrent network just as for the \npure feedforward case, to reach a decision Li WiUi(t -+ 00). Figure 5C shows that using \nthis network gives results that are almost invariant to y (as for the quadratic discriminator) ; \nand figure 5D shows that it generally outperforms the optimal linear discriminator by a large \nmargin, albeit performing slightly worse than the quadratic form. The recurrent weight \nmatrix is subject to three influences: (1) a short range interaction Jij for IXi - Xj I ;S T to \nstablize activities ai induced by a single bar in the input; (2) a longer range interaction Jij \nfor IXi - Xj I '\" 1 to mediate interaction between neighboring stimulus bars, amplifying the \neffects of the displacement signal \u00a3, and (3) a slight local interaction Jij for lXii, IXj I ;S \nT. 
The first two interaction components are translation invariant in the spatial range xi, xj ∈ [−2, 2] where the stimulus array appears, in order to accommodate the positional variance in y. The last component is not translation invariant and counters variations in y. \n\n5 Discussion \n\nThe problem of position invariant discrimination is common to many perceptual learning tasks, including hyper-acuity tasks such as the standard line vernier, three-dot vernier, curvature vernier, and orientation vernier tasks (Fahle et al 1995, Fahle 1997). Hence, the issues we address and analyze here are of general relevance. In particular, our mathematical formulation, derivations, and thus conclusions, are general and do not depend on any particular aspect of the bisection task. One essential problem in many of these tasks is to discriminate a stimulus variable ε that depends only on the relative positions between the stimulus features, while the absolute position y of the whole stimulus array can vary between trials by an amount that is much larger than the discrimination threshold (or acuity) on ε. The positional variable y need not correspond to the absolute position of the stimulus array; it may merely be the error in the estimation of the absolute position of the stimulus by other neural areas. Our study suggests that although the discrimination is easy when y is fixed at 0, and is soluble by a linear, feedforward network whose weights can be learnt in a straightforward manner, when y is not fixed, optimal discrimination of ε is based on an approximately quadratic function of the input activities, which cannot be implemented using a linear feedforward net. \n\nWe also showed that a non-linear recurrent network, which is a close relative of a line attractor network, can perform much better than a pure feedforward network on the bisection task in the face of position variance. 
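The threshold-linear dynamics of equation (8) that this network uses can be sketched as follows; the kernel J here is a hypothetical short-range-excitation, longer-range-inhibition profile chosen only for illustration, not the trained weights of figure 5B:

```python
import numpy as np

xs = np.arange(-2.0, 2.0 + 1e-9, 0.05)
n = len(xs)

def g(u):
    # Threshold-linear gain: g(u) = u for u > 0, else 0
    return np.maximum(u, 0.0)

# Hypothetical symmetric recurrent kernel (illustration only): weak short-range
# excitation plus weaker, broader inhibition, as a stand-in for figure 5B
dist = np.abs(xs[:, None] - xs[None, :])
J = 0.05 * np.exp(-dist ** 2 / (2 * 0.05 ** 2)) - 0.02 * np.exp(-dist ** 2 / (2 * 0.2 ** 2))

def settle(a, J, dt=0.1, steps=2000):
    # Euler-integrate du_i/dt = -u_i + sum_j J_ij g(u_j) + a_i  (equation 8)
    u = a.astype(float)             # initialise the state with the input activity
    for _ in range(steps):
        u = u + dt * (-u + J @ g(u) + a)
    return u

a = np.random.default_rng(2).poisson(5.0, size=n)
u_eq = settle(a, J)                 # equilibrium activities u(t -> infinity)
```

With J = 0 the dynamics relax to u = a, matching the note in the text; with the weak kernel above they settle to a nearby, laterally reshaped equilibrium that downstream feedforward weights would then read out.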
There is experimental evidence that lateral connections within V1 change after learning the bisection task (Gilbert 2000), although we have yet to construct an appropriate learning rule. We suggest that learning the recurrent weights for the nonlinear transform corresponds to the slow component in perceptual learning, while learning the feedforward weights corresponds to the fast component. The desired recurrent weights are expected to be much more difficult to learn, in the face of nonlinear transforms and (the easily unstable) recurrent dynamics. Further, the feedforward weights need to be adjusted further as the recurrent weights change the activities on which they work. \n\nThe precise recurrent interactions in our network are very specific to the task and its parameters. In particular, the range of the interactions is completely determined by the scale of the spacing between stimulus bars; and the distance-dependent excitation and inhibition in the recurrent weights is determined by the nature of the bisection task. This may be why there is little transfer of learning between tasks when the nature and the spatial scale of the task change, even if the same input units are involved. However, our recurrent interaction model does predict that transfer is likely when the spacing between the two outer bars (here at Δx = 2) changes by a small fraction. Further, since the signs of the recurrent synapses change drastically with the distance between the interacting cells, negative transfer is likely between two bisection tasks of slightly different spatial scales. We are planning to test this prediction. \n\nAchieving selectivity at the same time as translation invariance is a very basic requirement for position-invariant object recognition (see Riesenhuber & Poggio 1999 for a recent discussion), and arises in a pure form in this bisection task. 
Note, for instance, that trying to cope with different values of y by averaging spatially shifted versions of the optimal weights for y = 0 (figure 2A) would be hopeless, since this would erase (or at the very least blur) the precise spatial positioning of the peaks and troughs which underlies the discrimination power. It would be possible to scan the input for the value of y that fits best and then apply the discriminator centered about that value; indeed, this is conceptually what the neocognitron (Fukushima 1980) and the MAX-model (Riesenhuber & Poggio 1999) do using layers of linear and non-linear combination. In our case, we have shown, at least for fairly small y, that the optimal non-linearity for the task is a simple quadratic. \n\nAcknowledgements \n\nFunding is from the Gatsby Charitable Foundation. We are very grateful to Shimon Edelman, Manfred Fahle and Maneesh Sahani for discussions. \n\nReferences \n\n[1] Karni A, Sagi D. Nature 365:250-252, 1993. \n[2] Fahle M, Edelman S, Poggio T. Vision Res. 35:3003-3013, 1995. \n[3] Fahle M. Perception 23:411-427, 1994. Also Fahle M. Vision Res. 37(14):1885-1895, 1997. \n[4] Poggio T, Fahle M, Edelman S. Science 256:1018-1021, 1992. \n[5] Weiss Y, Edelman S, Fahle M. Neural Computation 5:695-718, 1993. \n[6] Li Z. Network: Computation in Neural Systems 10(2):187-212, 1999. \n[7] Pouget A, Zhang K, Deneve S, Latham PE. Neural Comput. 10(2):373-401, 1998. \n[8] Seung HS, Sompolinsky H. Proc Natl Acad Sci USA 90(22):10749-10753, 1993. \n[9] Koch C. Biophysics of Computation. Oxford University Press, 1999. \n[10] Gilbert C. Presentation at the Neural Dynamics Workshop, Gatsby Unit, 2/2000. \n[11] Riesenhuber M, Poggio T. Nat Neurosci. 2(11):1019-1025, 1999. \n[12] Fukushima K. Biol. Cybern. 36:193-202, 1980. 
", "award": [], "sourceid": 1883, "authors": [{"given_name": "Zhaoping", "family_name": "Li", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}