{"title": "Learning visual motion in recurrent neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1322, "page_last": 1330, "abstract": "We present a dynamic nonlinear generative model for visual motion based on a latent representation of binary-gated Gaussian variables. Trained on sequences of images, the model learns to represent different movement directions in different variables. We use an online approximate-inference scheme that can be mapped to the dynamics of networks of neurons. Probed with drifting grating stimuli and moving bars of light, neurons in the model show patterns of responses analogous to those of direction-selective simple cells in primary visual cortex. Most model neurons also show speed tuning and respond equally well to a range of motion directions and speeds aligned to the constraint line of their respective preferred speed. We show how these computations are enabled by a specific pattern of recurrent connections learned by the model.", "full_text": "Learning visual motion in recurrent neural networks\n\nMarius Pachitariu, Maneesh Sahani\nGatsby Computational Neuroscience Unit\n\nUniversity College London, UK\n\n{marius, maneesh}@gatsby.ucl.ac.uk\n\nAbstract\n\nWe present a dynamic nonlinear generative model for visual motion based on a\nlatent representation of binary-gated Gaussian variables. Trained on sequences of\nimages, the model learns to represent different movement directions in different\nvariables. We use an online approximate inference scheme that can be mapped\nto the dynamics of networks of neurons. Probed with drifting grating stimuli and\nmoving bars of light, neurons in the model show patterns of responses analogous\nto those of direction-selective simple cells in primary visual cortex. Most model\nneurons also show speed tuning and respond equally well to a range of motion\ndirections and speeds aligned to the constraint line of their respective preferred\nspeed. 
We show how these computations are enabled by a specific pattern of recurrent connections learned by the model.

1 Introduction

Perhaps the most striking property of biological visual systems is their ability to efficiently cope with the high-bandwidth data streams received from the eyes. Continuous sequences of images represent complex trajectories through the high-dimensional nonlinear space of two-dimensional images. The survival of animal species depends on their ability to represent these trajectories efficiently and to distinguish visual motion on a fast time scale. Neurophysiological experiments have revealed complicated neural machinery dedicated to the computation of motion [1]. In primates, the classical picture of the visual system distinguishes between an object-recognition-focused ventral pathway and an equally large dorsal pathway for object localization and visual motion. In this paper we propose a model for the very first cortical computation in the dorsal pathway: that of direction-selective simple cells in primary visual cortex [2]. We continue a line of models which treats visual motion as a general sequence learning problem and proposes asymmetric Hebbian rules for learning such sequences [3, 4]. We reformulate these earlier models in a generative probabilistic framework, which allows us to train them on sequences of natural images. For inference we use an online approximate filtering method which resembles the dynamics of recurrently-connected neural networks.

Previous low-level generative models of image sequences have mostly treated time as a third dimension in a sparse coding problem [5]. These approaches have thus far been difficult to map onto neural architecture, as they have been implemented with noncausal inference algorithms. 
Furthermore, the spatiotemporal sensitivity of each learned variable is determined by a separate three-dimensional basis function, requiring many variables to encode all possible orientations, directions of motion and speeds. Cortical architecture points to a more distributed formation of motion representation, with temporal sensitivity determined by the interaction of neurons with different spatial receptive fields. Another major line of generative models of video analyzes the slowly changing features of visual input and proposes complex cells as such slow feature learners [6, 7]. However, these models are not expressive enough to encode visual motion and are more specifically designed to discover image dimensions invariant in time.

A recent hierarchical generative model for mid-level visual motion separates the phases and amplitudes of complex coefficients applied to complex spatial basis functions [8]. This separation makes it possible to build a second layer of variables that specifies a distribution on the phase coefficients alone. This second layer learns to pool together first-layer neurons with similar preferred directions. The introduction of real and imaginary parts in the basis functions is inspired by older energy-based approaches, in which pairs of neurons with receptive fields in quadrature phase feed their outputs with different time delays to a higher-order neuron, which thereby acquires direction selectivity. The model of [8], and models based on motion energy in general, do not however reproduce direction-selective simple cells. In this paper we propose a network in which local motion is computed in a more distributed fashion than is postulated by feedforward implementations of energy models.

1.1 Recurrent Network Models for Neural Sequence Learning.

Another view of the development of visual motion processing sees it as a special case of the general problem of sequence learning [4]. 
Many structures in the brain seem to show various forms of sequence learning, and recurrent networks of neurons can naturally produce learned sequences through their dynamics [9, 10]. Indeed, it has been suggested that the reproduction of remembered sequences within the hippocampus has an important navigational role. Similarly, motor systems must be able to generate sequences of control signals that drive appropriate muscle activity. Thus many neural sequence models are fundamentally generative. By contrast, it is not evident that V1 should need to reproduce the learnt sequences of retinal input that represent visual motion. Although generative modelling provides a powerful mathematical device for the construction of inferential sensory representations, the role of actual generation has been debated. Is there really a potential connection, then, to the generative sequence-reproduction models developed for other areas?

One possible role for explicit sequence generation in a sensory system is prediction. Predictive coding has indeed been proposed as a central mechanism of visual processing [11] and even as a more general theory of cortical responses [12]. More specifically as a visual motion learning mechanism, sequence learning forms the basis of an earlier simple but biophysically realistic toy model based on STDP at the lateral synapses of a recurrently connected network [4]. In another biophysically realistic model, recurrent connections are set by hand rather than learned, but they produce direction selectivity and speed tuning in simulations of cat primary visual cortex [13]. Thus, the recurrent mechanisms of sequence learning may indeed be important. In the following section we define mathematically a probabilistic sequence-modelling network, a nonlinear dynamical system with 512 densely interconnected latent variables, which can learn patterns of visual motion from 16 by 16 image patches in an unsupervised manner.
Figure 1: a. Toy sequence learning model with biophysically realistic neurons from [4]. Neurons N1 and N2 have the same RF as indicated by the dotted line, but after STDP learning of the recurrent connections with other neurons in the chain, N1 and N2 learn to fire only for rightward and leftward motion, respectively. b. Graphical model representation of the bgG-RNN. The square box indicates that the variable z^t is not random, but is given by z^t = x^t ∘ h^t.

2 Probabilistic Recurrent Neural Networks

In this section we introduce the binary-gated Gaussian recurrent neural network as a generative model of sequences of images. The model belongs to the class of nonlinear dynamical systems. Inference methods in such models typically require expensive variational [14] or sampling-based [15] approximations, but we found that a low-cost online filtering method works sufficiently well to learn an interesting model. We begin with a description of binary-gated Gaussian sparse coding for still images and then describe how to define the dependencies in time between variables.

2.1 Binary-Gated Gaussian Sparse Coding (bgG-SC).

Binary-gated Gaussian sparse coding ([16], also called spike-and-slab sparse coding [17]^1) may be seen as a limit of sparse coding with a mixture-of-Gaussians prior [18] in which one mixture component has zero variance. 
Mathematically, the data y^t is obtained by multiplying a matrix W of basis filters with a vector h^t ∘ x^t, where ∘ denotes the Hadamard or element-wise product, x^t ∈ R^N is Gaussian and spherically distributed with standard deviation τ_x, and h^t ∈ {0, 1}^N is a vector of independent Bernoulli-distributed elements with success probabilities p. Finally, small amounts of isotropic Gaussian noise with standard deviation τ_y are added to produce y^t. For notational consistency with the proposed dynamic version of this model, the t superscript indexes time. The joint log-likelihood is

L^t_SC = −‖y^t − W(h^t ∘ x^t)‖² / 2τ_y² − ‖x^t‖² / 2τ_x² + Σ_{j=1}^{N} ( h^t_j log p_j + (1 − h^t_j) log(1 − p_j) ) + const,   (1)

where N is the number of basis filters in the model. By using appropriately small activation probabilities p, the effective prior on h^t ∘ x^t can be made arbitrarily sparse. Probabilistic inference in sparse coding is intractable, but efficient variational approximation methods exist. We use a very fast approximation to MAP inference, the matching pursuit algorithm (MP) [19]. Instead of using MP to extract a fixed number of coefficients per patch as usual, we extract coefficients for as long as the joint log-likelihood increases. Patches with more complicated structure will naturally require more coefficients to code. Once values for x^t and h^t are filled in, the gradient of the joint log-likelihood with respect to the parameters is easy to derive. Note that the x^t_k for which h^t_k = 0 can be integrated out of the likelihood, as they receive no contribution from the data term in (1). Due to the MAP approximation, only W can be learned. We therefore set τ_x² and τ_y² to reasonable values, both on the order of the data variance. 
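As a concrete illustration, the likelihood-driven matching-pursuit step described above can be sketched as follows. This is a minimal sketch rather than the paper's exact implementation: the function name, the least-squares coefficient with Gaussian-prior shrinkage, and the stopping rule are our reading of equation (1).

```python
import numpy as np

def bgg_sc_infer(y, W, tau_y, tau_x, p, max_iter=50):
    """Greedy matching-pursuit MAP inference for binary-gated Gaussian
    sparse coding: activate one unit at a time, keeping a coefficient
    only while the joint log-likelihood of equation (1) increases."""
    D, N = W.shape
    x = np.zeros(N)
    h = np.zeros(N, dtype=bool)

    def log_lik(x, h):
        recon = W @ (x * h)
        ll = -np.sum((y - recon) ** 2) / (2 * tau_y ** 2)
        ll -= np.sum(x ** 2) / (2 * tau_x ** 2)
        ll += np.sum(np.where(h, np.log(p), np.log(1 - p)))
        return ll

    best = log_lik(x, h)
    residual = y.copy()
    for _ in range(max_iter):
        # MP step: pick the inactive filter most correlated with the residual
        corr = W.T @ residual
        j = np.argmax(np.abs(corr) * ~h)
        x_new, h_new = x.copy(), h.copy()
        h_new[j] = True
        # least-squares coefficient, shrunk by the Gaussian prior on x
        x_new[j] = corr[j] / (W[:, j] @ W[:, j] + tau_y ** 2 / tau_x ** 2)
        ll = log_lik(x_new, h_new)
        if ll <= best:  # stop as soon as the likelihood no longer increases
            break
        best, x, h = ll, x_new, h_new
        residual = y - W @ (x * h)
    return x, h
```

Because the stopping rule is likelihood-based, patches with more structure naturally recruit more coefficients, as described in the text.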
We also adapted p_k during learning so that each filter was selected by the MP process a roughly equal number of times. This helped to stabilise learning, avoiding the tendency toward very unequal convergence rates that we otherwise encountered.

When applied to whitened small patches from images, the algorithm produced localized Gabor-like receptive fields, as usual for sparse coding, with a range of frequencies, phases, widths and aspect ratios. The same model was shown in [16] to reproduce the diverse shapes of V1 receptive fields, unlike standard sparse coding [20]. We found that when we varied the average number of coefficients recruited per image, the receptive fields of the learned filters varied in size. For example, with only one coefficient per image, a large number of filters represented edges extending from one end of the patch to the other. With a larger number of coefficients, the filters concentrated their mass around just a few pixels. With even more coefficients, the learned filters gradually became Fourier-like. During learning, we gradually adapted the average activation of each variable h^t_k by changing the prior activation probabilities p_k. 
For 16x16 patches in a twice-overcomplete SC model (number of filters = twice the number of pixels), we found that learning with 10-50 coefficients on average prevented the filters from becoming too much or too little localized in space.

^1 Although less evocative, we prefer the term 'binary-gated Gaussian' to 'spike-and-slab' partly because our slab is really more of a hump, and partly because the spike refers to a feature seen only in the density of the product h_i x_i, rather than in either the values or distributions of the component variables.

2.2 Binary-Gated Gaussian Recurrent Neural Network (bgG-RNN).

To obtain a dynamic hidden model for sequences of images {y^t} we specify the following conditional probabilities between the hidden chains of variables h^t, x^t:

P(x^{t+1}, h^{t+1} | x^t, h^t) = P(x^{t+1}) P(h^{t+1} | h^t ∘ x^t)
P(x^{t+1}) = N(0, τ_x² I)
P(h^{t+1} = 1 | h^t ∘ x^t) = σ(R(h^t ∘ x^t) + b),   (2)

where R is a matrix of recurrent connections, b is a vector of biases and σ is the standard sigmoid function σ(a) = 1/(1 + exp(−a)). Note how the x^t are always drawn independently, while the conditional probability for h^{t+1} depends only on h^t ∘ x^t. We arrived at this design based on a few observations. First, as in bgG-SC, the conditional dependence on h^t ∘ x^t allows us to integrate out the variables in x^t, x^{t+1} for which the respective gates in h^t, h^{t+1} are 0. Second, we observed that adding Gaussian linear dependencies between x^{t+1} and x^t ∘ h^t did not qualitatively modify the results reported here. However, dropping P(h^{t+1} | h^t ∘ x^t) in favor of P(x^{t+1} | h^t ∘ x^t) resulted in a model which could no longer learn a direction-selective representation. 
For simplicity we chose the minimal model specified by (2). The full log-likelihood for the bgG-RNN is L_bgG-RNN = Σ_t L^t_bgG-RNN, where

L^t_bgG-RNN = const − ‖y^t − W(x^t ∘ h^t)‖² / 2τ_y² − ‖x^t‖² / 2τ_x²
 + Σ_{j=1}^{N} h^t_j log σ(R(h^{t−1} ∘ x^{t−1}) + b)_j
 + Σ_{j=1}^{N} (1 − h^t_j) log(1 − σ(R(h^{t−1} ∘ x^{t−1}) + b)_j),   (3)

where x^0 = 0 and h^0 = 0 are both defined to be vectors of zeros.

2.3 Inference and learning of bgG-RNN.

The goal of inference is to set the values of x̂^t, ĥ^t for all t in such a way as to maximize the objective set by (3). Assuming we have already set x̂^t, ĥ^t for t = 1 to T, we propose to obtain x̂^{T+1}, ĥ^{T+1} exclusively from x̂^T, ĥ^T. This scheme might be called greedy filtering. In greedy filtering, inference is causal and Markov with respect to time. At step T + 1 we only need to solve a simple SC problem given by the slice L^{T+1}_bgG-RNN of the likelihood (3), where x^T, h^T have been replaced with the estimates x̂^T, ĥ^T. The greedy filtering algorithm proposed here scales linearly with the number of time steps considered and is well suited for online inference. The algorithm might not produce very accurate estimates of the global MAP settings of the hidden variables, but we found it was sufficient for learning a complex bgG-RNN model. In addition, its simplicity, coupled with the fast MP algorithm in each L^t_bgG-RNN slice, resulted in very fast inference and consequently fast learning. 
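The greedy filtering loop can be sketched as below. This is a minimal illustration under stated assumptions: `sc_infer` stands for any static bgG-SC solver with the signature `sc_infer(y, W, tau_y, tau_x, p)` (for example, a matching-pursuit routine), and the recurrent term simply resets the per-unit prior activation probabilities of each time slice.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def greedy_filter(frames, W, R, b, tau_y, tau_x, sc_infer):
    """Greedy filtering for the bgG-RNN: estimates for time t are computed
    once, from the slice likelihood L^t alone, with the previous estimates
    (x^{t-1}, h^{t-1}) frozen. Inference is causal and Markov in time and
    scales linearly with the number of frames."""
    N = W.shape[1]
    x_prev = np.zeros(N)
    h_prev = np.zeros(N)
    estimates = []
    for y in frames:
        # recurrent prediction sets the prior activation probability per unit
        p_t = sigmoid(R @ (h_prev * x_prev) + b)
        p_t = np.clip(p_t, 1e-6, 1 - 1e-6)  # keep the log-priors finite
        x_t, h_t = sc_infer(y, W, tau_y, tau_x, p_t)
        estimates.append((x_t, h_t))
        x_prev, h_prev = x_t, h_t.astype(float)
    return estimates
```

Because each slice is solved once and never revisited, the scheme trades the accuracy of global MAP inference for the speed and causality the text emphasizes.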
We usually learned models in under one hour on a quad-core workstation.

Due to our approximate inference scheme, some parameters in the model had to be set manually. These are τ_x² and τ_y², which control the relative strengths in the likelihood of three terms: the data likelihood, the smallness prior on the Gaussian variables, and the interaction between sets of x^t, h^t consecutive in time. In our experiments we set τ_y² equal to the data variance and τ_x² = 2τ_y². We found that such large levels of expected observation noise were necessary to drive robust learning in R.

For learning we initialized parameters randomly to small values and first learned W exclusively. Once the filters converge, we turn on learning for R; W does not change very much beyond this point. We found learning of R was sensitive to the learning rate. We set the learning rate to 0.05 per batch, used a momentum term of 0.75 and batches of 30 sets of 100-frame sequences. We stabilized the mean activation probability of each neuron individually by actively and quickly tuning the biases b during learning. We whitened images with a center-surround filter and standardized the whitened pixel values.

The gradient required for learning R shows similarities to the STDP learning rule used in [3] and [4]:

∂L^t_bgG-RNN / ∂R_jk = ( h^t_j − σ(R(h^{t−1} ∘ x^{t−1}) + b)_j ) · h^{t−1}_k x^{t−1}_k.   (4)

We will assume for neural interpretation that the positive and negative values of x^t ∘ h^t are encoded by different neurons. If for a given neuron x^{t−1}_k is always positive, then the gradient (4) is only strictly positive when h^{t−1}_k = 1 and h^t_j = 1, and strictly negative when h^{t−1}_k = 1 and h^t_j = 0. In other words, the connection R_jk is strengthened when neuron k appears to cause neuron j to activate, and weakened if neuron k fails to activate neuron j. A similar effect can be observed for the negative part of x^{t−1}_k. This kind of Hebbian rule is widespread in cortex for long-term learning and is used in previous computational models of neural sequence learning that partly motivated our work [4].

3 Results

For data, we selected about 100 short clips, each 100 frames long, from a high-resolution BBC wildlife documentary. Clips were chosen only if they seemed on visual inspection to have sufficient motion energy over the 100 frames. The clips chosen ended up being mostly panning shots and close-ups of animals in their natural habitats (the closer the camera is to a moving object, the faster it appears to move).

The results presented below measure the ability of the model to produce responses similar to those of neurons recorded in primate experiments. The stimuli used in these experiments are typically of two kinds: drifting gratings presented inside circular or square apertures, or translating bars of various lengths. These two kinds of stimuli produce very clear motion signals, unlike motion produced by natural movies. In fact, most patches we used in training contained a wide range of spatial orientations, most of which were not orthogonal to the direction of local translation. After comparing model responses to neural data, we finish with an analysis of the network connectivity pattern that underlies the responses of model neurons.

3.1 Measuring responses in the model.

We needed to deal with the potential negativity of the variables in the model, since neural responses are always positive quantities. We decided to separate the positive and negative parts of the Gaussian variables into two distinct sets of responses. 
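The rectified read-out just described amounts to splitting each signed coefficient into two non-negative channels; a minimal sketch (function name is ours):

```python
import numpy as np

def rectified_responses(x, h):
    """Split the signed coefficients x * h into two non-negative response
    channels, one per sign, mimicking the separation of the positive and
    negative parts of the Gaussian variables into distinct model neurons."""
    z = x * h
    return np.maximum(z, 0.0), np.maximum(-z, 0.0)
```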
This interpretation is relatively common for sparse coding models, and we also found that in many units direction selectivity was enhanced when the positive and negative parts of x^t were separated (as opposed to taking h^t as the neural response). The enhancement was supported by a particular pattern of network connectivity which we describe in a later subsection.

Since our inference procedure is deterministic, it will produce the exact same response to the same stimulus every time. We added Gaussian noise to the spatially whitened test image sequences, partly to capture the noisy environments in cortex and partly to show robustness of direction selectivity to noise. The amount of noise added was about half the expected variance of the stimulus.

3.2 Direction selectivity and speed tuning.

Direction selectivity is measured with the following index: DI = 1 − R_opp / R_max. Here R_max represents the response of a neuron in its preferred direction, while R_opp is the response in the direction opposite to that preferred. This selectivity index is commonly used to characterize neural data. To define a neuron's preferred direction, we inferred latent coefficients over many repetitions of square gratings drifting in 24 directions, at speeds ranging from 0 to 3 pixels/frame in steps of 0.25. The periodicity of the stimulus was twice the patch size, so that motion locally appeared as an advancing long edge. The neuron's preferred direction was defined as the direction in which it responded most strongly, averaged over all speeds. Once a preferred direction was established, we defined the neuron's preferred speed as the speed at which it responded most strongly in its preferred direction. Finally, at this preferred speed and direction, we calculated the DI of the neuron. Similar results were obtained if we averaged over all speed conditions.

Figure 2: a. Speed tuning of 16 randomly chosen neurons. 
Note that some neurons respond only weakly without motion, some are inhibited in the non-preferred direction compared to static responses, and most have a clear peak in the preferred direction at specific speeds. b. top: Histogram of direction selectivity indices. bottom: Histogram of preferred speeds. c. For each of the 10 strongest excitatory connections per neuron we plot a dot indicating the orientation selectivity of the pre- and post-synaptic units. Note that most of the points are within π/4 of the diagonal, an area marked by the black lines. Notice also the relatively increased frequency of horizontal and vertical edges.

We found that most neurons in the model had sharp tuning curves and direction-selective responses. We cross-validated the value of the direction index with a new set of responses (fixing the preferred direction) to obtain an average DI of 0.65, with many neurons having a DI close to 1 (see figure 2(b)). This distribution is similar to those shown in [21] for real V1 neurons. 714 of 1024 neurons were classified as direction-selective, on the basis of having DI > 0.5. Distributions of direction indices and optimal speeds are shown in figure 2(b). A neuron's preferred direction was always close to orthogonal to the axis of its Gabor receptive field, except for a few degenerate cases around the edges of the patch. We defined the population tuning curve as the average of the tuning curves of individual neurons, each aligned by their preferred direction of motion. The DI of the population was 0.66. Neurons were also speed tuned, in that responses could vary greatly and systematically as a function of speed, and DI was non-constant as a function of speed (see figure 2(a)). Usually at low and high speeds the DI was 0, but in between a variety of responses were observed. 
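The procedure above for assigning a preferred direction, a preferred speed, and a DI to each neuron can be sketched as follows (a minimal illustration; the function name and the evenly-spaced-directions assumption are ours):

```python
import numpy as np

def direction_index(responses, directions):
    """DI = 1 - R_opp / R_max from a (num_directions x num_speeds) array of
    trial-averaged responses, following the text: preferred direction from
    the speed-averaged tuning curve, preferred speed at that direction,
    then DI at that (direction, speed). Assumes `directions` evenly spaced
    over 360 degrees, so the opposite direction is half the set away."""
    n_dir = len(directions)
    pref_dir = np.argmax(responses.mean(axis=1))    # average over speeds
    pref_speed = np.argmax(responses[pref_dir])     # best speed at pref dir
    opp_dir = (pref_dir + n_dir // 2) % n_dir
    r_max = responses[pref_dir, pref_speed]
    r_opp = responses[opp_dir, pref_speed]
    return 1.0 - r_opp / r_max
```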
Speed tuning is also present in recorded V1 neurons [22], and could form the basis for global motion computation based on the intersection-of-constraints method [23].

3.3 Vector velocity tuning.

To get a more detailed description of single-neuron tuning, we investigated responses to different stimulus velocities. Since drifting gratings only contain motion orthogonal to their orientation, we switched to small (1.25 pix x 2 pix) drifting Gabors for these experiments. We tested the network's behavior with a full set of 24 Gabor orientations, drifting in a full set of 24 directions with speeds ranging from 0.25 pixels/frame to 3 pixels/frame, for a total of 6912 = 24 x 24 x 12 conditions, with hundreds of repetitions of each condition. For each neuron we isolated its responses to drifting Gabors of the same orientation travelling at the 12 different speeds in the 24 different directions. We present these for several neurons in polar plots in figure 3(b). Note that responses tend to be high to vector velocities lying on a particular line.

3.4 Connectomics in silico

We had anticipated that the network would learn direction selectivity via specific patterns of recurrent connection, in a fashion similar to the toy model studied in [4]. We now show that the pattern of connectivity indeed supports this computation.

The most obvious connectivity pattern, clearly visible for single neurons in figure 3(a), shows that neurons in the model excite other neurons in their preferred direction and inhibit neurons in the opposite direction. This asymmetric wiring naturally supports direction selectivity. Asymmetry is not sufficient for direction selectivity to emerge. 
In addition, strong excitatory projections have to connect together neurons with similar preferred orientations and similar preferred directions. Only then will direction information propagate in the network in the identities of the active variables (and the signs of their respective coefficients x^t). We considered for each neuron its 10 strongest excitatory outputs and calculated the expected deviation between the orientation of these outputs and the orientation of the root neuron. The average deviation was 23°, half the expected deviation if connections were random. Figure 2(c) shows a raster plot of the pairs of orientations. The same pattern held when we considered the strongest excitatory inputs to a given neuron, with an expected deviation of orientations of 24°. We could not directly measure whether neurons connected together according to direction selectivity, because of the sign ambiguity of the x^t variables. One can visually assess in figure 3(a) that neurons connected asymmetrically with respect to their RF axis, but did they also respond to motion primarily in that direction? As can be seen in figure 3(b), which shows the same neurons as figure 3(a), they did indeed. Direction tuning is a measure of the incoming connections to a neuron, while figure 3(a) shows the outgoing connections. We can therefore qualitatively assess that recurrence primarily connected together neurons with similar direction preferences.

Figure 3: a. Each plot is based on the outgoing connections of a random set of direction-selective neurons. The centers of the Gabor fits to the neurons' receptive fields are shown as circles on a square representing the 16 by 16 image patch. The root neurons are shown as filled black circles. Filled red/blue circles show neurons to which the root neurons have strong positive/negative connections, with a cutoff at one fourth of the maximal absolute connection. 
The width of the connecting lines and the area of the filled circles are proportional to the strength of the connection. A dynamic version of this plot during learning is shown as a movie in the supplementary material. b. The polar plots show the responses of the neurons presented in a to small drifting Gabors that match their respective orientations. Neurons are aligned in exactly the same manner on the 4 by 4 grid. Every small disc in every polar plot represents one combination of speed and direction, and the color of the disc represents the magnitude of the response, with intense red being maximal and dark blue minimal. The vector from the center of the polar plot to the center of each disc is proportional to the vector displacement of each consecutive frame in the stimulus sequence. Increasing disc sizes at faster speeds are used for display purposes. The very last polar plot shows the average of the responses of the entire population, when all neurons are aligned by their preferred direction.

We also observed that neurons mostly projected strong excitatory outputs to other neurons that were aligned parallel to the root neuron's main axis (visible in figure 3(a)). We think this is related to the fact that locally all edges appear to translate parallel to themselves. A neuron X with a preferred direction v and preferred speed s has a so-called constraint line (CL), parallel to the Gabor's axis. When the neuron is activated by an edge E, the constraint line is formed by all possible future locations of edge E that are consistent with global motion in the direction v with speed s. Due to the presence of long contours in natural scenes, the activation of X can predict at the next time step the activations of other neurons with RFs aligned on the CL. Our likelihood function encourages the model to learn to make such predictions as well as it can. 
To quantify the degree to which connections were made along a CL, for each neuron we fit a 2D Gaussian to the distribution of RF positions of the 20 most strongly connected neurons (the filled red circles in figure 3(a)), each further weighted by its connection strength. The major axis of each Gaussian represents the constraint line of the root neuron; in 862 out of 1024 neurons it is less than 15° away from perfectly parallel to the root neuron's axis. The distance of each neuron to its constraint line was on average 1.68 pixels. Yet perhaps the strongest manifestation of the CL tuning property of neurons in the model can be seen in their responses to small stimuli drifting with different vector velocities. Many of the neurons in figure 3(b) respond best when the velocity vector ends on the constraint line, and a similar trend holds for the aligned population average.

It is already known from experiments mapping axons simultaneously with dye-sensitive imaging that neurons in V1 are more likely to connect with neurons of similar orientations situated as far away as 4 mm / 4-8 minicolumns away [24]. The model presented here makes three further predictions: that neurons connect more strongly to neurons in their preferred direction, that connected neurons lie on the constraint line, and that they have similar preferred directions to the root neuron.

4 Discussion

We have shown that a network of recurrently-connected neurons can learn to discriminate motion direction at the level of individual neurons. Online greedy filtering in the model is a sufficient approximate-inference method to produce direction-selective responses. Fast, causal and online inference is a necessary requirement for practical vision systems (such as the brain), but previous visual-motion models did not provide such an implementation of their inference algorithms. 
Another shortcoming of these previous models is that they obtain direction selectivity by having variables with different RFs at different time lags, effectively treating time as a third spatial dimension. A dynamic generative model may be better suited for online inference, with methods such as particle filtering, assumed density filtering, or the far cheaper method employed here, greedy filtering.

The model neurons can be interpreted as predicting the motion of the stimulus. The lateral inputs they receive are, however, not sufficient in themselves to produce a response; the prediction also has to be consistent with the bottom-up input. When the two sources of information disagree, the network compromises, but not always in favor of the bottom-up input, as this source of information might be noisy. This is reflected by the decrease in reconstruction accuracy from 80% to 60% after learning the recurrent connections. It is tempting to think of V1 direction-selective neurons not only as edge detectors and contour predictors (through the nonclassical RF) but also as predictors of future edge locations, through their specific patterns of connectivity.

The source of direction selectivity in cortex is still an unresolved question, but note that in the retina of non-primate mammals it is known with some certainty that recurrent inhibition in the non-preferred direction is largely responsible for the direction selectivity of retinal ganglion cells [25]. It is also known that, unlike orientation and ocular dominance, direction selectivity requires visual experience to develop [26], perhaps because direction selectivity depends on a specific pattern of lateral connectivity, unlike the largely feedforward orientation and binocular tuning. 
Another experiment showed that after many exposures to the same moving stimulus, the sequence of spikes triggered in different neurons along the motion trajectory was also triggered in the complete absence of motion, again indicating that motion signals in cortex may be generated internally from lateral connections [27].

Thus, we see a number of reasons to propose that direction selectivity in the cortex may indeed develop and be computed through a mechanism analogous to the one we have developed here. If so, then experimental tests of the various predictions developed above should prove revealing.

References

[1] A Mikami, WT Newsome, and RH Wurtz. Motion selectivity in macaque visual cortex. II. Spatiotemporal range of directional interactions in MT and V1. Journal of Neurophysiology, 55(6):1328–1339, 1986.

[2] MS Livingstone. Mechanisms of direction selectivity in macaque V1. Neuron, 20:509–526, 1998.

[3] LF Abbott and KI Blum. Functional significance of long-term potentiation for sequence learning and prediction. Cerebral Cortex, 6:406–416, 1996.

[4] RPN Rao and TJ Sejnowski. Predictive sequence learning in recurrent neocortical circuits. Advances in Neural Information Processing, 12:164–170, 2000.

[5] B Olshausen. Learning sparse, overcomplete representations of time-varying natural images. IEEE International Conference on Image Processing, 2003.

[6] P Berkes, RE Turner, and M Sahani. A structured model of video produces primary visual cortical organisation. PLoS Computational Biology, 5, 2009.

[7] L Wiskott and TJ Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.

[8] C Cadieu and B Olshausen. Learning transformational invariants from natural movies. Advances in Neural Information Processing, 21:209–216, 2009.

[9] D Barber. Learning in spiking neural assemblies.
Advances in Neural Information Processing, 15, 2002.

[10] J Brea, W Senn, and JP Pfister. Sequence learning with hidden units in spiking neural networks. Advances in Neural Information Processing, 24, 2011.

[11] RP Rao and DH Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87, 1999.

[12] K Friston. A theory of cortical responses. Phil. Trans. R. Soc. B, 360(1456):815–836, 2005.

[13] RJ Douglas, C Koch, M Mahowald, KA Martin, and HH Suarez. Recurrent excitation in neocortical circuits. Science, 269(5226):981–985, 1995.

[14] TP Minka. Expectation propagation for approximate Bayesian inference. UAI'01 Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369, 2001.

[15] A Doucet, N de Freitas, K Murphy, and S Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. UAI'00 Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 176–183, 2000.

[16] M Rehn and FT Sommer. A network that uses few active neurones to code visual input predicts the diverse shapes of cortical receptive fields. Journal of Computational Neuroscience, 22:135–146, 2007.

[17] IJ Goodfellow, A Courville, and Y Bengio. Spike-and-slab sparse coding for unsupervised feature discovery. arXiv:1201.3382v2, 2012.

[18] BA Olshausen and KJ Millman. Learning sparse codes with a mixture-of-Gaussians prior. Advances in Neural Information Processing, 12, 2000.

[19] SG Mallat and Z Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

[20] BA Olshausen and DJ Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images.
Nature, 381:607–609, 1996.

[21] MR Peterson, B Li, and RD Freeman. The derivation of direction selectivity in the striate cortex. Journal of Neuroscience, 24:3583–3591, 2004.

[22] GA Orban, H Kennedy, and J Bullier. Velocity sensitivity and direction selectivity of neurons in areas V1 and V2 of the monkey: influence of eccentricity. Journal of Neurophysiology, 56(2):462–480, 1986.

[23] EP Simoncelli and DJ Heeger. A model of neuronal responses in visual area MT. Vision Research, 38(5):743–761, 1998.

[24] WH Bosking, Y Zhang, B Schofield, and D FitzPatrick. Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. Journal of Neuroscience, 17(6):2112–2127, 1997.

[25] SI Fried, TA Munch, and FS Werblin. Directional selectivity is formed at multiple levels by laterally offset inhibition in the rabbit retina. Neuron, 46(1):117–127, 2005.

[26] Y Li, D FitzPatrick, and LE White. The development of direction selectivity in ferret visual cortex requires early visual experience. Nature Neuroscience, 9(5):676–681, 2006.

[27] S Xu, W Jiang, M Poo, and Y Dan. Activity recall in a visual cortical ensemble. Nature Neuroscience, 15:449–455, 2012.