{"title": "Code-specific policy gradient rules for spiking neurons", "book": "Advances in Neural Information Processing Systems", "page_first": 1741, "page_last": 1749, "abstract": "Although it is widely believed that reinforcement learning is a suitable tool for describing behavioral learning, the mechanisms by which it can be implemented in networks of spiking neurons are not fully understood. Here, we show that different learning rules emerge from a policy gradient approach depending on which features of the spike trains are assumed to influence the reward signals, i.e., depending on which neural code is in effect. We use the framework of Williams (1992) to derive learning rules for arbitrary neural codes. For illustration, we present policy-gradient rules for three different example codes - a spike count code, a spike timing code and the most general \"full spike train\" code - and test them on simple model problems. In addition to classical synaptic learning, we derive learning rules for intrinsic parameters that control the excitability of the neuron. The spike count learning rule has structural similarities with established Bienenstock-Cooper-Munro rules. If the distribution of the relevant spike train features belongs to the natural exponential family, the learning rules have a characteristic shape that raises interesting prediction problems.", "full_text": "Code-specific policy gradient rules for spiking neurons\n\nHenning Sprekeler\u2217 Guillaume Hennequin Wulfram Gerstner\n\nLaboratory for Computational Neuroscience\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne\n\n1015 Lausanne\n\nAbstract\n\nAlthough it is widely believed that reinforcement learning is a suitable tool for\ndescribing behavioral learning, the mechanisms by which it can be implemented\nin networks of spiking neurons are not fully understood. 
Here, we show that different\nlearning rules emerge from a policy gradient approach depending on which\nfeatures of the spike trains are assumed to influence the reward signals, i.e., depending\non which neural code is in effect. We use the framework of Williams\n(1992) to derive learning rules for arbitrary neural codes. For illustration, we\npresent policy-gradient rules for three different example codes - a spike count\ncode, a spike timing code and the most general \u201cfull spike train\u201d code - and test\nthem on simple model problems. In addition to classical synaptic learning, we\nderive learning rules for intrinsic parameters that control the excitability of the\nneuron. The spike count learning rule has structural similarities with established\nBienenstock-Cooper-Munro rules. If the distribution of the relevant spike train\nfeatures belongs to the natural exponential family, the learning rules have a characteristic\nshape that raises interesting prediction problems.\n\n1 Introduction\n\nNeural implementations of reinforcement learning have to solve two basic credit assignment problems:\n(a) the temporal credit assignment problem, i.e., the question which of the actions that were\ntaken in the past were crucial to receiving a reward later and (b) the spatial credit assignment problem,\ni.e., the question which neurons in a population were important for getting the reward and\nwhich ones were not.\nHere, we argue that an additional credit assignment problem arises in implementations of reinforcement\nlearning with spiking neurons. Presume that we know that the spike pattern of one specific\nneuron within one specific time interval was crucial for getting the reward (that is, we have already\nsolved the first two credit assignment problems). Then, there is still one question that remains:\nWhich feature of the spike pattern was important for the reward? 
Would any spike train with the\nsame number of spikes yield the same reward or do we need precisely timed spikes to get it? This\ncredit assignment problem is in essence the question which neural code the output neuron is (or\nshould be) using. It becomes particularly important if we want to change neuronal parameters like\nsynaptic weights in order to maximize the likelihood of getting the reward again in the future. If\nonly the spike count is relevant, it might not be very effective to spend a lot of time and energy on\nthe difficult task of learning precisely timed spikes.\n\n\u2217E-Mail: henning.sprekeler@epfl.ch\n\nThe most modest and probably most versatile way of solving this problem is not to make any assumption\non the neural code but to assume that all features of the spike train were important. In\nthis case, neuronal parameters are changed such that the likelihood of repeating exactly the same\nspike train for the same synaptic input is maximized. This approach leads to a learning rule that\nwas derived in a number of recent publications [3, 5, 13]. Here, we show that a whole class of\nlearning rules emerges when prior knowledge about the neural code at hand is available. 
Using a\npolicy-gradient framework, we derive learning rules for neural parameters like synaptic weights or\nthreshold parameters that maximize the expected reward.\nOur aims are to (a) develop a systematic framework that allows us to derive learning rules for arbitrary\nneural parameters for different neural codes, (b) provide an intuitive understanding of how the resulting\nlearning rules work, (c) derive and test learning rules for specific example codes and (d) provide\na theoretical basis for why code-specific learning rules should be superior to general-purpose rules.\nFinally, we argue that the learning rules contain two types of prediction problems, one related to\nreward prediction, the other to response prediction.\n\n2 General framework\n\n2.1 Coding features and the policy-gradient approach\n\nThe basic setup is the following: let there be a set of different input spike trains $X^\\mu$ to a single postsynaptic\nneuron, which in response generates stochastic output spike trains $Y^\\mu$. In the language of\npartially observable Markov decision processes, the input spike trains are observations that provide\ninformation about the state of the animal and the output spike trains are controls that influence the\naction choice. Depending on both of these spike trains, the system receives a reward. The goal is to\nadjust a set of parameters $\\theta_i$ of the postsynaptic neuron such that it maximizes the expectation value\nof the reward.\nOur central assumption is that the reward $R$ does not depend on the full output spike train, but only\non a set of coding features $F_j(Y)$ of the output spike train: $R = R(F, X)$. Which coding features $F$\nthe reward depends on is in fact a choice of a neural code, because all other features of the spike\ntrain are not behaviorally relevant. 
Note that there is a conceptual difference to the notion of a neural\ncode in sensory processing, where the coding features convey information about input signals, not\nabout the output signal or rewards.\n\nThe expectation value of the reward is given by $\\langle R \\rangle = \\sum_{F,X} R(F,X) P(F|X,\\theta) P(X)$, where\n$P(X)$ denotes the probability of the presynaptic spike trains and $P(F|X,\\theta)$ the conditional probability\nof generating the coding feature $F$ given the input spike train $X$ and the neuronal parameters\n$\\theta$. Note that the only component that explicitly depends on the neural parameters $\\theta_i$ is the conditional\nprobability $P(F|X,\\theta)$. The reward is conditionally independent of the neural parameters $\\theta_i$\ngiven the coding feature $F$. Therefore, if we want to optimize the expected reward by employing a\ngradient ascent method, we get a learning rule of the form\n\n$$\\partial_t \\theta_i = \\eta \\sum_{F,X} R(F,X) P(X) \\, \\partial_{\\theta_i} P(F|X,\\theta) \\tag{1}$$\n\n$$\\phantom{\\partial_t \\theta_i} = \\eta \\sum_{F,X} P(X) P(F|X,\\theta) R(F,X) \\, \\partial_{\\theta_i} \\ln P(F|X,\\theta) . \\tag{2}$$\n\nIf we choose a small learning rate $\\eta$, the average over presynaptic patterns $X$ and coding features\n$F$ can be replaced by a time average. A corresponding online learning rule therefore results from\ndropping the average over $X$ and $F$:\n\n$$\\partial_t \\theta_i = \\eta R(F,X) \\, \\partial_{\\theta_i} \\ln P(F|X,\\theta) . \\tag{3}$$\n\nThis general form of learning rule is well known in policy-gradient approaches to reinforcement\nlearning [1, 12].\n\n2.2 Learning rules for exponentially distributed coding features\n\nThe joint distribution of the coding features $F_j$ can always be factorized into a set of conditional\ndistributions $P(F|X) = \\prod_i P(F_i|X; F_1, ..., F_{i-1})$. 
We now make the assumption that the conditional\ndistributions belong to the natural exponential family (NEF): $P(F_i|X; F_1, ..., F_{i-1}, \\theta) = h(F_i) \\exp(C_i F_i - A_i(C_i))$,\nwhere the $C_i$ are parameters that depend on the input spike train $X$, the\ncoding features $F_1, ..., F_{i-1}$ and the neural parameters $\\theta$. $h(F_i)$ is a function of $F_i$ and $A_i(C_i)$ is a\nfunction that is characteristic for the distribution and depends only on the parameters $C_i$. Note that\nthe NEF is a relatively rich class of distributions, which includes many canonical distributions like\nthe Poisson, Bernoulli and the Gaussian distribution (the latter with fixed variance).\nUnder these assumptions, the learning rule (3) takes a characteristic shape:\n\n$$\\partial_t \\theta_i = \\eta R(F,X) \\sum_j \\frac{F_j - \\mu_j}{\\sigma_j^2} \\, \\partial_{\\theta_i} \\mu_j , \\tag{4}$$\n\nwhere $\\mu_i$ and $\\sigma_i^2$ are the mean and the variance of the conditional distribution\n$P(F_i|X, F_1, ..., F_{i-1}, \\theta)$ and therefore also depend on the input $X$, the coding features $F_1, ..., F_{i-1}$\nand the parameters $\\theta$. Note that correlations between the coding features are implicitly accounted\nfor by the dependence of $\\mu_i$ and $\\sigma_i$ on the other features. The summation over different coding\nfeatures arises from the factorization of the distribution, while the specific shape of the summands\nrelies on the assumption of natural exponential family distributions [for a proof, cf. 12].\nThere is a simple intuition why the learning rule (4) performs gradient ascent on the mean reward.\nThe term $F_j - \\mu_j$ fluctuates around zero on a trial-to-trial basis. If these fluctuations are positively\ncorrelated with the trial fluctuations of the reward $R$, i.e., $\\langle R (F_j - \\mu_j) \\rangle > 0$, higher values of $F_j$\nlead to higher reward, so that the mean of the coding feature should be increased. 
This increase is\nimplemented by the term $\\partial_{\\theta_i} \\mu_j$, which changes the neural parameter $\\theta_i$ such that $\\mu_j$ increases.\n\n3 Examples for Coding Features\n\nIn this section, we illustrate the framework by deriving policy-gradient rules for different neural\ncodes and show that they can solve simple computational tasks.\nThe neuron type we are using is a simple Poisson-type neuron model where the postsynaptic firing\nrate is given by a nonlinear function $\\rho(u)$ of the membrane potential $u$. The membrane potential $u$,\nin turn, is given by the sum of the EPSPs that are evoked by the presynaptic spikes, weighted with\nthe respective synaptic weights:\n\n$$u(t) = \\sum_{i,f} w_i \\, \\epsilon(t - t_i^f) =: \\sum_i w_i \\, \\mathrm{PSP}_i(t) , \\tag{5}$$\n\nwhere $t_i^f$ denotes the time of the $f$-th spike in the $i$-th presynaptic neuron. $\\epsilon(t - t_i^f)$ denotes the\nshape of the postsynaptic potential evoked by a single presynaptic spike at time $t_i^f$. For future use,\nwe have introduced $\\mathrm{PSP}_i$ as the postsynaptic potential that would be evoked by the $i$-th presynaptic\nspike train alone, if the synaptic weight were unity.\nThe parameters that one could optimize in this neuron model are (a) the synaptic weights and (b) parameters\nin the dependence of the firing rate $\\rho$ on the membrane potential. The first case is the\nstandard case of synaptic plasticity, the second corresponds to a reward-driven version of intrinsic\nplasticity [cf. 10].\n\n3.1 Spike Count Codes: Synaptic plasticity\n\nLet us first assume that the coding feature is the number $N$ of spikes within a given time window\n$[0, T]$ and that the reward is delivered at the end of this period. The probability distribution for\nthe spike count is a Poisson distribution $P(N) = \\mu^N \\exp(-\\mu)/N!$ 
with a mean $\\mu$ that is given by\nthe integral of the firing rate $\\rho$ over the interval $[0, T]$:\n\n$$\\mu = \\int_0^T \\rho(t') \\, dt' . \\tag{6}$$\n\nThe dependence of the distribution $P(N)$ on the presynaptic spike trains $X$ and the synaptic\nweights $w_i$ is hidden in the mean spike count $\\mu$, which naturally depends on those factors through\nthe postsynaptic firing rate $\\rho$.\nBecause the Poisson distribution belongs to the NEF, we can derive a synaptic learning rule by using\nequation (4) and calculating the particular form of the term $\\partial_{w_i} \\mu$:\n\n$$\\partial_t w_i = \\eta R \\, \\frac{N - \\mu}{\\mu} \\int_0^T [\\partial_u \\rho](t') \\, \\mathrm{PSP}_i(t') \\, dt' . \\tag{7}$$\n\nThis learning rule has structural similarities with the Bienenstock-Cooper-Munro (BCM) rule [2]:\nThe integral term has the structure of an eligibility trace that is driven by a simple Hebbian learning\nrule. In addition, learning is modulated by a factor that compares the current spike count (\u201crate\u201d)\nwith the expected spike count (\u201csliding threshold\u201d in BCM theory). Interestingly, the functional role\nof this factor is very different from the one in the original BCM rule: It is not meant to introduce\nselectivity [2], but rather to exploit trial fluctuations around the mean spike count to explore the\nstructure of the reward landscape.\nWe test the learning rule on a 2-armed bandit task (Figure 1A). An agent has the choice between\ntwo actions. Depending on which of two states the agent is in, action $a_1$ or action $a_2$ is rewarded\n($R = 1$), while the other action is punished ($R = -1$). The state information is encoded in the rate\npattern of 100 presynaptic neurons. For each state, a different input pattern is generated by drawing\nthe firing rate of each input neuron independently from an exponential distribution with a mean of\n10 Hz. 
In each trial, the input spike trains are generated anew from Poisson processes with these\nneuron- and state-specific rates. The agent chooses its action stochastically with probabilities that\nare proportional to the spike counts of two output neurons: $p(a_k|s) = N_k/(N_1 + N_2)$. Because\nthe spike counts depend on the state via the presynaptic firing rates, the agent can choose different\nactions for different states. Figure 1B and C show that the learning rule learns the task by suppressing\nactivity in the neuron that encodes the punished action.\nIn all simulations throughout the paper, the postsynaptic neurons have an exponential rate function\n$\\rho(u) = \\exp(\\gamma(u - u_0))$, where the threshold is $u_0 = 1$. The sharpness parameter $\\gamma$ is set to either\n$\\gamma = 1$ (for the 2-armed bandit task) or $\\gamma = 3$ (for the spike latency task). Moreover, the postsynaptic\nneurons have a membrane potential reset after each spike (i.e., relative refractoriness), so that the\nassumption of a Poisson distribution for the spike counts is not necessarily fulfilled. It is worth\nnoting that this did not have an impeding effect on learning performance.\n\n3.2 Spike Count Codes: Intrinsic plasticity\n\nLet us now assume that the rate of the neuron is given by a function $\\rho(u) = g(\\gamma(u - u_0))$ which\ndepends on the threshold parameters $u_0$ and $\\gamma$. Typical choices for the function $g$ would be an\nexponential (as used in the simulations), a sigmoid or a smooth threshold-linear function $g(x) = \\ln(1 + \\exp(x))$.\nBy intrinsic plasticity we mean that the parameters $u_0$ and $\\gamma$ are learned instead of or in addition\nto the synaptic weights. 
The learning rules for these parameters are essentially the same as for the\nsynaptic weights, only that the derivative of the mean spike count is taken with respect to $u_0$ and $\\gamma$,\nrespectively:\n\n$$\\partial_t u_0 = \\eta R \\, \\frac{N - \\mu}{\\mu} \\, \\partial_{u_0} \\mu = -\\eta R \\, \\frac{N - \\mu}{\\mu} \\int_0^T \\gamma \\, g'(\\gamma(u(t) - u_0)) \\, dt \\tag{8}$$\n\n$$\\partial_t \\gamma = \\eta R \\, \\frac{N - \\mu}{\\mu} \\, \\partial_\\gamma \\mu = \\eta R \\, \\frac{N - \\mu}{\\mu} \\int_0^T g'(\\gamma(u(t) - u_0)) \\, (u(t) - u_0) \\, dt . \\tag{9}$$\n\nHere, $g' = \\partial_x g(x)$ denotes the derivative of the rate function $g$ with respect to its argument.\n\n3.3 First Spike-Latency Code: Synaptic plasticity\n\nAs a second coding scheme, let us assume that the reward depends only on the latency $\\hat{t}$ of the first\nspike after stimulus onset. More precisely, we assume that each trial starts with the onset of the\npresynaptic spike trains $X$ and that a reward is delivered at the time of the first spike. The reward\ndepends on the latency of that spike, so that certain latencies are favored.\n\nFigure 1: Simulations for code-specific learning rules. A 2-armed bandit task: The agent has to choose among\ntwo actions $a_1$ and $a_2$. Depending on the state ($s_1$ or $s_2$), a different action is rewarded (thick arrows). The\ninput states are modelled by different firing rate patterns of the input neurons. The probability of choosing the\nactions is proportional to the spike counts of two output neurons: $p(a_k|s) = N_k/(N_1 + N_2)$. B Learning\ncurves of the 2-armed bandit. Blue: Spike count learning rule (7), Red: Full spike train rule (16). C Evolution\nof the spike count in response to the two input states during learning. Both rewards (panel B) and spike counts\n(panel C) are low-pass filtered with a time constant of 4000 trials. D Learning of first spike latencies with\nthe latency rule (11). 
Two different output neurons are to learn to fire their first spike at given target latencies\n$L_{1/2}$. We present one of two fixed input spike train patterns (\u201cstimuli\u201d) to the neurons in randomly interleaved\ntrials. The input spike train for each input neuron is drawn separately for each stimulus by sampling once from\na Poisson process with a rate of 10 Hz. Reward is given by the negative squared difference between the target\nlatency (stimulus 1: $L_1 = 10$ ms, $L_2 = 30$ ms, stimulus 2: $L_1 = 30$ ms, $L_2 = 10$ ms) and the actual latency\nof the trial, summed over the two neurons. The colored curves show that the first spike latencies of neuron 1\n(green, red) and neuron 2 (purple, blue) converge to the target latencies. The black curve (scale on the right\naxis) shows the evolution of the reward during learning.\n\nThe probability distribution of the spike latency is given by the product of the firing probability at\ntime $\\hat{t}$ and the probability that the neuron did not fire earlier:\n\n$$P(\\hat{t}) = \\rho(\\hat{t}) \\exp\\left( -\\int_0^{\\hat{t}} \\rho(t') \\, dt' \\right) . \\tag{10}$$\n\nUsing eq. (3) for this particular distribution, we get the synaptic learning rule:\n\n$$\\partial_t w_i = \\eta R \\left( \\frac{[\\partial_u \\rho](\\hat{t}) \\, \\mathrm{PSP}_i(\\hat{t})}{\\rho(\\hat{t})} - \\int_0^{\\hat{t}} [\\partial_u \\rho](t') \\, \\mathrm{PSP}_i(t') \\, dt' \\right) . \\tag{11}$$\n\nIn Figure 1D, we show that this learning rule can learn to adjust the weights of two neurons such\nthat their first spike latencies approximate a set of target latencies.\n\n3.4 The Full Spike Train Code: Synaptic plasticity\n\nFinally, let us consider the most general coding feature, namely, the full spike train. 
Let us start with\na time-discretized version of the spike train with a discretization that is sufficiently narrow to allow at\nmost one spike per time bin. In each time bin $[t, t + \\Delta t]$, the number of spikes $Y_t$ follows a Bernoulli\ndistribution with spiking probability $p_t$, which depends on the input and on the recent history of the\nneuron. Because the Bernoulli distribution belongs to the NEF, the associated policy-gradient rule\ncan be derived using equation (4):\n\n$$\\partial_t w_i = \\eta R \\sum_t \\frac{Y_t - p_t}{p_t (1 - p_t)} \\, \\partial_{w_i} p_t . \\tag{12}$$\n\nThe firing probability $p_t$ depends on the instantaneous firing rate $\\rho_t$: $p_t = 1 - \\exp(-\\rho_t \\Delta t)$, yielding:\n\n$$\\partial_t w_i = \\eta R \\sum_t \\frac{Y_t - p_t}{p_t (1 - p_t)} \\underbrace{[\\partial_\\rho p_t]}_{= \\Delta t (1 - p_t)} [\\partial_{w_i} \\rho_t] \\tag{13}$$\n\n$$\\phantom{\\partial_t w_i} = \\eta R \\sum_t (Y_t - p_t) \\, \\frac{\\partial_u \\rho_t}{p_t} \\, \\mathrm{PSP}_i(t) \\, \\Delta t . \\tag{14}$$\n\nThis is the rule that should be used in discretized simulations. In the limit $\\Delta t \\to 0$, $p_t$ can be\napproximated by $p_t \\to \\rho_t \\Delta t$, which leads to the continuous time version of the rule:\n\n$$\\partial_t w_i = \\eta R \\lim_{\\Delta t \\to 0} \\sum_t \\left( \\frac{Y_t}{\\Delta t} - \\rho_t \\right) \\frac{\\partial_u \\rho_t}{\\rho_t} \\, \\mathrm{PSP}_i(t) \\, \\Delta t \\tag{15}$$\n\n$$\\phantom{\\partial_t w_i} = \\eta R \\int (Y(t) - \\rho(t)) \\, \\frac{[\\partial_u \\rho](t)}{\\rho(t)} \\, \\mathrm{PSP}_i(t) \\, dt . \\tag{16}$$\n\nHere, $Y(t) = \\sum_{t_i} \\delta(t - t_i)$ is now a sum of $\\delta$-functions. Note that the learning rule (16) was already\nproposed by Xie and Seung [13] and Florian [3] and, slightly modified for supervised learning, by\nPfister et al. 
[5].\nFollowing the same line, policy gradient rules can also be derived for the intrinsic parameters of the\nneuron, i.e., its threshold parameters (see also [3]).\n\n4 Why use code-specific rules when more general rules are available?\n\nObviously, the learning rule (16) is the most general in the sense that it considers the whole spike\ntrain as a coding feature. All possible other features are therefore captured in this learning rule. The\nnatural question is then: what is the advantage of using rules that are specialized for one specific\ncode?\nSay, we have a learning rule for two coding features $F_1$ and $F_2$, of which only $F_1$ is correlated with\nreward. The learning rule for a particular neuronal parameter $\\theta$ then has the following structure:\n\n$$\\partial_t \\theta = \\eta R(F_1) \\left( \\frac{F_1 - \\mu_1}{\\sigma_1^2} \\frac{\\partial \\mu_1}{\\partial \\theta} + \\frac{F_2 - \\mu_2}{\\sigma_2^2} \\frac{\\partial \\mu_2}{\\partial \\theta} \\right) \\tag{17}$$\n\n$$\\approx \\eta \\left( R(\\mu_1) + \\left. \\frac{\\partial R}{\\partial F_1} \\right|_{\\mu_1} (F_1 - \\mu_1) \\right) \\left( \\frac{F_1 - \\mu_1}{\\sigma_1^2} \\frac{\\partial \\mu_1}{\\partial \\theta} + \\frac{F_2 - \\mu_2}{\\sigma_2^2} \\frac{\\partial \\mu_2}{\\partial \\theta} \\right) \\tag{18}$$\n\n$$= \\eta \\left. \\frac{\\partial R}{\\partial F_1} \\right|_{\\mu_1} \\frac{(F_1 - \\mu_1)^2}{\\sigma_1^2} \\frac{\\partial \\mu_1}{\\partial \\theta} + \\eta \\left. \\frac{\\partial R}{\\partial F_1} \\right|_{\\mu_1} \\frac{(F_1 - \\mu_1)(F_2 - \\mu_2)}{\\sigma_2^2} \\frac{\\partial \\mu_2}{\\partial \\theta} \\tag{19}$$\n\n$$+ \\, \\eta R(\\mu_1) \\frac{F_1 - \\mu_1}{\\sigma_1^2} \\frac{\\partial \\mu_1}{\\partial \\theta} + \\eta R(\\mu_1) \\frac{F_2 - \\mu_2}{\\sigma_2^2} \\frac{\\partial \\mu_2}{\\partial \\theta} . \\tag{20}$$\n\nOf the four terms in lines (19-20), only the first term has non-vanishing mean when taking the trial\naverage. 
The other terms are simply noise and therefore more hindrance than help when trying to\nmaximize the reward. When using the full learning rule for both features, the learning rate needs to\nbe decreased until an agreeable signal-to-noise ratio between the drift introduced by the first term\nand the diffusion caused by the other terms is reached. Therefore, it is desirable for faster learning\nto reduce the effects of these noise terms. This can be done in two ways:\n\n\u2022 The terms in eq. (20) can be reduced by reducing $R(\\mu_1)$. This can be achieved by subtracting\na suitable reward baseline from the current reward. Ideally, this should be done in a\nstimulus-specific way (because $\\mu_1$ depends on the stimulus), which leads to the notion of a\nreward prediction error instead of a pure reward signal. This approach is in line with both\nstandard reinforcement learning theory [4] and the proposal that neuromodulatory signals\nlike dopamine represent reward prediction error instead of reward alone.\n\n\u2022 The terms in eqs. (19) and (20) that involve $F_2$ can be removed by skipping those terms in the original learning rule that are\nrelated to coding feature $F_2$. This corresponds to using the learning rule for those features\nthat are in fact correlated with reward while suppressing those that are not correlated with\nreward. The central argument for using code-specific learning rules is therefore the signal-to-noise\nratio. In extreme cases, where a very general rule is used for a very specific task,\na very large number of coding dimensions may merely give rise to noise in the learning\ndynamics, while only one is relevant and causes systematic changes.\n\nThese considerations suggest that the spike count rule (7) should outperform the full spike train\nrule (16) in tasks where the reward is based purely on spike count. Unfortunately, we could not\nyet substantiate this claim in simulations. 
As seen in Figure 1B, the performance of the two rules\nis very similar in the 2-armed bandit task. This might be due to a noise bottleneck effect: there\nare several sources of noise in the learning process, the strongest of which limits the performance.\nUnless the \u201ccode-specific noise\u201d is dominant, code-specific learning rules will have about the same\nperformance as general purpose rules.\n\n5 Inherent Prediction Problems\n\nAs shown in section 4, the policy-gradient rule with a reduced amount of noise in the gradient\nestimate is one that takes only the relevant coding features into account and subtracts the trial mean\nof the reward:\n\n$$\\partial_t \\theta = \\eta \\, (R - R(\\mu_1, \\mu_2, ...)) \\sum_j \\frac{F_j - \\mu_j}{\\sigma_j^2} \\, \\partial_\\theta \\mu_j . \\tag{21}$$\n\nThis learning rule has a conceptually interesting structure: Learning takes place only when two conditions\nare fulfilled: the animal did something unexpected ($F_j - \\mu_j$) and receives an unexpected\nreward ($R - R(\\mu_1, \\mu_2, ...)$). Moreover, it raises two interesting prediction problems: (a) the prediction\nof the trial average $\\mu_j$ of the coding feature conditioned on the stimulus and (b) the reward that\nis expected if the coding feature takes its mean value.\n\n5.1 Prediction of the coding feature\n\nIn the cases where we could derive the learning rule analytically, the trial average of the coding\nfeature could be calculated from intrinsic properties of the neuron like its membrane potential. Unfortunately,\nit is not clear a priori that the information necessary for calculating this mean is always\navailable. 
This should be particularly problematic when trying to extend the framework to coding\nfeatures of populations, where the population would need to know, e.g., membrane properties of its\nmembers.\nAn interesting alternative is that the trial mean is calculated by a prediction system, e.g., by top-down\nsignals that use prior information or an internal world model to predict the expected value\nof the coding feature. Learning would in this case be modulated by the mismatch of a top-down\nprediction of the coding feature - represented by $\\mu_j(X)$ - and the real value of $F_j$, which is calculated\nby a \u201cbottom-up\u201d approach. This interpretation bears interesting parallels to certain approaches in\nsensory coding, where the interpretation of sensory information is based on a comparison of the\nsensory input with an internally generated prediction from a generative model [cf. 6]. There is also\nsome experimental evidence for neural stimulus prediction even in comparatively low-level systems such as\nthe retina [e.g. 8].\nAnother prediction system for the expected response could be a population coding scheme, in which\na population of neurons is receiving the same input and should produce the same output. 
Any\nneuron of the population could receive the average population activity as a prediction of its own\nmean response. It would be interesting to study the relation of such an approach with the one\nrecently proposed for reinforcement learning in populations of spiking neurons [11].\n\n5.2 Reward prediction\n\nThe other quantity that should be predicted in the learning rule is the reward one would get if\nthe coding feature took the value of its mean. If the distribution of the coding feature is\nsufficiently narrow so that in the range $F$ takes for a given stimulus, the reward can be approximated\nby a linear function, the reward $R(\\mu)$ at the mean is simply the expectation value of the reward given\nthe stimulus:\n\n$$R(\\mu) \\approx \\langle R(F) \\rangle_{F|X} . \\tag{22}$$\n\nThe relevant quantity for learning is therefore a reward prediction error $R(F) - \\langle R(F) \\rangle_{F|X}$. In\nclassical reinforcement learning, this term is often calculated in an actor-critic architecture, where\nsome external module - the critic - learns the expected future reward either for states alone or for\nstate-action pairs. These values are then used to calculate the expected reward for the current state\nor state-action pair. The difference between the reward that was really received and the predicted\nreward is then used as a reward prediction error that drives learning. There is evidence that dopamine\nsignals in the brain encode prediction error rather than reward alone [7].\n\n6 Discussion\n\nWe have presented a general framework for deriving policy-gradient rules for spiking neurons and\nshown that different learning rules emerge depending on which features of the spike trains are assumed\nto influence the reward signals. Theoretical arguments suggest that code-specific learning\nrules should be superior to more general rules, because the noise in the estimate of the gradient\nshould be smaller. 
More simulations will be necessary to check if this is indeed the case and in\nwhich applications code-specific learning rules are advantageous.\nFor exponentially distributed coding features, the learning rule has a characteristic structure, which\nallows a simple intuitive interpretation. Moreover, this structure raises two prediction problems,\nwhich may provide links to other concepts: (a) the notion of using a reward prediction error to reduce\nthe variance in the estimate of the gradient creates a link to actor-critic architectures [9] and (b) the\nnotion of coding feature prediction is reminiscent of combined top-down\u2013bottom-up approaches,\nwhere sensory learning is driven by the mismatch of internal predictions and the sensory signal [6].\nThe fact that there is a whole class of code-specific policy-gradient learning rules opens the interesting\npossibility that neuronal learning rules could be controlled by metalearning processes that shape\nthe learning rule according to what neural code is in effect. From the biological perspective, it would\nbe interesting to compare spike-based synaptic plasticity in different brain regions that are thought\nto use different neural codes and see if there are systematic differences.\n\nReferences\n[1] Baxter, J. and Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15(4):319\u2013350.\n\n[2] Bienenstock, E., Cooper, L., and Munro, P. (1982). Theory of the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2:32\u201348. Reprinted in Anderson and Rosenfeld, 1990.\n\n[3] Florian, R. V. (2007). Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 19:1468\u20131502.\n\n[4] Greensmith, E., Bartlett, P., and Baxter, J. (2004). 
Variance reduction techniques for gradient estimates in reinforcement learning. The Journal of Machine Learning Research, 5:1471\u20131530.\n\n[5] Pfister, J.-P., Toyoizumi, T., Barber, D., and Gerstner, W. (2006). Optimal spike-timing dependent plasticity for precise action potential firing in supervised learning. Neural Computation, 18:1309\u20131339.\n\n[6] Rao, R. P. and Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79\u201387.\n\n[7] Schultz, W., Dayan, P., and Montague, R. (1997). A neural substrate of prediction and reward. Science, 275:1593\u20131599.\n\n[8] Schwartz, G., Harris, R., Shrom, D., and Berry II, M. J. (2007). Detection and prediction of periodic patterns by the retina. Nature Neuroscience, 10:552\u2013554.\n\n[9] Sutton, R. and Barto, A. (1998). Reinforcement learning. MIT Press, Cambridge.\n\n[10] Triesch, J. (2007). Synergies between intrinsic and synaptic plasticity mechanisms. Neural Computation, 19:885\u2013909.\n\n[11] Urbanczik, R. and Senn, W. (2009). Reinforcement learning in populations of spiking neurons. Nature Neuroscience, 12(3):250\u2013252.\n\n[12] Williams, R. (1992). Simple statistical gradient-following methods for connectionist reinforcement learning. Machine Learning, 8:229\u2013256.\n\n[13] Xie, X. and Seung, H. (2004). Learning in neural networks by reinforcement of irregular spiking. Physical Review E, 69(4):041909.\n", "award": [], "sourceid": 595, "authors": [{"given_name": "Henning", "family_name": "Sprekeler", "institution": null}, {"given_name": "Guillaume", "family_name": "Hennequin", "institution": null}, {"given_name": "Wulfram", "family_name": "Gerstner", "institution": null}]}