{"title": "Theoretical Analysis of Learning with Reward-Modulated Spike-Timing-Dependent Plasticity", "book": "Advances in Neural Information Processing Systems", "page_first": 881, "page_last": 888, "abstract": null, "full_text": "Theoretical Analysis of Learning with\n\nReward-Modulated Spike-Timing-Dependent\n\nPlasticity\n\nRobert Legenstein, Dejan Pecevski, Wolfgang Maass\n\nInstitute for Theoretical Computer Science\n\nGraz University of Technology\n\nA-8010 Graz, Austria\n\n{legi,dejan,maass}@igi.tugraz.at\n\nAbstract\n\nReward-modulated spike-timing-dependent plasticity (STDP) has\nrecently\nemerged as a candidate for a learning rule that could explain how local learning\nrules at single synapses support behaviorally relevant adaptive changes in com-\nplex networks of spiking neurons. However the potential and limitations of this\nlearning rule could so far only be tested through computer simulations. This ar-\nticle provides tools for an analytic treatment of reward-modulated STDP, which\nallow us to predict under which conditions reward-modulated STDP will be able\nto achieve a desired learning effect. In particular, we can produce in this way\na theoretical explanation and a computer model for a fundamental experimental\n\ufb01nding on biofeedback in monkeys (reported in [1]).\n\n1 Introduction\n\nA major puzzle for understanding learning in biological organisms is the relationship between ex-\nperimentally well-established learning rules for synapses (such as STDP) on the microscopic level\nand adaptive changes of the behavior of biological organisms on the macroscopic level. Neuromod-\nulatory systems which send diffuse signals related to reinforcements (rewards) and behavioral state\nto several large networks of neurons in the brain, have been identi\ufb01ed as likely intermediaries that\nrelate these two levels of learning. 
It is well-known that the consolidation of changes of synaptic weights in response to pre- and postsynaptic neuronal activity requires the presence of such third signals [2]. Corresponding spike-based learning rules of the form

\frac{dw_{ji}(t)}{dt} = c_{ji}(t)\, d(t),    (1)

have been proposed in [3], where w_{ji} is the weight of a synapse from neuron i to neuron j, c_{ji}(t) is an eligibility trace of this synapse which collects proposed weight changes resulting from a learning rule such as STDP, and d(t) = h(t) - \bar{h} is a neuromodulatory signal with mean \bar{h} (where h(t) might for example represent reward prediction errors, encoded through the concentration of dopamine in the extracellular fluid). We consider in this article only cases where the reward prediction error is equal to the current reward, and we refer to d(t) simply as the reward signal. Obviously such a learning scheme (1) faces a large credit-assignment problem, since the top-down signal d(t) reaches not only those synapses for which weight changes would increase the chances of future reward, but billions of other synapses too. Nevertheless the brain is able to solve this credit-assignment problem, as has been shown in one of the earliest (but still among the most amazing) demonstrations of biofeedback in monkeys [1]. The spiking activity of single neurons (in area 4 of the precentral gyrus) was recorded, the current firing rate of such a neuron was made visible to the monkey in the form of an illuminated meter, and the monkey received food rewards for increases (or, in alternating trials, for decreases) of the firing rate of this neuron from its average level. The monkeys learnt quite reliably (on the time scale of tens of minutes) to change the firing rate of this neuron in the currently rewarded direction^1.
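The interaction of the eligibility trace with a delayed reward in rule (1) can be sketched in a few lines of Python. This is a toy discretization with illustrative constants, not the simulation code of this paper; for simplicity the trace decays as a plain exponential rather than the alpha-shaped f_c used below:

```python
# Toy Euler discretization of the learning rule dw/dt = c(t) * d(t):
# an STDP pairing deposits into the eligibility trace c(t), which then
# decays; the weight only changes if a reward pulse d(t) arrives while
# the trace is still substantially nonzero.
DT = 0.001      # time step [s]
TAU_E = 0.5     # eligibility trace time constant [s], as in the paper

def run(pairing_step, reward_step, reward_value, n_steps=2000):
    """Final weight after one STDP pairing and one reward pulse."""
    w, c = 0.5, 0.0
    for step in range(n_steps):
        if step == pairing_step:
            c += 1.0                    # STDP proposes a weight change
        if step == reward_step:
            w += c * reward_value       # dw = c(t) * d(t), reward as a pulse
        c -= (DT / TAU_E) * c           # exponential decay of the trace
    return w

# Reward arriving 0.3 s after the pairing changes the weight noticeably;
# the same reward arriving 1.8 s after it has almost no effect.
w_close = run(pairing_step=100, reward_step=400, reward_value=0.1)
w_late = run(pairing_step=100, reward_step=1900, reward_value=0.1)
```

This also illustrates the credit-assignment problem: any synapse whose trace happens to be nonzero when the global reward arrives is updated, whether or not it caused the reward.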
Obviously the existence of learning mechanisms in the brain which are able to solve this difficult credit assignment problem is fundamental for understanding and modeling many other learning features of the brain. We present in sections 3 and 4 of this article a learning theory for (1), where the eligibility trace c_{ji}(t) results from standard forms of STDP, which is able to explain the success of the experiment in [1]. This theoretical model is confirmed by computer simulations (see section 4.1). In section 5 we leave this concrete learning experiment and investigate under which conditions neurons can learn through trial and error (via reward-modulated STDP) associations of specific firing patterns to specific patterns of input spikes. The resulting theory leads to predictions of specific parameter ranges for STDP that support this general form of learning. These predictions were tested through computer experiments, see section 5.1.

Other interesting results of computer simulations of reward-modulated STDP in the context of neural circuits were recently reported in [3] and [4] (we also refer to these articles for reviews of preceding work by Seung and others).

2 Models for neurons and synaptic plasticity

The spike train of a neuron i which fires action potentials at times t_i^{(1)}, t_i^{(2)}, t_i^{(3)}, ... is formalized by a sum of Dirac delta functions S_i(t) = \sum_{t_i^{(n)}} \delta(t - t_i^{(n)}). We assume that positive and negative weight changes suggested by STDP for all pairs of pre- and postsynaptic spikes (according to the two integrals in (2)) are collected in an eligibility trace c_{ji}(t), where the impact of a spike pairing with the second spike at time t - s on the eligibility trace at time t is given by some function f_c(s) for s \ge 0:

c_{ji}(t) = \int_0^\infty ds\, f_c(s) \left[ \int_0^\infty dr\, W(r)\, S_j^{post}(t-s)\, S_i^{pre}(t-s-r) + \int_0^\infty dr\, W(-r)\, S_j^{post}(t-s-r)\, S_i^{pre}(t-s) \right].    (2)

In our simulations, f_c(s) is a function of the form f_c(s) = (s/\tau_e)\, e^{-s/\tau_e} if s \ge 0 and 0 otherwise, with time constant \tau_e = 0.5 s. W(r) denotes the standard exponential STDP learning window

W(r) = A_+ e^{-r/\tau_+} if r \ge 0, and W(r) = -A_- e^{r/\tau_-} if r < 0,    (3)

where the positive constants A_+ and A_- scale the strength of potentiation and depression, \tau_+ and \tau_- are positive time constants defining the width of the positive and negative learning window, and S_i^{pre}, S_j^{post} are the spike trains of the presynaptic and postsynaptic neuron respectively. The actual weight change is the product of the eligibility trace with the reward signal, as defined by equation (1). We assume that weights are clipped at the lower boundary value 0 and an upper boundary w_{max}.

We use a linear Poisson neuron model whose output spike train S_j^{post}(t) is a realization of a Poisson process with the underlying instantaneous firing rate R_j(t).
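For concreteness, the learning window (3) translates directly into code. The amplitudes and time constants below are illustrative placeholders, not the values used in this paper's simulations:

```python
import math

# Exponential STDP learning window W(r) of equation (3), where
# r = t_post - t_pre is the lag of a pre/post spike pair.
A_PLUS, A_MINUS = 1.0, 1.05          # potentiation / depression amplitudes
TAU_PLUS, TAU_MINUS = 0.020, 0.020   # window time constants [s]

def stdp_window(r):
    """Proposed weight change for a spike pair with lag r [s]."""
    if r >= 0:                        # pre before post: potentiation
        return A_PLUS * math.exp(-r / TAU_PLUS)
    return -A_MINUS * math.exp(r / TAU_MINUS)   # post before pre: depression

# Integral over the window: Wbar = A+ * tau+  -  A- * tau-. With the values
# above it is slightly negative, i.e. the rule is depression-dominated.
W_BAR = A_PLUS * TAU_PLUS - A_MINUS * TAU_MINUS
```

The sign of the window integral W_BAR plays a role in the conditions derived in section 5, where a slightly depression-dominated window is needed to depress the synapses that should end at weight zero.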
The effect of a spike of presynaptic neuron i at time t' on the membrane potential of neuron j is modeled by an increase in the instantaneous firing rate by an amount w_{ji}(t')\, \epsilon(t - t'), where \epsilon is a response kernel which models the time course of a postsynaptic potential (PSP) elicited by an input spike. Since STDP according to [3] has been experimentally confirmed only for excitatory synapses, we will consider plasticity only for excitatory connections and assume that w_{ji} \ge 0 for all i, j and \epsilon(s) \ge 0 for all s. Because the synaptic response is scaled by the synaptic weights, we can assume without loss of generality that the response kernel is normalized to \int_0^\infty ds\, \epsilon(s) = 1. In this linear model, the contributions of all inputs are summed up linearly:

R_j(t) = \sum_{i=1}^{n} \int_0^\infty ds\, w_{ji}(t-s)\, \epsilon(s)\, S_i(t-s),    (4)

where S_1, ..., S_n are the n presynaptic spike trains.

^1 Adjacent neurons tended to change their firing rate in the same direction, but differential changes of the firing rates of pairs of neurons are also reported in [1] (when these differential changes were rewarded).

3 Theoretical analysis of the resulting weight changes

We are interested in the expected weight change over some time interval T (see [5]), where the expectation is over realizations of the stochastic input and output spike trains as well as over a stochastic realization of the reward signal, denoted by the ensemble average \langle \cdot \rangle_E:

\frac{\langle w_{ji}(t+T) - w_{ji}(t) \rangle_E}{T} = \frac{1}{T} \left\langle \int_t^{t+T} \frac{d}{dt'} w_{ji}(t')\, dt' \right\rangle_E = \left\langle \left\langle \frac{d}{dt} w_{ji}(t) \right\rangle_T \right\rangle_E,    (5)

where we used the abbreviation \langle f(t) \rangle_T = T^{-1} \int_t^{t+T} f(t')\, dt'. Using equation (1), this yields

\frac{\langle w_{ji}(t+T) - w_{ji}(t) \rangle_E}{T} = \int_0^\infty dr\, W(r) \int_0^\infty ds\, f_c(s)\, \langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \rangle_T + \int_{-\infty}^0 dr\, W(r) \int_{|r|}^\infty ds\, f_c(s+r)\, \langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \rangle_T,    (6)

where D_{ji}(t,s,r) = \langle d(t) \mid neuron j spikes at t-s, and neuron i spikes at t-s-r \rangle_E is the average reward at time t given a presynaptic spike at time t-s-r and a postsynaptic spike at time t-s, and \nu_{ji}(t,r) = \langle S_j(t)\, S_i(t-r) \rangle_E describes correlations between pre- and postsynaptic spike timings (see [6] for the derivation). We see that the expected weight change depends on how the correlations between the pre- and postsynaptic neurons correlate with the reward signal. If these correlations vary slowly with time, we can exploit the self-averaging property of the weight vector. Analogously to [5], we can drop the ensemble average on the left-hand side and obtain:

\frac{d}{dt} \langle w_{ji}(t) \rangle_T = \int_0^\infty dr\, W(r) \int_0^\infty ds\, f_c(s)\, \langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \rangle_T + \int_{-\infty}^0 dr\, W(r) \int_{|r|}^\infty ds\, f_c(s+r)\, \langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \rangle_T.    (7)

In the following, we will always use the smooth time-averaged vector \langle w_{ji}(t) \rangle_T, but for brevity we will drop the angular brackets. If one assumes for simplicity that the impact of a pre-post spike pair on the eligibility trace is always triggered by the postsynaptic spike, one gets (see [6] for details):

\frac{dw_{ji}(t)}{dt} = \int_0^\infty ds\, f_c(s) \int_{-\infty}^\infty dr\, W(r)\, \langle D_{ji}(t,s,r)\, \nu_{ji}(t-s,r) \rangle_T.    (8)

This assumption (which is common in STDP analysis) introduces a small error for post-before-pre spike pairs, since if a reward signal arrives at some time d_r after the pairing, the weight update will be proportional to f_c(d_r) instead of f_c(d_r + r). For the analyses presented in this article, the simplified equation (8) is a good approximation for the learning dynamics (see [6]).
Equation (8) shows that if the reward signal does not depend on pre- and postsynaptic spike statistics, the weight will change according to standard STDP, scaled by a constant proportional to the mean reward.

4 Application to biofeedback experiments

We now apply our theoretical approach to the biofeedback experiments by Fetz and Baker [1] that we have sketched in the introduction. The authors showed that it is possible to increase and decrease the firing rate of a randomly chosen neuron by rewarding the monkey for its high (respectively low) firing rates. We assume in our model that a reward is delivered to all neurons in the simulated recurrent network with some delay d_r every time a specific neuron k in the network produces an action potential:

d(t) = \int_0^\infty dr\, S_k^{post}(t - d_r - r)\, \epsilon_r(r),    (9)

where \epsilon_r(r) is the shape of the reward pulse corresponding to one postsynaptic spike of the reinforced neuron. We assume that the reward kernel \epsilon_r has zero mass, i.e., \bar{\epsilon}_r = \int_0^\infty dr\, \epsilon_r(r) = 0. In our simulations, this reward kernel will have a positive bump in the first few hundred milliseconds, and a long-tailed negative bump afterwards. With the linear Poisson neuron model (see Section 2), the correlation of the reward with pre-post spike pairs of the reinforced neuron is (see [6])

D_{ki}(t,s,r) = w_{ki} \int_0^\infty dr'\, \epsilon_r(r')\, \epsilon(s + r - d_r - r') + \epsilon_r(s - d_r) \approx \epsilon_r(s - d_r).    (10)

The last approximation holds if the impact of a single input spike on the membrane potential is small.
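A discretized sketch of the reward delivery (9), with an \epsilon_r of the shape just described (positive bump followed by a long-tailed negative bump). The specific constants are illustrative; the zero-mass constraint is easy to enforce because an alpha function A (t/\tau) e^{-t/\tau} integrates to A\tau:

```python
import math

# Equation (9), discretized: each spike of the reinforced neuron k injects a
# delayed copy of the reward kernel eps_r into the global signal d(t). The
# kernel is built from two alpha functions: a fast positive bump and a slow
# negative one. Since A*(t/tau)*exp(-t/tau) integrates to A*tau, zero total
# mass only requires A_P*TAU_P == A_M*TAU_M. All constants are illustrative.
DT = 0.001                       # time step [s]
D_REW = 0.5                      # reward delay d_r [s]
TAU_P, TAU_M = 0.2, 1.0          # fast positive bump, slow negative bump [s]
A_P = 1.0
A_M = A_P * TAU_P / TAU_M        # enforces zero mass

def eps_r(t):
    if t < 0:
        return 0.0
    return (A_P * (t / TAU_P) * math.exp(-t / TAU_P)
            - A_M * (t / TAU_M) * math.exp(-t / TAU_M))

def reward_signal(spike_steps, n_steps):
    """d(t) on a grid: sum over spikes of eps_r(t - t_spike - d_r)."""
    d = [0.0] * n_steps
    delay = int(D_REW / DT)
    for s in spike_steps:
        for i in range(s + delay, n_steps):
            d[i] += eps_r((i - s - delay) * DT)
    return d

# One spike at step 100: the reward bump starts 0.5 s later, goes negative
# afterwards, and integrates to approximately zero.
d = reward_signal([100], n_steps=20000)
```

Because the kernel has zero mass, a neuron whose activity is uncorrelated with the reinforced neuron sees, on average, as much negative as positive reward, which is what keeps the non-reinforced weights in place in the analysis below.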
The correlation of the reward with pre-post spike pairs of non-reinforced neurons is

D_{ji}(t,s,r) = \int_0^\infty dr'\, \epsilon_r(r')\, \frac{\nu_{kj}(t - d_r - r',\, s - d_r - r') + w_{ki} w_{ji}\, \epsilon(s + r - d_r - r')\, \epsilon(r)}{\nu_j(t-s) + w_{ji}\, \epsilon(r)}.    (11)

If the contribution of a single postsynaptic potential to the membrane potential is small, we can neglect the impact of the presynaptic spike and write

D_{ji}(t,s,r) \approx \int_0^\infty dr'\, \epsilon_r(r')\, \frac{\nu_{kj}(t - d_r - r',\, s - d_r - r')}{\nu_j(t-s)}.    (12)

Hence, the reward-spike correlation of a non-reinforced neuron depends on the correlation of this neuron with the reinforced neuron. The mean weight change for weights to the reinforced neuron is given by

\frac{d}{dt} w_{ki}(t) = \int_0^\infty ds\, f_c(s + d_r)\, \epsilon_r(s) \int_{-\infty}^\infty dr\, W(r)\, \langle \nu_{ki}(t - d_r - s,\, r) \rangle_T.    (13)

This equation basically describes STDP with a learning rate that is proportional to the eligibility function in the time around the reward delay. The mean weight change of neurons j \neq k is given by

\frac{d}{dt} w_{ji}(t) = \int_0^\infty ds\, f_c(s) \int_{-\infty}^\infty dr\, W(r) \int_0^\infty dr'\, \epsilon_r(r') \left\langle \frac{\nu_{kj}(t - d_r - r',\, s - d_r - r')}{\nu_j(t-s)}\, \nu_{ji}(t-s,\, r) \right\rangle_T.    (14)

If the outputs of neurons j and k are uncorrelated, this evaluates to approximately zero (see [6]). The result can be summarized as follows. The reinforced neuron is trained by STDP.
Other neurons are trained by STDP with a learning rate proportional to their correlation with the reinforced neuron. If a neuron is uncorrelated with the reinforced neuron, its learning rate is approximately zero.

4.1 Computer simulations

In order to test the theoretical predictions for the experiment described in the previous section, we have performed a computer simulation with a generic neural microcircuit receiving a global reward signal. This global reward signal increases its value every time a specific neuron (the reinforced neuron) in the circuit fires. The circuit consists of 1000 leaky integrate-and-fire (LIF) neurons (80% excitatory and 20% inhibitory), which are interconnected by conductance-based synapses. The short-term dynamics of synapses was modeled in accordance with experimental data (see [6]). Neurons within the recurrent circuit were randomly connected with probabilities p_{ee} = 0.08, p_{ei} = 0.08, p_{ie} = 0.096 and p_{ii} = 0.064, where the indices ee, ei, ie, ii designate the type of the presynaptic and postsynaptic neurons (excitatory or inhibitory). To reproduce the synaptic background activity of neocortical neurons in vivo, an Ornstein-Uhlenbeck (OU) conductance noise process modeled according to [7] was injected into the neurons, which also elicited spontaneous firing of the neurons in the circuit at an average rate of 4 Hz. In half of the neurons, part of the noise was substituted by random synaptic connections from the circuit, in order to observe how the learning mechanisms work when most of the input conductance of a neuron comes from a larger number of plastic input synapses instead of a static noise process. The function f_c(t) from equation (2) had the form f_c(t) = (t/\tau_e)\, e^{-t/\tau_e} if t \ge 0 and 0 otherwise, with time constant \tau_e = 0.5 s. The reward signal during the simulation was computed according to eq.
(9), with the following shape for \epsilon_r(t):

\epsilon_r(t) = A_r^+ \frac{t}{\tau_r^+} e^{-t/\tau_r^+} - A_r^- \frac{t}{\tau_r^-} e^{-t/\tau_r^-}.    (15)

The parameter values for \epsilon_r(t) were chosen such as to produce a positive reward pulse with a peak delayed by 0.5 s from the spike that caused it, and a long-tailed negative bump so that \int_0^\infty dt\, \epsilon_r(t) = 0.

Figure 1: Computer simulation of the experiment by Fetz and Baker [1]. A) The firing rate of the reinforced neuron (solid line) increases while the average firing rate of 20 other randomly chosen neurons in the circuit (dashed line) remains unchanged. B) Evolution of the average synaptic weight of excitatory synapses connecting to the reinforced neuron (solid line) and to other neurons (dashed line). C) Spike trains of the reinforced neuron at the beginning and at the end of the simulation.

For values of other model parameters see [6]. The learning rule (1) was applied to all synapses in the circuit whose presynaptic and postsynaptic neurons are both excitatory. The simulation was performed for 20 min of simulated biological time with a simulation time step of 0.1 ms. Fig.
1 shows that the firing rate and synaptic weights of the reinforced neuron increase within a few minutes of simulated biological time, while those of the other neurons remain largely unchanged. Note that this reinforcement learning task is more difficult than that of the first computer experiment of [3], where postsynaptic firing within 10 ms after presynaptic firing at a randomly chosen synapse was rewarded, since the relationship between synaptic activity and reward (and hence with STDP) is less direct in this setup. Whereas a very low spontaneous firing rate of 1 Hz was required in [3], this simulation shows that reinforcement learning is also feasible at rate levels which correspond to those reported in [1].

5 Rewarding spike-timings

In order to explore the limits of reward-modulated STDP, we have also investigated a substantially more demanding reinforcement learning scenario. The reward signal d(t) was given in dependence on how well the output spike train S_j^{post} of the neuron j matched some rather arbitrary spike train S^* that was produced by some neuron that received the same n input spike trains as the trained neuron with arbitrary weights w^* = (w_1^*, ..., w_n^*)^T, w_i^* \in \{0, w_{max}\}, but in addition n' - n further spike trains S_{n+1}, ..., S_{n'} with weights w_i^* = w_{max}. This setup provides a generic reinforcement learning scenario, where a quite arbitrary (and not perfectly realizable) spike output is reinforced, but simultaneously the performance of the learner can be evaluated quite clearly according to how well its weights w_1, ..., w_n match those of the target neuron for those n input spike trains which both of them receive.
The reward d(t) at time t is given by

d(t) = \int_{-\infty}^\infty dr\, \kappa(r)\, S_j^{post}(t - d_r)\, S^*(t - d_r - r),    (16)

where the function \kappa(r) with \bar{\kappa} = \int_{-\infty}^\infty ds\, \kappa(s) > 0 describes how the reward signal depends on the time difference between a postsynaptic spike and a target spike, and d_r > 0 is the delay of the reward. Our theoretical analysis below suggests that this reinforcement learning task can in principle be solved by reward-modulated STDP if some constraints are fulfilled. The analysis also reveals which reward kernels \kappa are suitable for this learning setup. The reward correlation for synapse i is (see [6])

D_{ji}(t,s,r) = \int_{-\infty}^\infty dr'\, \kappa(r') \left[ \nu_j^{post}(t - d_r) + \delta(s - d_r) + w_{ji}\, \epsilon(s + r - d_r) \right] \left[ \nu^*(t - d_r - r') + w_i^*\, \epsilon(s + r - d_r - r') \right],    (17)

where \nu_j^{post}(t) = \langle S_j^{post}(t) \rangle_E denotes the mean rate of the trained neuron at time t, and \nu^*(t) = \langle S^*(t) \rangle_E denotes the mean rate of the target spike train at time t. Since weights are changing very slowly, we have w_{ji}(t - s - r) \approx w_{ji}(t). In the following, we will drop the dependence of w_{ji} on t for brevity. For simplicity, we assume that input rates are stationary and uncorrelated. In this case (since the weights are changing slowly), the correlations between inputs and outputs can also be assumed stationary, \nu_{ji}(t,r) = \nu_{ji}(r). We assume that the eligibility function satisfies f_c(d_r) \approx f_c(d_r + r) if |r| is on the time scale of a PSP, the learning window, or the reward kernel, and that d_r is large compared to these time scales.
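In discrete form, the reward rule (16) reads: whenever the trained neuron spiked a delay d_r ago, compare that spike against nearby target spikes and pay out \kappa of the lag. The sketch below uses a crude placeholder kernel (positive for near-coincidence, mildly negative otherwise) rather than the optimized double-exponential kernel discussed below; all constants are illustrative:

```python
# Toy version of equation (16): d(t) is nonzero only if the trained neuron
# spiked at t - d_r; the reward then sums a kernel kappa over the lags to
# nearby target spikes. Spike times are in seconds. kappa here is a crude
# placeholder (reward near-coincidence, punish otherwise), not the optimized
# kernel of the analysis.
D_R = 0.5        # reward delay d_r [s]
TOL = 1e-6       # tolerance for matching spike times on a grid

def kappa(lag):
    return 1.0 if abs(lag) < 0.010 else -0.2   # placeholder reward kernel

def reward(t, post_spikes, target_spikes):
    if not any(abs(t - D_R - s) < TOL for s in post_spikes):
        return 0.0                    # no output spike at t - d_r: no reward
    return sum(kappa((t - D_R) - s) for s in target_spikes
               if abs((t - D_R) - s) < 0.050)  # only nearby target spikes

post = [0.100, 0.300]
target = [0.102, 0.400]
r_hit = reward(0.600, post, target)   # output spike at 0.100 near target 0.102
r_none = reward(0.700, post, target)  # no output spike at 0.200
```

Note that the neuron is never told which target spike it should have matched; it only receives this delayed scalar signal, which is why the constraints derived next are needed for learning to converge.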
Then, for uncorrelated Poisson input spike trains of rate \nu_i^{pre} and the linear Poisson neuron model, the weight change at synapse ji is given by

\frac{dw_{ji}(t)}{dt} \approx \bar{\kappa}\, \bar{f}_c\, \nu^* \nu_i^{pre} \left[ \nu_j^{post} \bar{W} + w_{ji} \bar{W}_\epsilon \right] + \bar{\kappa}\, f_c(d_r)\, \nu_i^{pre} \left[ \nu_j^{post} \bar{W} + w_{ji} \bar{W}_\epsilon \right] \left[ \nu^* + \nu^* w_{ji} + w_i^* \nu_j^{post} \right] + f_c(d_r)\, w_i^* \nu_i^{pre} \left[ \nu_j^{post} \int_{-\infty}^\infty dr\, W(r)\, \epsilon_\kappa(r) + w_{ji} \int_{-\infty}^\infty dr\, W(r)\, \epsilon(r)\, \epsilon_\kappa(r) \right] + f_c(d_r)\, w_i^* w_{ji}\, \nu_i^{pre} \int_{-\infty}^\infty dr\, \epsilon(r)\, \epsilon_\kappa(r),    (18)

where \bar{f}_c = \int_0^\infty dr\, f_c(r), \bar{W} = \int_{-\infty}^\infty dr\, W(r) is the integral over the STDP learning window, \epsilon_\kappa(r) = \int_{-\infty}^\infty dr'\, \kappa(r')\, \epsilon(r - r') is the convolution of the reward kernel with the PSP, and \bar{W}_\epsilon = \int_{-\infty}^\infty dr\, \epsilon(r)\, W(r).

We will now bound the expected weight change for synapses ji with w_i^* = w_{max} and for synapses jk with w_k^* = 0. In this way we can derive conditions under which the expected weight change for the former synapses is positive, and that for the latter type is negative. First, we assume that the integral over the reward kernel is positive. In this case, the weight change is negative for a synapse i with w_i^* = 0 if and only if \nu_i^{pre} > 0 and -\nu_j^{post} \bar{W} > w_{ji} \bar{W}_\epsilon. In the worst case, w_{ji} is w_{max} and \nu_j^{post} is small. We therefore have to guarantee some minimal output rate \nu_{min}^{post} such that even if w_{ji} = w_{max}, this inequality is fulfilled. This could be guaranteed by some noise current. For synapses i with w_i^* = w_{max}, we obtain two more conditions (see [6] for a derivation).
The conditions are summarized in inequalities (19)-(21). If these inequalities are fulfilled and input rates are positive, then the weight vector converges on average from any initial weight vector to w^*:

-\nu_{min}^{post} \bar{W} > w_{max} \bar{W}_\epsilon    (19)

\int_{-\infty}^\infty dr\, W(r)\, \epsilon(r)\, \epsilon_\kappa(r) \ge -\nu_{max}^{post} \bar{W} \int_0^\infty dr\, \epsilon(r)\, \epsilon_\kappa(r)    (20)

\int_{-\infty}^\infty dr\, W(r)\, \epsilon_\kappa(r) > -\bar{W} \bar{\kappa} \left[ \frac{\nu^* \nu_{max}^{post}}{w_{max}} \frac{\bar{f}_c}{f_c(d_r)} + \frac{\nu^*}{w_{max}} + \nu^* + \nu_{max}^{post} \right],    (21)

where \nu_{max}^{post} is the maximal output rate. The second condition is less severe, and should be easily fulfilled in most setups. If this is the case, the first condition (19) ensures that weights with w_i^* = 0 are depressed, while the third condition (21) ensures that weights with w_i^* = w_{max} are potentiated.

Optimal reward kernels: From condition (21), we can deduce optimal reward kernels \kappa. The kernel should be such that the integral \int_{-\infty}^\infty dr\, W(r)\, \epsilon_\kappa(r) is large, while the integral over \kappa is small (but positive). Hence, \epsilon_\kappa(r) should be positive for r > 0 and negative for r < 0.
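This sign requirement on \epsilon_\kappa can be checked numerically. The kernel below is a shifted double exponential with a negative zero-crossing offset, the same family as the kernel used in the experiments that follow; all parameter values are illustrative, and the convolution is a crude Riemann sum:

```python
import math

# A reward kernel kappa(r) with a negative zero-crossing offset T_K, and a
# numerical check of the design rule derived from condition (21): the
# convolution eps_kappa = kappa * eps (PSP kernel) should be positive for
# r > 0 and negative for r < 0. All parameter values are illustrative.
A_KP, A_KM = 1.0, 1.0            # positive scaling constants
TAU_K1, TAU_K2 = 0.025, 0.020    # double-exponential shape constants [s]
T_K = -0.020                     # negative zero-crossing offset [s]
TAU_EPS = 0.010                  # PSP time constant [s]

def kappa(r):
    x = r - T_K
    if x >= 0:
        return A_KP * (math.exp(-x / TAU_K1) - math.exp(-x / TAU_K2))
    return -A_KM * (math.exp(x / TAU_K1) - math.exp(x / TAU_K2))

def eps(s):
    """PSP kernel, normalized to integral 1."""
    return math.exp(-s / TAU_EPS) / TAU_EPS if s >= 0 else 0.0

def eps_kappa(r, dt=1e-4, horizon=0.2):
    # eps_kappa(r) = integral ds eps(s) * kappa(r - s), Riemann sum
    return sum(eps(i * dt) * kappa(r - i * dt)
               for i in range(int(horizon / dt))) * dt
```

With these (assumed) values, eps_kappa comes out positive a PSP-width after a target spike and negative well before it, matching the intuition stated in the text: spiking around or slightly after the target time is rewarded, spiking much too early is punished.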
In the following experiments, we use a simple kernel which satisfies the aforementioned constraints:

\kappa(r) = A_+^\kappa \left( e^{-(r - t_\kappa)/\tau_1^\kappa} - e^{-(r - t_\kappa)/\tau_2^\kappa} \right) if r - t_\kappa \ge 0, and \kappa(r) = -A_-^\kappa \left( e^{(r - t_\kappa)/\tau_1^\kappa} - e^{(r - t_\kappa)/\tau_2^\kappa} \right) otherwise,

where A_+^\kappa and A_-^\kappa are positive scaling constants, \tau_1^\kappa and \tau_2^\kappa define the shape of the two double-exponential functions the kernel is composed of, and t_\kappa defines the offset of the zero-crossing from the origin. The optimal offset from the origin is negative and on the order of tens of milliseconds for usual PSP shapes \epsilon. Hence, reward is positive if the neuron spikes around the target spike or somewhat later, and negative if the neuron spikes much too early.

5.1 Computer simulations

Figure 2: Reinforcement learning of spike times. A) Synaptic weight changes of the trained LIF neuron, for 5 different runs of the experiment. The curves show the average of the synaptic weights that should converge to w_i^* = 0 (dashed lines), and the average of the synaptic weights that should converge to w_i^* = w_{max} (solid lines), with a different shading for each simulation run. B) Comparison of the output of the trained neuron before (upper trace) and after learning (lower trace; the same input spike trains and the same noise inputs were used before and after training for 2 hours). The second trace from above shows those spike times which are rewarded, the third trace shows the target spike train without the additional noise inputs.

Figure 3: Predicted average weight change (black bars) calculated from equation (18), and the estimated average weight change (gray bars) from simulations, presented for 6 different experiments with different parameter settings (see Table 1).^2 A) Weight change values for synapses with w_i^* = w_{max}. B) Weight change values for synapses with w_i^* = 0. Cases where the constraints are not fulfilled are shaded with gray color.

In the computer simulations we explored the learning rule in a more biologically realistic setting, where we used a leaky integrate-and-fire (LIF) neuron with input synaptic connections coming from a generic neural microcircuit composed of 1000 LIF neurons. The synapses were conductance-based, exhibiting short-term facilitation and depression. The trained neuron and the arbitrarily given neuron which produced the target spike train S^* ("target neuron") were both connected to the same randomly chosen 100 excitatory and 10 inhibitory neurons from the circuit. The target neuron had 10 additional excitatory input connections (these weights were set to w_{max}), not accessible to the trained neuron. Only the synapses of the trained neuron connecting from excitatory neurons were set to be plastic. The target neuron had a weight vector with w_i^* = 0 for 0 \le i < 50 and w_i^* = w_{max} for 50 \le i < 110. The generic neural microcircuit from which the trained and the target neurons receive their input had 80% excitatory and 20% inhibitory neurons, interconnected randomly with a probability of 0.1.
The neurons received background synaptic noise as modeled in [7], which caused spontaneous activity of the neurons with an average firing rate of 6.9 Hz. During the simulations, we observed a firing rate of 10.6 Hz for the trained and 19 Hz for the target neuron. The reward was delayed by 0.5 s, and we used the same eligibility trace function f_c(t) as in the simulations for the biofeedback experiment (see [6] for details). The simulations were run for two hours of simulated biological time, with a simulation time step of 0.1 ms. We performed 5 repetitions of the experiment, each time with different randomly generated circuits and different initial weight values for the trained neuron. In each of the 5 runs, the average synaptic weights of synapses with w_i^* = w_{max} and with w_i^* = 0 approach their target values, as shown in Fig. 2A. In order to test how closely the learning neuron reproduces the target spike train S^* after learning, we have performed additional simulations where the same spiking input S_I is applied to the learning neuron before and after we conducted the learning experiment (results are reported in Fig. 2B).

^2 The values in the figure are calculated as \Delta w = (w(t_{sim}) - w(0)) / (w_{max}/2) for the simulations, and as \Delta w = \langle dw/dt \rangle\, t_{sim} / (w_{max}/2) for the predicted value. w(t) is the average weight over synapses with the same value of w^*.

Table 1: Parameter values used for the simulations in Figure 3. Both cases where the constraints are satisfied and where they are not satisfied were covered. PSPs were modeled as \epsilon(s) = e^{-s/\tau_\epsilon}/\tau_\epsilon.

Ex. | \tau_\epsilon [ms] | w_{max} | \nu_{min}^{post} [Hz] | A_+ \cdot 10^6 | A_-/A_+ | \tau_+, \tau_2^\kappa [ms] | A_+^\kappa | t_{sim} [h]
1 | 10 | 0.012 | 10 | 16.62 | 1.05 | 20, 20 | 3.34 | 5
2 | 5 | 0.020 | 7 | 11.08 | 1.02 | 15, 16 | 4.58 | 10
3 | 6 | 0.010 | 20 | 5.54 | 1.10 | 25, 40 | 1.46 | 16
4 | 5 | 0.020 | 7 | 11.08 | 1.07 | 25, 16 | 4.67 | 13
5 | 6 | 0.015 | 10 | 20.77 | 1.10 | 25, 20 | 3.75 | 3
6 | 3 | 0.005 | 25 | 13.85 | 1.01 | 25, 20 | 3.34 | 13

The equations in section 5 define a parameter space for which the trained neuron can learn the target synapse pattern w^*. We have chosen 6 different parameter settings, encompassing cases with satisfied and non-satisfied constraints, and performed experiments where we compare the predicted average weight change from equation (18) with the actual average weight change produced by simulations. Figure 3 summarizes the results. In all 6 experiments, the sufficient conditions (19)-(21) correctly predicted the direction of the weight change. In those cases where these conditions were not met, the weights moved in the opposite direction, suggesting that the theoretically sufficient conditions (19)-(21) might also be necessary.

6 Discussion

We have developed in this paper a theory of reward-modulated STDP. This theory predicts that reinforcement learning through reward-modulated STDP is also possible at biologically more realistic spontaneous firing rates than the average rate of 1 Hz that was used (and argued to be needed) in the extensive computer experiments of [3]. We have also shown, both analytically and through computer experiments, that the result of the fundamental biofeedback experiment in monkeys from [1] can be explained on the basis of reward-modulated STDP. The resulting theory of reward-modulated STDP makes concrete predictions regarding the shape of various functions (e.g.
reward functions) that would optimally support the speed of reward-modulated learning for the generic (but rather difficult) learning tasks where a neuron is supposed to respond to input spikes with specific patterns of output spikes, and only spikes at the right times are rewarded. Further work (see [6]) shows that reward-modulated STDP can in some cases replace supervised training of readout neurons from generic cortical microcircuit models.

Acknowledgment: We would like to thank Gordon Pipa and Matthias Munk for helpful discussions. Written under partial support by the Austrian Science Fund FWF, project # P17229 and project # S9102, and by project # FP6-015879 (FACETS) of the European Union.

References

[1] E. E. Fetz and M. A. Baker. Operantly conditioned patterns of precentral unit activity and correlated responses in adjacent cells and contralateral muscles. J Neurophysiol, 36(2):179–204, Mar 1973.

[2] C. H. Bailey, M. Giustetto, Y.-Y. Huang, R. D. Hawkins, and E. R. Kandel. Is heterosynaptic modulation essential for stabilizing Hebbian plasticity and memory? Nature Reviews Neuroscience, 1:11–20, 2000.

[3] E. M. Izhikevich. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral Cortex Advance Access, January 13:1–10, 2007.

[4] R. V. Florian. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 19(6):1468–1502, 2007.

[5] W. Gerstner and W. M. Kistler. Spiking Neuron Models. Cambridge University Press, Cambridge, 2002.

[6] R. Legenstein, D. Pecevski, and W. Maass. Theory and applications of reward-modulated spike-timing-dependent plasticity. In preparation, 2007.

[7] A. Destexhe, M. Rudolph, J. M. Fellous, and T. J. Sejnowski. Fluctuating synaptic conductances recreate in vivo-like activity in neocortical neurons.
Neuroscience, 107(1):13\u201324, 2001.\n\n\f", "award": [], "sourceid": 643, "authors": [{"given_name": "Dejan", "family_name": "Pecevski", "institution": null}, {"given_name": "Wolfgang", "family_name": "Maass", "institution": null}, {"given_name": "Robert", "family_name": "Legenstein", "institution": null}]}