{"title": "Sequence learning with hidden units in spiking neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1422, "page_last": 1430, "abstract": "We consider a statistical framework in which recurrent networks of spiking neurons learn to generate spatio-temporal spike patterns. Given biologically realistic stochastic neuronal dynamics we derive a tractable learning rule for the synaptic weights towards hidden and visible neurons that leads to optimal recall of the training sequences. We show that learning synaptic weights towards hidden neurons significantly improves the storing capacity of the network. Furthermore, we derive an approximate online learning rule and show that our learning rule is consistent with Spike-Timing Dependent Plasticity in that if a presynaptic spike shortly precedes a postynaptic spike, potentiation is induced and otherwise depression is elicited.", "full_text": "Sequence learning with hidden units\n\nin spiking neural networks\n\nJohanni Brea, Walter Senn and Jean-Pascal P\ufb01ster\n\nDepartment of Physiology\n\nUniversity of Bern\n\nB\u00a8uhlplatz 5\n\nCH-3012 Bern, Switzerland\n\n{brea, senn, pfister}@pyl.unibe.ch\n\nAbstract\n\nWe consider a statistical framework in which recurrent networks of spiking neu-\nrons learn to generate spatio-temporal spike patterns. Given biologically realistic\nstochastic neuronal dynamics we derive a tractable learning rule for the synaptic\nweights towards hidden and visible neurons that leads to optimal recall of the train-\ning sequences. We show that learning synaptic weights towards hidden neurons\nsigni\ufb01cantly improves the storing capacity of the network. 
Furthermore, we derive an approximate online learning rule and show that our learning rule is consistent with Spike-Timing Dependent Plasticity in that if a presynaptic spike shortly precedes a postsynaptic spike, potentiation is induced and otherwise depression is elicited.\n\n1 Introduction\n\nLearning to produce temporal sequences is a general problem that the brain needs to solve. Movements, songs or speech all require the generation of specific spatio-temporal patterns of neural activity that have to be learned. Early attempts to model sequence learning used a simple asymmetric Hebbian learning rule [10, 20, 6] and succeeded in storing sequences of random patterns, but such rules perform poorly as soon as there are temporal correlations between the patterns [3].\n\nLater work on pattern storage or sequence learning recognized the need for matching the storage rule with the recall dynamics [2, 18, 12] and derived the optimal storage rule for a given recall dynamics [2, 18] or an optimal recall dynamics for a given storage rule [12], but did not consider hidden neurons and therefore restricted the class of patterns that can be learned. Other studies [14] included a reservoir of hidden neurons but assumed the weights towards the hidden neurons to be fixed. Finally, Boltzmann machines [1] - which learn to produce a given distribution of patterns with visible and hidden neurons - applied to sequence learning [9, 22, 21] are trained with Contrastive Divergence [8] and either use an approximation that neglects the influence of the future or a non-local and non-causal learning rule.\n\nHere we start by defining a stochastic neuronal dynamics that can be arbitrarily complicated (e.g. with non-Markovian dependencies). This stochastic dynamics defines the overall probability distribution, which is parametrized by the synaptic weights. 
The goal of learning is to adapt the synaptic weights such that the model distribution approximates the target distribution of temporal sequences as well as possible. This can be seen as an extension of the maximum likelihood approach of Barber [2] where we add stochastic hidden neurons with plastic weights. In order to learn the weights, we implement a variant of the Expectation-Maximization (EM) algorithm [5] where we use importance sampling in the expectation step in a way that makes the sampling procedure easy.\n\n[Figure 1 panels A, B: graphical models over hidden states h_{t−1}, h_t (stochastic hidden neurons) and visible states v_{t−1}, v_t (stochastic visible neurons).]\n\nFigure 1: Graphical representation of the conditional dependencies of the joint distribution over visible and hidden sequences. A Graphical model used for the derivation of the learning rule in section 2 and the example in section 4. B Markovian model used in the example with binary neurons in section 3.\n\nThe resulting learning rule is local (but modulated by a global factor), causal and biologically relevant in the sense that it shares important features with Spike-Timing Dependent Plasticity (STDP). We also derive an online version of the learning rule and show numerically that it performs almost as well as the exact batch learning rule.\n\n2 Learning a distribution of sequences\n\nLet us consider temporal sequences v = {v_{t,i} | t = 0 . . . T, i = 1 . . . N_v} of N_v visible neurons over the interval [0, T]. We will use the notation v_t = {v_{t,i} | i = 1 . . . N_v} and v_{t1:t2} = {v_{t,i} | t = t1 . . . t2, i = 1 . . . N_v} to denote parts of the sequence. Note that v = v_{0:T} denotes the whole sequence. Those visible sequences v are drawn i.i.d. from a target distribution P*(v) that must be learned by a model which consists of N_v visible neurons and N_h hidden neurons. 
The model distribution over those visible sequences is denoted by P_θ(v) = Σ_h P_θ(v, h), where θ denotes the model parameters, h = {h_{t,i} | t = 0 . . . T, i = 1 . . . N_h} the hidden temporal sequence and P_θ(v, h) the joint distribution over the visible and the hidden sequences. The natural way to quantify the mismatch between the target distribution P*(v) and the model distribution P_θ(v) is given by the Kullback-Leibler divergence:\n\nD_{KL}(P*(v) || P_θ(v)) = Σ_v P*(v) log [P*(v) / P_θ(v)].   (1)\n\nIf the joint model distribution P_θ(v, h) is differentiable with respect to the model parameters θ, then the sequence learning problem can be phrased as gradient descent on the KL divergence in Eq. (1):\n\nΔθ = η ⟨∂ log P_θ(v, h)/∂θ⟩_{P_θ(h|v) P*(v)},   (2)\n\nwhere η is the learning rate and we used the fact that ∂ log P_θ(v)/∂θ = (1/P_θ(v)) ∂/∂θ Σ_h P_θ(v, h) = Σ_h P_θ(h|v) ∂ log P_θ(v, h)/∂θ. Eq. (2) can be seen as a variant of the EM algorithm [5, 16, 3] where the expectation ⟨·⟩_{P_θ(h|v) P*(v)} corresponds to the E step and the gradient of log P_θ(v, h) is related to the M step¹.\n\nInstead of calculating analytically the true expectation in Eq. (2), it is possible to approximate it by sampling the visible sequences v from the target distribution P*(v) and the hidden sequences from the posterior distribution P_θ(h|v) given the visible ones. Note that the posterior distribution P_θ(h|v) could be hard to sample from. Indeed, at a time t the posterior distribution over h_t does not only depend on the past visible activity but also on the future visible activity, since it is conditioned on the whole visible activity v_{0:T} from time step 0 to T. This poses a real challenge for online algorithms. 
In the case of Hidden Markov Model training, the forward-backward algorithm [4, 19] combines information from the past (by forward filtering) and from the future (by backward smoothing) to calculate P_θ(h|v).\n\n¹ Strictly speaking, the M step of the EM algorithm directly calculates the solution θ_new for which ∂ log P_θ(v, h)/∂θ = 0, whereas in Eq. (2) only one step is done in the direction of the gradient.\n\nIf the statistical model does not have the Markovian property, the problem of calculating P_θ(h|v) (or sampling from it) becomes even harder. Here, we propose an alternative solution that does not require sampling from P_θ(h|v) and does not require the Markovian assumption (see [11, 17] for other approaches on sampling P_θ(h|v)). We exploit that in all neuronal network models of interest, neuronal firing at any time point is conditionally independent given the past activity of the network. Using the chain rule this means that we can write the joint distribution P_θ(v, h) (see Fig. 1A) as\n\nP_θ(v, h) = [P_θ(v_0) ∏_{t=1}^{T} ∏_{i=1}^{N_v} P_θ(v_{t,i} | v_{0:t−1}, h_{0:t−1})] × [P_θ(h_0) ∏_{t=1}^{T} ∏_{i=1}^{N_h} P_θ(h_{t,i} | v_{0:t−1}, h_{0:t−1})],   (3)\n\nwhere the first factor is denoted R_θ(v|h) and the second Q_θ(h|v); R_θ(v|h) is easy to calculate (see below) and Q_θ(h|v) is easy to sample from. The sampling can be accomplished by clamping the visible neurons to a target sequence v and letting the hidden dynamics run, i.e. at time t, h_t is sampled from P_θ(h_t | v_{0:t−1}, h_{0:t−1}).² From Eq. (3), the posterior distribution P_θ(h|v) can be written as\n\nP_θ(h|v) = R_θ(v|h) Q_θ(h|v) / P_θ(v),   (4)\n\nwhere the marginal distribution over the visible sequences v can also be expressed as P_θ(v) = ⟨R_θ(v|h)⟩_{Q_θ(h|v)}. As a consequence, by using Eq. (4), the learning rule in Eq. 
(2) can be rewritten as\n\nΔθ = η Σ_{v,h} P*(v) P_θ(h|v) ∂ log P_θ(v, h)/∂θ = η Σ_{v,h} P*(v) Q_θ(h|v) [R_θ(v|h)/P_θ(v)] ∂ log P_θ(v, h)/∂θ = η ⟨ [R_θ(v|h) / ⟨R_θ(v|h′)⟩_{Q_θ(h′|v)}] ∂ log P_θ(v, h)/∂θ ⟩_{Q_θ(h|v) P*(v)}.   (5)\n\nInstead of calculating the true expectation, Eq. (5) can be evaluated by using N samples (see Algorithm 1), where the factor γ_θ(v, h) := R_θ(v|h) / ⟨R_θ(v|h′)⟩_{Q_θ(h′|v)} acts as the importance weight [15]. Note that in the absence of hidden neurons, this factor γ_θ(v, h) is equal to one and the maximum likelihood learning rule [2, 18] is recovered.\n\n² Note that for other conditional dependencies it might be reasonable to split P_θ(h|v) differently. For example, in models with the structure of Hidden Markov Models one could make use of the fact that P_θ(h|v) = ∏_{t=0}^{T−1} P_θ(h_t | v_{0:t}, h_{t+1}) = ∏_{t=0}^{T−1} [P_θ(h_{t+1}|h_t) / P_θ(h_{t+1}|v_{0:t})] P_θ(h_t|v_{0:t}), take the product of filtering distributions Q_θ(h|v) = ∏_{t=0}^{T−1} P_θ(h_t|v_{0:t}) to sample from, and use the importance weights R_θ(v, h) = ∏_{t=0}^{T−1} P_θ(h_{t+1}|h_t) / P_θ(h_{t+1}|v_{0:t}). Following the reasoning in the main text one finds an alternative to the forward-backward algorithm [4, 19] that might be interesting to investigate further.\n\nAlgorithm 1 Sequence learning (batch mode)\nSet an initial θ\nwhile θ not converged do\n  v ∼ P*(v)\n  α(v) = 0, P_θ(v) = 0\n  for i = 1 . . . 
N do\n    h ∼ Q_θ(h|v)\n    α(v) ← α(v) + R_θ(v|h) ∂ log P_θ(v, h)/∂θ\n    P_θ(v) ← P_θ(v) + (1/N) R_θ(v|h)\n  end for\n  θ ← θ + η α(v) / P_θ(v)\nend while\nreturn θ\n\n[Figure 2 panels A-F, H-J: raster plots of unit number versus time step; panel G: performance versus learning step.]\n\nFigure 2: Learning a non-Markovian sequence of temporally correlated and linearly dependent states with different learning rules. A The target distribution contained only this training pattern for 30 visible neurons and 30 time steps. B-F, H-J Overlay of 20 recalls after learning with 15 000 training pattern presentations: B with only visible neurons and a simple asymmetric Hebb rule (see main text), C only visible neurons and learning rule Eq. (5), D static weights towards 30 hidden neurons (Reservoir Computing), E learning rule Eq. (5), F online approximation Eq. (14). G Learning curves for the training pattern in A for only visible neurons (black line), static weights towards hidden (blue line), online learning approximation (purple line) and exact learning rule (red line). The performance was measured as one minus the average Hamming distance per neuron per time step (see main text). H A training pattern that exhibits a gap of 5 time steps. I Recall with a network of 30 visible and 10 hidden neurons without learning the weights towards hidden neurons. J Recall after training the same network with learning rule Eq. 
(5).\n\n3 Binary neurons\n\nIn order to illustrate the learning rule given by Eq. (5), let us consider sequences of binary patterns. Let x denote the activity of the visible and hidden neurons, i.e. x = (v, h). Since the individual neurons are binary, x_{t,i} ∈ {−1, 1}, their distribution is given by P_θ(x_{t,i} | x_{0:t−1}) = (ρ_{t,i}δt)^{(1+x_{t,i})/2} (1 − ρ_{t,i}δt)^{(1−x_{t,i})/2}, where the firing rate ρ_{t,i} of neuron i at time t is given by a monotonically increasing (and non-linear) function g of its membrane potential u_{t,i}, i.e.\n\nρ_{t,i} = g(u_{t,i}) with u_{t,i} = Σ_j w_{ij} x_{t−1,j}.   (6)\n\nNote that these assumptions lead to Markovian neuronal dynamics, i.e. P_θ(x_{t,i} | x_{0:t−1}) = P_θ(x_{t,i} | x_{t−1}) (see Fig. 1B). Further calculations will be slightly simplified if we assume that the non-linear function g is constrained by the following differential equation: dg(u)/du = βg(u)(1 − g(u)δt). Note that in the limit δt → 0, this function is an exponential, i.e. g(u) = g_0 exp(βu), and for finite δt it is sigmoidal and takes the form g(u) = δt^{−1} (1 + ((g_0 δt)^{−1} − 1) exp(−βu))^{−1}, where we constrained the solutions such that g(0) = g_0 in order to be consistent with the case δt → 0.\n\nFor the distribution over the initial conditions P_θ(v_0) and P_θ(h_0) we choose delta distributions such that v_0 is equal to the first state of the training sequence and h_0 is an arbitrary but fixed vector of binary values. 
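To make the binary-neuron model concrete, here is a minimal numerical sketch (our own illustration, not code from the paper) of the dynamics of Eq. (6) and the batch log-likelihood gradient of Eq. (8); the network size N and the values of β, g_0 and δt are arbitrary choices.

```python
import numpy as np

# Sketch of the binary-neuron model of section 3: units x_{t,i} in {-1, 1},
# per-step firing probability g(u)*dt with the sigmoidal gain of Eq. (6),
# and the log-likelihood gradient of Eq. (8). N, BETA, G0, DT are arbitrary.
rng = np.random.default_rng(0)
BETA, G0, DT = 1.0, 0.5, 0.1

def g(u):
    """Gain solving dg/du = beta*g*(1 - g*dt) with g(0) = g0 (sigmoidal)."""
    return (1.0 / DT) / (1.0 + (1.0 / (G0 * DT) - 1.0) * np.exp(-BETA * u))

def step(w, x_prev):
    """Sample x_t from P(x_{t,i} = +1 | x_{t-1}) = g(u_{t,i})*dt (Markovian)."""
    u = w @ x_prev
    return np.where(rng.random(len(u)) < g(u) * DT, 1.0, -1.0)

def log_grad(w, x):
    """Eq. (8): d log P_w(x)/dw_ij = beta/2 * sum_t (x_t - <x_t>) x_{t-1}^T."""
    grad = np.zeros_like(w)
    for t in range(1, len(x)):
        u = w @ x[t - 1]
        mean_x = g(u) * DT - (1.0 - g(u) * DT)  # <x_{t,i}> for {-1, 1} units
        grad += 0.5 * BETA * np.outer(x[t] - mean_x, x[t - 1])
    return grad

N = 5                                   # total number of (visible + hidden) units
w = 0.1 * rng.standard_normal((N, N))   # random initial weights
x = [np.ones(N)]                        # arbitrary fixed initial state
for _ in range(20):                     # run the free dynamics for 20 steps
    x.append(step(w, x[-1]))
x = np.array(x)
grad = log_grad(w, x)                   # one gradient entry per synapse
print(grad.shape)                       # prints: (5, 5)
```

For a clamped visible sequence, this gradient, reweighted by the importance factor γ_θ(v, h) of Eq. (5), would give one sample of the batch update in Algorithm 1.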
If we assume that the weights w_{ij} are the only adaptable parameters in this model, we have\n\n∂ log P_w(x_{t,i} | x_{0:t−1}) / ∂w_{ij} = (1/2) [ (1 + x_{t,i}) g′(u_{t,i})/g(u_{t,i}) − (1 − x_{t,i}) g′(u_{t,i})δt / (1 − g(u_{t,i})δt) ] ∂u_{t,i}/∂w_{ij}.   (7)\n\nWith the above assumption on g(u) and Eq. (3) and (6) we find\n\n∂ log P_w(x) / ∂w_{ij} = (β/2) Σ_{t=1}^{T} (x_{t,i} − ⟨x_{t,i}⟩_{P_θ(x_{t,i}|x_{t−1})}) x_{t−1,j},   (8)\n\nwhere ⟨x_{t,i}⟩_{P_θ(x_{t,i}|x_{t−1})} = g(u_{t,i})δt − (1 − g(u_{t,i})δt) and the indices i and j run over all visible and hidden neurons. The factor R_w(v|h) can be expressed as\n\nR_w(v|h) = exp( (1/2) Σ_{t=0}^{T} Σ_{i=1}^{N_v} (1 + v_{t,i}) log(ρ_{t,i}δt) + (1 − v_{t,i}) log(1 − ρ_{t,i}δt) ).   (9)\n\n[Figure 3 panels A, B: performance versus number of hidden units (A) and versus sequence length (B).]\n\nFigure 3: Adding trainable hidden neurons leads to much better recall performance than having static hidden neurons or no hidden neurons at all. A Comparison of the performance after 20000 learning cycles between static (blue curve) and dynamic weights (red curve) towards hidden neurons for a network with 30 visible and different numbers of hidden neurons in a training task with an uncorrelated random pattern of length 60 time steps. For B we generated random, uncorrelated sequences of different length and compared the performance after 20000 learning cycles for only visible neurons (black curve), static weights towards hidden (blue curve) and dynamic weights towards hidden (red curve).\n\nLet us now consider a simple case (Fig. 2) where the distribution over sequences is a delta distribution P*(v) = δ(v − v*) around a single pattern v* (Fig. 2A) which is made of a set of temporally correlated and linearly dependent states {v*_t}_{t=0}^{T}, i.e. a non-Markovian pattern, thus making it a difficult pattern to learn with a simple asymmetric Hebb rule Δw_{ij} ∝ Σ_{t=0}^{T} v*_{t+1,i} v*_{t,j} (Fig. 2B) or with only visible neurons (Fig. 2C), which are both Markovian learning rules. The performance was measured as one minus the Hamming distance per visible neuron and time step, 1 − (T N_v)^{−1} Σ_{t,i} |v_{t,i} − v*_{t,i}|/2, between target pattern and recall pattern, averaged over 100 runs. Adding hidden neurons without learning the weights towards hidden neurons is similar to the idea used in the framework of Reservoir Computing (for a review see [13]): the visible states feed a fixed reservoir of neurons that returns a non-linear transformation of the input. Only the readout from hidden to visible neurons and, in our case, the recurrent connections in the visible layer are trained. To assure a sensible distribution of weights towards hidden units, we used the weights that were obtained after learning with Eq. (5) and reshuffled them. Obviously, without training the reservoir the performance is always worse compared to a system with an equal number of hidden neurons but dynamic weights (Fig. 2E and 2F). With only a few hidden neurons our rule is also capable of learning patterns where the visible neurons are silent during a few time steps. The training pattern in Fig. 2H exhibits a gap of 5 time steps. After learning the weights towards 10 hidden neurons with learning rule Eq. (5), recall performance is nearly perfect (see Fig. 2J). With only visible neurons (not shown in Fig. 2) or static weights towards hidden neurons the time gap was not learned (see Fig. 
2I).\n\n[Figure 4: weight change Δw (arbitrary units) as a function of the spike-time difference t_post − t_pre (ms), from −40 ms to 40 ms.]\n\nFigure 4: The learning rule Eq. (11) is compatible with Spike-Timing Dependent Plasticity (STDP): the weight gets potentiated if a presynaptic spike is followed by a postsynaptic spike and depressed otherwise. The time course of the postsynaptic potential and the refractory kernel is given in the text.\n\nIn Fig. 3 we again used delta target distributions P*(v) = δ(v − v*) with random uncorrelated patterns v* of different length. Each model was trained with 20000 pattern presentations. For a pattern of length 2N_v = 60, only N_v/2 = 15 trainable hidden neurons are sufficient to reach perfect recall (see Fig. 3A). This is in clear contrast to the case of static hidden weights. Again the static weights were obtained by reshuffling those that we obtained after learning with Eq. (5). Fig. 3B compares the capacity of our learning rule with N_h = N_v = 30 hidden neurons to the case of no hidden neurons or static weights towards hidden neurons. Without learning the weights towards hidden neurons the performance drops to almost chance level for sequences of 45 or more time steps, whereas with our learning rule this decrease of performance occurs only at sequences of 100 or more time steps.\n\n4 Limit to Continuous Time\n\nStarting from the neurons in the last section we show that in the limit to continuous time we can implement the sequence learning task with stochastic spiking neurons [7].\n\nFirst note that the state of a neuron at time t in the model described in the previous section is fully defined by u_{t,i} := Σ_j w_{ij} x_{t−1,j} (see Eq. (6)) and its spiking activity x_{t,i}. The weighted sum Σ_j w_{ij} x_{t−1,j} is the response of neuron i to the spikes of its presynaptic neurons and its own spikes. The terms in this sum depend on the previous time step only. 
In a more realistic model the postsynaptic neuron feels the influence of presynaptic spikes through a perturbation of the membrane potential on the order of a few milliseconds, which in the limit to continuous time clearly cannot be modeled by a one-time-step response. For a more realistic model we replace u_{t,i} in Eq. (6) by\n\nu_{t,i} = Σ_{s=1}^{∞} κ_s x_{t−s,i} + Σ_{j≠i} w_{ij} Σ_{s=1}^{∞} ε_s x_{t−s,j},   (10)\n\nwith the abbreviations x^κ_{t,i} := Σ_{s=1}^{∞} κ_s x_{t−s,i} and x^ε_{t,j} := Σ_{s=1}^{∞} ε_s x_{t−s,j}, where x_{t−s,i} ∈ {0, 1}. The kernel ε models the time course of the response to a presynaptic spike and κ the refractoriness. Our model holds for any choices of ε and κ, including for example a hard refractory period where the neuron is forced not to spike.\n\nIn order to take the limit δt → 0 in Eq. (9) we note that we can scale R_w(v|h) without changing the learning rule Eq. (5), since only the ratio R_θ(v|h) / ⟨R_θ(v|h′)⟩_{Q_θ(h′|v)} enters there. We use the scaling R_w(v|h) → R̃_w(v|h) := (g_0 δt)^{−S_v} R_w(v|h), where S_v denotes the total number of spikes in the visible sequence v, i.e. S_v = Σ_{t=0}^{T} Σ_{i=1}^{N_v} v_{t,i}. Note that for (0, 1)-units the expectation in Eq. (8) becomes ⟨x_{t,i}⟩_{P_θ(x_{t,i}|x_{t−1})} = g(u_{t,i})δt = ρ_{t,i}δt. Now we take the limit δt → 0 in Eq. (8) and (9) and find\n\n∂ log P_w(x) / ∂w_{ij} = ∫_0^T dt β(x_i(t) − ρ_i(t)) x^ε_j(t),   (11)\n\nR̃_w(v|h) = exp( ∫_0^T dt Σ_{i=1}^{N_v} (β v_i(t) u_i(t) − ρ_i(t)) ),   (12)\n\nwhere the training pattern runs from time 0 to T, x_i(t) = Σ_{t_i^{(f)}} δ(t − t_i^{(f)}) is the sum of delta spikes of neuron i at times t_i^{(f)}, and x^ε_j(t) = ∫ ds ε(s) x_j(t − s) (and similarly x^κ_i(t)) is the convolution of presynaptic spike trains with the response kernel ε(t). 
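As a numerical sanity check of the continuous-time rule, the following sketch (our own, not from the paper) discretizes Eqs. (10)-(11) on a fine grid and verifies the STDP-like asymmetry of the weight change; the kernel time constants follow Fig. 4 (τ_m = 10 ms, τ_s = 2 ms), while the grid step, β, g_0 and the constant escape rate are arbitrary illustration choices.

```python
import numpy as np

# Discretized sketch of Eqs. (10)-(11): spike trains are 0/1 arrays on a grid,
# x_eps is the causal convolution with the PSP kernel, and the weight change is
# a Riemann sum of beta * (x_post(t) - rho(t)) * x_eps_pre(t).
DT = 0.5                   # ms, grid step (arbitrary)
TAU_M, TAU_S = 10.0, 2.0   # ms, kernel time constants from Fig. 4
BETA, G0 = 1.0, 0.01       # escape-rate parameters (arbitrary)
T_MAX = 100.0              # ms, simulated interval
N_STEPS = int(T_MAX / DT)

# PSP kernel eps(s) = exp(-s/tau_m) - exp(-s/tau_s); eps(0) = 0, so the
# convolution below never lets a spike influence its own time bin.
s = np.arange(0.0, 50.0, DT)
eps = np.exp(-s / TAU_M) - np.exp(-s / TAU_S)

def spike_train(times_ms):
    """0/1 spike train on the grid with spikes at the given times."""
    train = np.zeros(N_STEPS)
    for t in times_ms:
        train[int(round(t / DT))] = 1.0
    return train

def x_eps(spikes):
    """x_eps(t) = sum_s eps(s) x(t - s), the filtered presynaptic trace."""
    return np.convolve(spikes, eps)[:N_STEPS]

def delta_w(post, pre, rho):
    """Riemann sum of Eq. (11): beta * int (x_post(t) - rho(t)) x_eps_pre(t) dt."""
    return BETA * np.sum((post - rho * DT) * x_eps(pre))

rho = np.full(N_STEPS, G0)       # constant escape rate, for illustration only
pre = spike_train([20.0])
dw_causal = delta_w(spike_train([25.0]), pre, rho)   # pre 5 ms before post
dw_acausal = delta_w(spike_train([15.0]), pre, rho)  # post 5 ms before pre
print(dw_causal > 0.0, dw_acausal < 0.0)             # prints: True True
```

The asymmetry arises because x^ε_pre is nonzero only after the presynaptic spike: a later postsynaptic spike samples a positive trace (potentiation), while an earlier one leaves only the negative −ρ term (depression), matching Fig. 4.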
With neuron i's response to past spiking activity u_i(t) = x^κ_i(t) + Σ_{j≠i} w_{ij} x^ε_j(t) and the escape rate function ρ_i(t) = g_0 exp(βu_i(t)) we recovered the defining equations of a simplified stochastic spike response model [7].\n\nIn Fig. 4 we display the weight change after forcing two neurons to fire with a fixed time lag. For the figure we used the kernels ε_s ∝ exp(−s/τ_m) − exp(−s/τ_s) and κ_s ∝ −exp(−s/τ_m) with τ_m = 10 ms and τ_s = 2 ms. Our learning rule is consistent with STDP in the sense that a presynaptic spike followed by a postsynaptic spike leads to potentiation and to depression otherwise. Note that this result was also found in [18].\n\n5 Approximate online version\n\nWithout hidden neurons the learning rule found by using Eq. (11) is straightforward to implement in an online way, where the parameters are updated at every moment in time according to dw_{ij}/dt ∝ (x_i(t) − ρ_i(t)) x^ε_j(t), instead of waiting with the update until a training batch has finished. Finding an online version of the learning algorithm for networks with hidden neurons turns out to be a challenge, since we need to know the whole sequences v and h in order to evaluate the importance factor R_θ(v|h) / ⟨R_θ(v|h′)⟩_{Q_θ(h′|v)}. Here we propose to use in each time step an approximation of the importance factor based on the network dynamics during the preceding period of typical sequence length and multiply it by the low-pass filtered change of parameters. 
We write this section with x_i(t) ∈ {0, 1}, but similar expressions are easily found for x_i(t) ∈ {−1, 1}.\n\nAlgorithm 2 Sequence learning (online mode)\nSet an initial w_{ij}, e_{ij}, a, r̄, t\nwhile w_{ij} not converged do\n  if t mod N T == 0 then\n    v ∼ P*(v)\n  end if\n  s = t mod T\n  if s < τ then h(s) ∼ P(h(s)) else h(s) ∼ P_w(h(s) | past spiking activity) end if\n  x(s) = (v(s), h(s))\n  e_{ij} ← (1 − δt/T) e_{ij} + β(x_i(s) − ρ_i(s)) x^ε_j(s)\n  a ← (1 − δt/T) a + Σ_{i=1}^{N_v} (β v_i(s) u_i(s) − ρ_i(s))\n  r̄ ← (1 − δt/(N T)) r̄ + exp(a)\n  w_{ij} ← w_{ij} + η (exp(a)/r̄) e_{ij}\n  t ← t + δt\nend while\nreturn w_{ij}\n\nIn Eq. (13a) and (13b) we summarize how to use low-pass filters to approximate the integrals in Eq. (11) and Eq. (12). The time constant of the low-pass filter is chosen to match the sequence length T. To find an online estimate of ⟨R_θ(v, h′)⟩_{Q_θ(h′|v)} we assume that a training pattern v ∼ P*(v) is presented a few times in a row and after time N T, with N ∈ ℕ, N ≫ 1, a new training pattern is picked from the training distribution. Under this assumption we can replace the average over hidden sequences by a low-pass filter of r with time constant N T, see Eq. (13c). At the beginning of each pattern presentation - i.e. 
during the time interval [0, τ), with τ on the order of the kernel time constant τ_m - the hidden activity h(s) is drawn from a given distribution P(h(s)).\n\nd e_{ij}(t)/dt = −(1/T) e_{ij}(t) + β(x_i(t) − ρ_i(t)) x^ε_j(t),   with e_{ij}(T) ≈ ∂ log P_w(x)/∂w_{ij},   (13a)\n\nd a(t)/dt = −(1/T) a(t) + Σ_{i=1}^{N_v} (β v_i(t) u_i(t) − ρ_i(t)),   with r(t) := exp(a(t)) and exp(a(T)) ≈ R_w(v|h),   (13b)\n\nN T d r̄(t)/dt = −r̄(t) + r(t),   with r̄(N T) ≈ ⟨R_θ(v, h′)⟩_{Q_θ(h′|v)}.   (13c)\n\nFinally we learn the model parameters in each time step according to\n\nd w_{ij}(t)/dt = η (r(t)/r̄(t)) e_{ij}(t).   (14)\n\nThis online algorithm is certainly a rough approximation of the batch algorithm. Nevertheless, when applied to the challenging example (Fig. 2A) in section 3, the performance of the online rule is close to that of the batch rule (Fig. 2F, G).\n\n6 Discussion\n\nLearning long and temporally correlated sequences with neural networks is a difficult task. In this paper we suggested a statistical model with hidden neurons and derived a learning rule that leads to optimal recall of the learned sequences given the neuronal dynamics. The learning rule is derived by minimizing the Kullback-Leibler divergence from training distribution to model distribution with a variant of the EM algorithm, where we use importance sampling to draw hidden sequences given the visible training sequence. By choosing an appropriate distribution in the importance sampling step we are able to circumvent inference, which usually makes the training of non-Markovian models hard. The resulting learning algorithm consists of a local term modulated by a global factor. 
We showed that it is ready to be implemented with biologically realistic neurons and that an approximate online version exists.\n\nOur approach follows the ideas outlined in [2], where sequence learning was considered with visible neurons. Here we extended this model by adding stochastic hidden neurons that help to perform well with sequences of linearly dependent states - including non-Markovian sequences - or long sequences. As in [18] we look at the limit of continuous time and find that the learning rule is consistent with Spike-Timing Dependent Plasticity. In contrast to Reservoir Computing [13] we train the weights towards hidden neurons, which clearly helps to improve performance. Our learning rule does not need a “wake” and a “sleep” phase as we know it from Boltzmann machines [1, 8].\n\nViewed in a different light, our learning algorithm has a nice interpretation: as in reinforcement learning, the hidden neurons explore different sequences, where each trial leads to a global reward signal that modulates the weight change. However, in contrast to common reinforcement learning, the reward is not provided by an external teacher but depends solely on the internal dynamics, and the visible neurons do not explore but are clamped to the training sequence.\n\nTo make our model even more biologically relevant, future work should aim for a biological implementation of the global importance factor that depends on the spike timing and the membrane potential of all the visible neurons (see Eq. (9)). It would also be interesting to study online approximations of the learning algorithm in more detail or its application to models with the Hidden Markov structure.\n\nAcknowledgments\n\nThe authors thank Robert Urbanczik for helpful discussions. 
This work was supported by the Swiss National Science Foundation (SNF), grant 31-133094, and a grant from the Swiss SystemsX.ch initiative (Neurochoice, evaluated by the SNF).\n\nReferences\n\n[1] D. Ackley and G. E. Hinton. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.\n\n[2] D. Barber. Learning in spiking neural assemblies. Advances in Neural Information Processing Systems, 15, 2003.\n\n[3] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2011. In press.\n\n[4] L. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.\n\n[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.\n\n[6] A. Düring, A. Coolen, and D. Sherrington. Phase diagram and storage capacity of sequence processing neural networks. Journal of Physics A: Mathematical and General, 31:8607, 1998.\n\n[7] W. Gerstner and W. M. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.\n\n[8] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.\n\n[9] G. E. Hinton and A. Brown. Spiking Boltzmann machines. Advances in Neural Information Processing Systems, 12, 2000.\n\n[10] J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8):2554, 1982.\n\n[11] P. Latham and J. W. Pillow. Neural characterization in partially observed populations of spiking neurons. Advances in Neural Information Processing Systems, 20:1161–1168, 2008.\n\n[12] M. 
Lengyel, J. Kwag, O. Paulsen, and P. Dayan. Matching storage and recall: hippocampal spike timing-dependent plasticity and phase response curves. Nature Neuroscience, 8(12):1677–1683, 2005.\n\n[13] M. Lukoševičius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009.\n\n[14] W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002.\n\n[15] D. J. C. MacKay. Information Theory, Inference & Learning Algorithms. Cambridge University Press, 2002.\n\n[16] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley and Sons, 1997.\n\n[17] Y. Mishchenko and L. Paninski. Efficient methods for sampling spike trains in networks of coupled neurons. The Annals of Applied Statistics, 5(3):1893–1919, 2011.\n\n[18] J.-P. Pfister, T. Toyoizumi, D. Barber, and W. Gerstner. Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural Computation, 18(6):1318–1348, 2006.\n\n[19] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.\n\n[20] H. Sompolinsky and I. Kanter. Temporal association in asymmetric neural networks. Physical Review Letters, 57(22):2861–2864, 1986.\n\n[21] I. Sutskever, G. E. Hinton, and G. Taylor. The Recurrent Temporal Restricted Boltzmann Machine. Advances in Neural Information Processing Systems, 21:1601–1608, 2009.\n\n[22] G. Taylor, G. E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. 
Advances in Neural Information Processing Systems, 19:1345–1352, 2007.", "award": [], "sourceid": 826, "authors": [{"given_name": "Johanni", "family_name": "Brea", "institution": null}, {"given_name": "Walter", "family_name": "Senn", "institution": null}, {"given_name": "Jean-pascal", "family_name": "Pfister", "institution": null}]}