Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Richard Zemel, Rama Natarajan, Peter Dayan, Quentin Huys*

As animals interact with their environments, they must constantly update estimates about their states. Bayesian models combine prior probabil- ities, a dynamical model and sensory evidence to update estimates op- timally. These models are consistent with the results of many diverse psychophysical studies. However, little is known about the neural rep- resentation and manipulation of such Bayesian information, particularly in populations of spiking neurons. We consider this issue, suggesting a model based on standard neural architecture and activations. We illus- trate the approach on a simple random walk example, and apply it to a sensorimotor integration task that provides a particularly compelling example of dynamic probabilistic computation.

Bayesian models have been used to explain a gamut of experimental results in tasks which require estimates to be derived from multiple sensory cues. These include a wide range of psychophysical studies of perception;13 motor action;7 and decision-making.3, 5 Central to Bayesian inference is that computations are sensitive to uncertainties about afferent and efferent quantities, arising from ignorance, noise, or inherent ambiguity (e.g., the aperture problem), and that these uncertainties change over time as information accumulates and dissipates. Understanding how neurons represent and manipulate uncertain quantities is therefore key to understanding the neural instantiation of these Bayesian inferences.

Most previous work on representing probabilistic inference in neural populations has fo- cused on the representation of static information.1, 12, 15 These encompass various strategies for encoding and decoding uncertain quantities, but do not readily generalize to real-world dynamic information processing tasks, particularly the most interesting cases with stim- uli changing over the same timescale as spiking itself.11 Notable exceptions are the re- cent, seminal, but, as we argue, representationally restricted, models proposed by Gold and Shadlen,5 Rao,10 and Deneve.4

In this paper, we first show how probabilistic information varying over time can be repre- sented in a spiking population code. Second, we present a method for producing spiking codes that facilitate further processing of the probabilistic information. Finally, we show the utility of this method by applying it to a temporal sensorimotor integration task.

1 TRAJECTORY ENCODING AND DECODING

We assume that population spikes R(t) arise stochastically in relation to the trajectory X(t) of an underlying (but hidden) variable. We use RT and XT for the whole trajectory and

spike trains respectively from time 0 to T . The spikes RT constitute the observations and are assumed to be probabilistically related to the signal by a tuning function f (X, i):

```
P (R(i, T )|X(T )) f (X, i) (1)
```

for the spike train of the ith neuron, with parameters i. Therefore, via standard Bayesian inference, RT determines a distribution over the hidden variable at time T , P (X(T )|RT ).

We first consider a version of the dynamics and input coding that permits an analytical examination of the impact of spikes. Let X(t) follow a stationary Gaussian process such that the joint distribution P (X(t1), X(t2), . . . , X(tm)) is Gaussian for any finite collection of times, with a covariance matrix which depends on time differences: Ctt = c(|t - t |). Function c(|t|) controls the smoothness of the resulting random walks. Then,

```
P (X(T )|RT ) p(X(T )) dX(T )P (R X(T ) T |X(T ))P (X(T )|X (T )) (2)
```

where P (X(T )|X(T )) is the distribution over the whole trajectory X(T ) conditional on the value of X(T ) at its end point. If RT are a set of conditionally independent inhomoge- neous Poisson processes, we have

```
P (RT |X(T )) f (X(t d f (X( ), i i ), i) exp - i i) , (3)
```

where ti are the spike times of neuron i in RT . Let = [X(ti )] be the vector of stimulus positions at the times at which we observed a spike and = [(ti )] be the vector of spike positions. If the tuning functions are Gaussian f (X, i) exp(-(X - i)2/22) and sufficiently dense that d f (X, i i) is independent of X (a standard assumption in population coding), then P (RT |X(T )) exp(- - 2/22) and in Equation 2, we can marginalize out X(T ) except at the spike times ti :

P (X(T )|RT ) p(X(T )) d exp -[, X(T )]T C-1 [, X(T )] - - 2 (4) 2 22

and C is the block covariance matrix between X(ti ), x(T ) at the spike times [tt ] and the final time T . This Gaussian integral has P (X(T )|RT ) N ((T ), (T )), with

```
(T ) = CT t(Ctt + I2)-1 = k (T ) = CT T - kCtT (5)
```

CT T is the T, T th element of the covariance matrix and CT t is similarly a row vector. The dependence in on past spike times is specified chiefly by the inverse covariance matrix, and acts as an effective kernel (k). This kernel is not stationary, since it depends on factors such as the local density of spiking in the spike train RT .

For example, consider where X(t) evolves according to a diffusion process with drift:

```
dX = -Xdt + dN (t) (6)
```

where prevents it from wandering too far, N (t) is white Gaussian noise with mean zero and 2 variance. Figure 1A shows sample kernels for this process.

Inspection of Figure 1A reveals some important traits. First, the monotonically decreasing kernel magnitude as the time span between the spike and the current time T grows matches the intuition that recent spikes play a more significant role in determining the posterior over X(T ). Second, the kernel is nearly exponential, with a time constant that depends on the time constant of the covariance function and the density of the spikes; two settings of these parameters produced the two groupings of kernels in the figure. Finally, the fully adaptive kernel k can be locally well approximated by a metronomic kernel k

```
Kernels k and ks Variance ratio Full kernel A B D -0.5
-2 10 10 2 / 0 -4 2 5 Space 10
Kernel size (weight) 0 0.5 0 0.03 0.06 0.09 0.04 0.06 0.08 0.1 t-tspike Time C True stimulus and means Regular, stationary kernel E 0.5 -0.5
0 0 Space Space
-0.5 0.5 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 Time Time
```

Figure 1: Exact and approximate spike decoding with the Gaussian process prior. Spikes are shown in yellow, the true stimulus in green, and P (X(T )|RT ) in gray. Blue: exact inference with nonstationary and red: approximate inference with regular spiking. A Ker- nel samples for a diffusion process as defined by equations 5, 6. B, C: Mean and variance of the inference. D: Exact inference with full kernel k and E: approximation based on metronomic kernel k

the form of decoder used for the network model in the next section.6 Figure 1D shows an example of how well Equation 5 specifies a distribution over X(t) through very few spikes.

Finally, 1E shows a factorized approximation with the stationary kernel similar to that used by Hinton and Brown6 and in our recurrent network:

```
^ t P (X(t)|R(t)) f (X, kst j=0 j ij = exp(-E(X(t), R(t), t)), (7) i i)
```

By design, the mean is captured very well, but not the variance, which in this example grows too rapidly for long interspike intervals (Figure 1B, C). Using a slower kernel im- proves performance on the variance, but at the expense of the mean. We thus turn to the net- work model with recurrent connections that are available to reinstate the spike-conditional characteristics of the full kernel.

2 NETWORK MODEL FORMULATION

Above we considered how population spikes RT specify a distribution over X(T ). We now extend this to consider how interconnected populations of neurons can specify distributions over time-varying variables. We frame the problem and our approach in terms of a two-level network, connecting one population of neurons to another; this construction is intended to apply to any level of processing. The network maps input population spikes R(t) to output population spikes S(t), where input and output evolve over time. As with the input spikes, ST indicates the output spike trains from time 0 to T , and these output spikes are assumed to determine a distribution over a related hidden variable.

For the recurrent and feedforward computation in the network, we start with the de- ceptively simple goal9 of producing output spikes in such a way that the distribution Q(X(T )|ST ) they imply over the same hidden variable X(T ) as the input, faithfully matches P (X(T )|RT ). This might seem a strange goal, since one could surely just lis- ten to the input spikes. However, in order for the output spikes to track the hidden variable, the dynamics of the interactions between the neurons must explicitly capture the dynamics

of the process X(T ). Once this `identity mapping' problem has been solved, more general, complex computations can be performed with ease. We illustrate this on a multisensory integration task, tracking a hidden variable that depends on multiple sensory cues.

The aim of the recurrent network is to take the spikes R(t) as inputs, and produce output spikes that capture the probabilistic dynamics. We proceed in two steps. We first consider the probabilistic decoding process which turns ST into Q(X(t)|ST ). Then we discuss the recurrent and feedforward processing that produce appropriate ST given this decoder. Note that this decoding process is not required for the network processing; it instead provides a computational objective for the spiking dynamics in the system.

We use a simple log-linear decoder based on a spatiotemporal kernel:6

```
Q(X(T )|ST ) exp(-E(X(T ), ST , T )) , where (8) E(X, S T T , T ) = S(j, T - ) j =0 j (X, ) (9)
```

is an energy function, and the spatiotemporal kernels are assumed separable: j(X, ) = gj(X)( ). The spatial kernel gj(X) is related to the receptive field f (X, j) of neuron j and the temporal kernel j(X, ) to k

The dynamics of processing in the network follows a standard recurrent neural architecture for modeling cortical responses, in the case that network inputs R(t) and outputs S(t) are spikes. The effect of a spike on other neurons in the network is assumed to have some simple temporal dynamics, described here again by the temporal kernel ( ):

```
ri(t) = T R(i, T - )( ) s S(j, T - )( ) =0 j (t) = T =0
```

where T is the extent of the kernel. The response of an output neuron is governed by a stochastic spiking rule, where the probability that neuron j spikes at time t is given by:

```
P (S(j, t) = 1) = (uj(t)) = ( w v i ij ri(t) + k kj sk (t - 1)) (10) where () is the logistic function, and W and V are the feedforward and recurrent weights. If ( ) = exp(- ), then uj(T ) = (0)(Wj R(T ) + Vj S(T )) + (1)uj(T - 1); this corresponds to a discretization of the standard dynamics for the membrane potential of a leaky integrate-and-fire neuron: duj = -u dt j +WR+VS, where the leak is determined by the temporal kernel.
```

The task of the network is to make Q(X(T )|ST ) of Equation 8 match P (X(T )|RT ) com- ing from one of the two models above (exact dynamic or approximate stationary kernel). We measure the discrepancy using the Kullback-Leibler (KL) divergence:

```
J = KL [P (X(T )|R t T )||Q(X (T )|ST )] (11) and, as a proof of principle in the experiments below, find optimal W and V by minimizing the KL divergence J using back-propagation through time (BPTT). In or- der to implement this in the most straightforward way, we convert the stochastic spik- ing rule (Equation 10) to a deterministic rule via the mean-field assumption: Sj(t) = ( w v i ij ri(t) + k kj sk (t - 1)). The gradients are tedious, but can be neatly ex- pressed in a temporally recursive form. Note that our current focus in the system is on the representational capability of the system, rather than its learning. Our results establish that the system can faithfully represent the posterior distribution. We return to the issue of more plausible learning rules below.
```

The resulting network can be seen as a dynamic spiking analogue of the recurrent network scheme of Pouget et al.:9 both methods formulate feedforward and recurrent connections so that a simple decoding of the output can match optimal but complex decoding applied to the inputs. A further advantage of the scheme proposed here is that it facilitates downstream processing of the probabilistic information, as the objective encourages the formation of distributions at the output that factorize across the units.

3 RELATED MODELS

Ideas about the representation of probabilistic information in spiking neurons are in vogue. One treatment considers Poisson spiking in populations with regular tuning functions, as- suming that stimuli change slowly compared with the inter-spike intervals.8 This leads to a Kalman filter account with much formal similarity to the models of P (X(T )|RT ). However, because of the slow timescale, recurrent dynamics can be allowed to settle to an underlying attractor. In another approach, the spiking activity of either a single neuron4 or a pair of neurons5 is considered as reporting (logarithmic) probabilistic information about an underlying binary hypothesis. A third treatment proposes that a population of neurons directly represents the (logarithmic) probability over the state of a hidden Markov model.10

Our method is closely related to the latter two models. Like Deneve's4 we consider the transformation of input spikes to output spikes with a fixed assumed decoding scheme so that the dynamics of an underlying process is captured. Our decoding mechanism produces something like the predictive coding apparent in Deneve's scheme, except that here, a neu- ron may not need to spike not only if it itself has recently spiked and thereby conveyed the appropriate information; but also if one of its population neighbors has recently spiked. This is explicitly captured by the recurrent interactions among the population. Our scheme also resembles Rao's10 approach in that it involves population codes. Our representational scheme is more general, however, in that the spatiotemporal decoder defines the relation- ship between output spikes and Q(X(T )|ST ), whereas his method assumes a direct en- coding, with each output neuron's activity proportional to log Q(X(T )|ST ). Our decoder can produce such a direct encoding if the spatial and temporal kernels are delta functions, but other kernels permit coordination amongst the population to take into account temporal effects, and to produce higher fidelity in the output distribution.

Do not remove: This comment is monitored to verify that the site is working properly