#### Authors

Lavi Shpigelman, Koby Crammer, Rony Paz, Eilon Vaadia, Yoram Singer

#### Abstract

We devise and experiment with a dynamical kernel-based system for tracking hand movements from neural activity. The state of the system corresponds to the hand location, velocity, and acceleration, while the system's input are the instantaneous spike rates. The system's state dy- namics is defined as a combination of a linear mapping from the previous estimated state and a kernel-based mapping tailored for modeling neural activities. In contrast to generative models, the activity-to-state mapping is learned using discriminative methods by minimizing a noise-robust loss function. We use this approach to predict hand trajectories on the basis of neural activity in motor cortex of behaving monkeys and find that the proposed approach is more accurate than both a static approach based on support vector regression and the Kalman filter.

1 Introduction

The paper focuses on the problem of tracking hand movements, which constitute smooth spatial trajectories, from spike trains of a neural population. We do so by devising a dynam- ical system which employs a tailored kernel for spike trains along with a linear mapping corresponding to the states' dynamics. Consider a situation where a subject performs free hand movements during a task that requires accurate space and time precision. In the lab, it may be a constrained reaching task while in real life it may be an every day task such as eating. We wish to track the hand position given only spike trains from a recorded neural population. The rationale of such an undertaking is two fold. First, this task can be viewed as a stem towards the development of a Brain Machine Interface (BMI) which gradually and rapidly become a possible future solution for the motor disabled patients. Recent studies of BMIs [13, 3, 10] (being on-line and feedback enabled) show that a relatively small number of cortical units can be used to move a cursor or a robot effectively, even without genera- tion of hand movements and that training of the subjects improves the overall success of the BMIs. Second, an open loop (off-line) movement decoding (see e.g. [7, 1, 15, 11, 8]), while inappropriate for BMIs, is computationally less expensive, easier to implement and allows repeated analysis thus providing a handle to understandings of neural computations in the brain.

Early studies [6] showed that the direction of arm movement is reflected by the population vector of preferred directions weighted by current firing ra tes, suggesting that intended

movement is encoded in the firing rate which, in turn, is modulated by the angle between a unit's preferred direction (PD) and the intended direction. This linear regression approach is still prevalent and is applied, with some variation of the learning methods, in closed and open loop settings. There is relatively little work on the development of dedicated nonlinear methods.

Both movement and neural activity are dynamic and can therefore be naturally modeled by dynamical systems. Filtering methods often employ generative probabilistic models such as the well known Kalman filter [16] or more neurally specialized models [1] in which a cortical unit's spike count is generated by a probability function of its underlying firing rate which is tuned to movement parameters. The movement, being a smooth trajectory, is modeled as a linear transition with (typically additive Gaussian) noise. These methods have the advantage of being aware of the smooth nature of movement and provide models of what neurons are tuned to. However, the requirement of describing a neural population's firing probability as a function of movement state is hard to satisfy without making costly assumptions. The most prominent is the assumption of statistical independence of cells given the movement.

Kernel based methods have been shown to achieve state of the art results in many applica- tion domains. Discriminative kernel methods, such as Support Vector Regression (SVR) forgo the task of modeling neuronal tuning functions. Furthermore, the construction of kernel induced feature spaces, lends itself to efficient implementation of distance measures over spike trains that are better suited to comparing two neural population trajectories than the Euclidean distance in the original space of spike counts per bins [11, 5]. However, SVR is a "static" method that does not take into account the smooth dynamics of the pre- dicted movement trajectory which imposes a statistical dependency between consecutive examples.

This paper introduces a kernel based regression method that incorporates linear dynamics of the predicted trajectories. In Sec. 2 we formally describe the problem setting. We intro- duce the movement tracking model and the associated learning framework in Sec. 3. The resulting learning problem yields a new kernel for linear dynamical systems. We provide an efficient calculation of this kernel and describe our dual space optimization method for solving the learning problem. The experimental method is presented in Sec. 4. Results, underscoring the merits of our algorithm are provided in Sec. 5 and conclusions are given in Sec. 6.

2 Problem Setting

Our training set contains m trials. Each trial (typically indexed by i or j) consists of a pair ti of movement and neural recordings, designated by Yi, Oi . Yi = yi end t is a time t=1 series of movement state values and yi t Rd is the movement state vector at time t in trial i. We are interested in reconstructing position, however, for better modeling, yit may be a vector of position, velocity and acceleration (as is the case in Sec. 4). This trajectory is

observed during model learning and is the inference target. Oi = {ot}tiend t=1 is a time series of neural spike counts and oi t Rq is a vector of spike counts from q cortical units at time t. We wish to learn a function zi = f Oi t 1:t that is a good estimate (in a sense formalized in the sequel) of the movement yit. Thus, f is a causal filtering method.

We confine ourselves to a causal setting since we plan to apply the proposed method in a closed loop scenario where real-time output is required. The partition into separate trajecto- ries is a natural one in a setting where a session is divided into many trials, each consisting of one attempt at accomplishing the basic task (such as reaching movements to displayed targets). In tasks that involve no hitting of objects, hand movements are typically smooth.

End point movement in small time steps is loosely approximated as having constant ac- celeration. On the other hand, neural spike counts (which are typically measured in bins of 50 - 100ms) vary greatly from one time step to the next. In summary, our goal is to devise a dynamic mapping from sequences of neural activities ending at a given time to the instantaneous hand movement characterization (location, velocity, and acceleration).

3 Movement Tracking Algorithm

Our regression method is defined as follows: given a series O Rqtend of observations and, possibly, an initial state y0, the predicted trajectory Z Rdtend is, zt = Azt-1 + W (ot) , tend t > 0 , (1)

where z0 = y0, A Rdd is a matrix describing linear movement dynamics and W Rdq is a weight matrix. (ot) is a feature vector of the observed spike trains at time t and is later replaced by a kernel operator (in the dual formulation to follow). Thus, the state transition is a linear transformation of the previous state with the addition of a non-linear effect of the observation. Note that unfolding the recursion in Eq. (1) yields zt = Aty0 + t At-kW (o k=1 k ) . Assuming that A describes stable dynamics (the real parts of the eigenvalues of A are les than 1), then the current prediction depends, in an exponentially decaying manner, on the previous observations. We further assume that A is fixed and wish to learn W (we describe our choice of A in Sec. 4). In addition, ot may also encompass a series of previous spike counts in a window ending at time t (as is the case in Sec. 4). Also, note that this model (in its non-kernelized version) has an algebraic form which is similar to the Kalman filter (to which we compare our results later).

Primal Learning Problem: The optimization problem presented here is identical to the standard SVR learning problem (see, for example [12]) with the exception that zit is defined as in Eq. (1) while in standard SVR, zt = W (ot) (i.e. without the linear dynamics). Given a training set of fully observed trials Yi, Oi m we define the learning problem i=1 to be

                                                  ti                                  1               m     end    d                     min               W 2 + c                      zi          - yi           .                                                                     t             t                             (2)                      W           2                                        s            s                                                   i=1 t=1 s=1


Where W 2 = (W)2 (is the Forbenius norm). The second term is a sum of training a,b ab errors (in all trials, times and movement dimensions). | | is the insensitive loss and is defined as |v| = max {0, |v| - }. The first term is a regularization term that promotes small weights and c is a fixed constant providing a tradeoff between the regularization term and the training error. Note that to compensate for different units and scales of the movement dimensions one could either define a different s and cs for each dimension of the movement or, conversely, scale the sth movement dimension. The tracking method, combined with the optimization specified here, defines the complete algorithm. We name this method the Discriminative Dynamic Tracker or DDT in short.

A Dual Solution: The derivation of the dual of the learning problem defined in Eq. (2) is rather mundane (e.g. [12]) and is thus omitted. Briefly, we replace the -loss with pairs of slack variables. We then write a Lagrangian of the primal problem and replace zit with its (less-standard) definition. We then differentiate the Lagrangian with respect to the slack variables and W and obtain a dual optimization problem. We present the dual dual problem in a top-down manner, starting with the general form and finishing with a kernel definition. The form of the dual is

      max      - 1 ( - )T G ( - ) + ( - )T y - ( + )T                     2           ,

s.t. ,   [0, c]                                         .       (3)


Note that the above expression conforms to the dual form of SVR. Let equal the size of the movement space (d), multiplied by the total number of time steps in all the training trajecto- ries. , R are vectors of Lagrange multipliers, y R is a column concatenation of T T T all the training set movement trajectories y11 ym tm , = [, . . . , ]T R end

and G R is a Gram matrix (vT denotes transposition). One obvious difference be- tween our setting and the standard SVR lies within the size of the vectors and Gram matrix. In addition, a major difference is the definition of G. We define G here in a hierarchical manner. Let i, j {1, . . . , m} be trajectory (trial) indexes. G is built from blocks indexed by Gij , which are in turn made from basic blocks, indexed by Kij tq as follows

           G11  G1m                                                                      Kij11  Kij1tj                                                                                                          .                                          .        G =           .                                   .                                                                .                . ... .                                                       , Gij =                   ..                     . .                 ..                 ,                .                                        .                                                                                                                      Gm1                             Gmm                                           Kij  Kij                                                                                                         ti       1                                                                                                                                                         end                                  ti          tj                                                                                                                                               end end


where block Gij refers to a pair of trials (i and j). Finally Each basic block, Kij tq refers to a

pair of time steps t and q in trajectories i and j respectively. ti , tj are the time lengths end end of trials i and j. Basic blocks are defined as t q

                                          Kij =                                At-r kij Aq-s T ,                                                    tq                                            rs                                                                         (4)                                                               r=1 s=1


where kij = k oi , oj rs r s is a (freely chosen) basic kernel between the two neural observa- tions oir and ojs at times r and s in trials i and j respectively. For an explanation of kernel operators we refer the reader to [14] and mention that the kernel operator can be viewed as computing oi oj r s where is a fixed mapping to some inner product space. The choice of kernel (being the choice of feature space) reflects a modeling decision that specifies how similarities between neural patterns are measured. The resulting dual form of the tracker is zt = k k Gtk where Gt is the Gram matrix row of the new example.

It is therefore clear from Eq. (4) that the linear dynamic characteristics of DDT results in a Gram matrix whose entries depend on previous observations. This dependency is ex- ponentially decaying as the time difference between events in the trajectories grow. Note that solution of the dual optimization problem in Eq. (3) can be calculated by any stan- dard quadratic programming optimization tool. Also, note that direct calculation of G is inefficient. We describe an efficient method in the sequel.

Efficient Calculation of the Gram Matrix Simple, straight-forward calculation of the Gram matrix is time consuming. To illustrate this, suppose each trial is of length ti = n, end then calculation of each basic block would take (n2) summation steps. We now describe a procedure based on dynamic-programming method for calculating the Gram matrix in a constant number of operations for each basic block.

Omitting the indexing over trials to ease notation, we are interested in calculating the basic block Ktq. First, define Btq = t k k=1 kq At-k . the basic block Ktq can be recursively calculated in three different ways: Ktq = Kt(q-1)AT + Btq (5)

                 Ktq                 = AK(t-1)q + (Bqt)T                                                                                                                (6)

Ktq                 = AK(t-1)(q-1)AT + (Bqt)T + Btq - ktq .                                                                                            (7) Thus, by adding Eq. (5) to Eq. (6) and subtracting Eq. (7) we get               Ktq         = AK(t-1)q + Kt(q-1)AT - AK(t-1)(q-1)AT + ktqI . Btq (and the entailed summation) is eliminated in exchange for a 2D dynamic program with initial conditions: K11 = k11I , K1q = K1(q-1)AT + k1qI , Kt1 = AK(t-1)1 + kt1I.


Table 1: Mean R2, MAE & MSE (across datasets, folds, hands and directions) for each algorithm.

                                                                                   R2                                            MAE                                     MSE                                       Algorithm                              pos.     vel.              accl.          pos.               vel.         accl.         pos.     vel.             accl.                                       Kalman filter                           0.64     0.58              0.30           0.40               0.15         0.37          0.78     0.27             1.16                                       DDT-linear                             0.59     0.49              0.17           0.63               0.41         0.58          0.97     0.50             1.23                                       SVR-Spikernel                          0.61     0.64              0.37           0.44               0.14         0.34          0.76     0.20             0.98                                       DDT-Spikernal                          0.73     0.67              0.40           0.37               0.14         0.34          0.50     0.16             0.91

1

0.8

Scores 2                                     0.6

0.4                                                                                                                                                                                       left hand, X dir.

left hand, Y dir.                                     0.2                 DDT-Spikernel, R                                                                                                                                                      right hand, X dir.

right hand, Y dir.

00           0.2    0.4     0.6          0.8     1       0           0.2         0.4          0.6      0.8        1        0      0.2    0.4       0.6         0.8     1                                                    Kalman filter, R2 Scores                                DDT-linear, R2 Scores                                       SVR-Spikernel, R2 Scores


Figure 1: Correlation coefficients (R2, of predicted and observed hand positions) comparisons of the DDT-Spikernel versus the Kalman filter (left), DDT-linear (center) and SVR-Spikernel (right). Each data point is the R2 values obtained by the DDT-Spikernel and by another method in one fold of one of the datasets for one of the two axes of movement (circle / square) and one of the hands (filled/non-filled). Results above the diagonals are cases were the DDT-Spikernel outperformes.

Suggested Optimization Method. One possible way to solve the optimization problem (essentially, a modification of the method described in [4] for classification) is to sequen- tially solve a reduced problem with respect to a single constraint at a time. Define:

                                         i =                  -                                           -           min                              -                                .                                                                         j          j Gij - yi                                                         j           j Gij - yi                                                                                                                       i,[0,c]                                                             j                                                                i                    j


Then i is the amount of -insensitive error that can be corrected for example i by keeping () () all constant and changing . Optimality is reached by iteratively choosing the j=i i

example with the largest i and changing its () within the [0, c] limits to minimize the i error for this example.

4 Experimental Setting

The data used in this work was recorded from the primary motor cortex of a Rhesus (Macaca Mulatta) monkey (~4.5 kg). The monkey sat in a dark chamber, and up to 8 electrodes were introduced into MI area of each hemisphere. The electrode signals were amplified, filtered and sorted. The data used in this report was recorded on 8 different days and includes hand positions, sampled at 500Hz, spike times of single units (isolated by sig- nal fit to a series of windows) and of multi units (detection by threshold crossing) sampled at 1ms precision. The monkey used two planar-movement manipulanda to control 2 cur- sors on the screen to perform a center-out reaching task. Each trial began when the monkey centered both cursors on a central circle. Either cursor could turn green, indicating the hand to be used in the trial. Then, one of eight targets appeared ('go signal'), the center circle disappeared and the monkey had to move and reach the target to receive liquid reward. The number of multi-unit channels ranged from 5 to 15, the number of single units was 20-27 and the average total was 34 units per dataset. The average spike rate per channel was 8.2 spikes/sec. More information on the recordings can be found in [9].

                                              DDT (Spikernel)        DDT (Spikernel)                                                                                 DDT (Spikernel)

88.1%                          75%                                                  78.7%

100%        Kalman Filter        SVR (Spikernel)         87.5%                        SVR (Spikernel)          91.88%

100%                63.75%               99.4%            80.0%                              98.7%             86.3%

SVR (Spikernel)             78.12%              96.3%      Kalman Filter                             95.6%      Kalman Filter

62.5%                                             86.8%                                                84.4%

DDT (Linear)                          DDT (Linear)                                         DDT (Linear)


Figure 2: Comparison of R2-performance between algorithms. Each algorithm is represented by a vertex. The weight of an edge between two algorithms is the fraction of tests in which the algorithm on top achieves higher R2 score than the other. A bold edge indicates a fraction higher than 95%. Graphs from left to right are for position, velocity, and acceleration respectively.

The results that we present here refer to prediction of instantaneous hand movements during the period from 'Go Signal' to 'Target Reach' times of both hands in successful trials. Note that some of the trials required movement of the left hand while keeping the right hand steady and vise versa. Therefore, although we considered only movement periods of the trials, we had to predict both movement and non-movement for each hand. The cumulative time length of all the datasets was about 67 minutes. Since the correlation between the movements of the two hands tend to zero - we predicted movement for each hand separately, choosing the movement space to be [x, y, vx, vy, ax, ay]T for each of the hands (preliminary results using only [x, y, vx, vy]T were less accurate).

We preprocessed the spike trains into spike counts in a running windows of 100ms (choice of window size is based on previous experience [11]). Hand position, velocity and acceler- ation were calculated using the 500Hz recordings. Both spike counts and hand movement were then sampled at steps of 100ms (preliminary results with step size 50ms were negli- gibly different for all algorithms). A labeled example yi, oi t t for time t in trial i consisted of the previous 10 bins of population spike counts and the state, as a 6D vector for the left or right hand. Two such consecutive examples would than have 9 time bins of spike count overlap. For example, the number of cortical units q in the first dataset was 43 (27 single and 16 multiple) and the total length of all the trials that were used in that dataset is 529 seconds. Hence in that session there are 5290 consecutive examples where each is a 4310 matrix of spike counts along with two 6D vectors of end point movement.

In order to run our algorithm we had to choose base kernels, their parameters, A and c (and , to be introduced below). We used the Spikernel [11], a kernel designed to be used with spike rate patterns, and the simple dot product (i.e. linear regression). Kernel parmeters and c were chosen (and subsequently held fixed) by 5 fold cross validation over half of the first dataset only. We compared DDT with the Spikernel and with the linear kernel to standard SVR using the Spikernel and the Kalman filter. We also obtained tracking results using both DDT and SVR with the standard exponential kernel. These results were slightly less accurate on average than with the Spikernel and are therefore omitted here. The Kalman filter was learned assuming the standard state space model (yt = Ayt-1 + , ot = Hyt +, where , are white Gaussian noise with appropriate correlation matrices) such as in [16]. y belonged to the same 6D state space as described earlier. To ease the comparison - the same matrix A that was learned for the Kalman filter was used in our algorithm (though we show that it is not optimal for DDT), multiplied by a scaling parameter . This parameter was selected to produce best position results on the training set. The selected value is 0.8.

The figures that we show in Sec. 5 are of test results in 5 fold cross validation on the rest of the data. Each of the 8 remaining datasets was divided into 5 folds. 4/5 were used for

                                                                                                                               X              Y                                               R2        MAE    MSE    # Support

14K

position      Position                                                                                    12K                                                                                                                                         Actual                                                                                                                                         DDT-Spikernel                                                                                                                                         SVR-Spikernel                                                                                    10K                  Velocity                                                                              velocity

8K

6K                              Acceleration                                                                                                                acceleration


Figure 3: Effect of on R2, MAE ,MSE and Figure 4: Sample of tracking with the DDT-

number of support vectors. Spikernel and the SVR-Spikernel.

training (with the parameters obtained previously and the remaining 1/5 as test set). This process was repeated 5 times for each hand. Altogether we had 8sets 5folds 2hands = 80 folds.

5 Results We begin by showing average results across all datasets, folds, hands and X/Y directions for the four algorithms that are compared. Table. 1 shows mean Correlation Coefficients (R2, between recorded and predicted movement values), Mean insensitive Absolute Errors (MAE) and Mean Square Errors (MSE). R2 is a standard performance measure, MAE is the error minimized by DDT (subject to the regularization term) and MSE is minimized by the Kalman filter. Under all the above measures the DDT-Spikernel outperforms the rest with the SVR-Spikernel and the Kalman Filter alternating in second place.

To understand whether the performance differences are significant we look at the distribu- tion of position (X and Y) R2 values at each of the separate tests (160 altogether). Figure 1 shows scatter plots of R2 results for position predictions. Each plot compares the DDT- Spikernel (on the Y axis) with one of the other three algorithms (on the X axes). It is clear that in spite large differences in accuracy across datasets, the algorithm pairs achieve similar success with the DDT-Spikernel achieving a better R2 score in almost all cases.

To summarize the significance of R2 differences we computed the number of tests in which one algorithm achieved a higher R2 value than another algorithm (for all pairs, in each of the position, velocity and acceleration categories). The results of this tournament between the algorithms are presented in Figure 2 as winning percentages. The graphs produce a ranking of the algorithms and the percentages are the significances of the ranking between pairs. The DDT-Spikernel is significantly better then the rest in tracking position.

The matrix A in use is not optimal for our algorithm. The choice of scales its effect. When = 0 we get the standard SVR algorithm (without state dynamics). To illustrate the effect of we present in Figure 3 the mean (over 5 folds, X/Y direction and hand) R2 results on the first dataset as a function of . It is clear that the value chosen to minimize position error is not optimal for minimizing velocity and acceleration errors. Another important effect of is the number of the support patterns in the learned model, which drops considerably (by about one third) when the effect of the dynamics is increased. This means that more training points fall strictly within the -tube in training, suggesting that the kernel which tacitly results from the dynamical model is better suited for the problem. Lastly, we show a sample of test tracking results for the DDT-Spikernel and SVR-Spikernel in Figure 4. Note that the acceleration values are not smooth and are, therefore, least aided by the dynamics of the model. However, adding acceleration to the model improves the prediction of position.

6 Conclusion We described and reported experiments with a dynamical system that combines a linear state mapping with a nonlinear observation-to-state mapping. The estimation of the sys- tem's parameters is transformed to a dual representation and yields a novel kernel for tem- poral modelling. When a linear kernel is used, the DDT system has a similar form to the Kalman filter as t . However, the system's parameters are set so as to minimize the regularized -insensitive 1 loss between state trajectories. DDT also bares similarity to SVR, which employs the same loss yet without the state dynamics. Our experiments indi- cate that by combining a kernel-induced feature space, linear state dynamics, and using a robust loss we are able to leverage the trajectory prediction accuracy and outperform com- mon approaches. Our next step toward an accurate brain-machine interface for predicting hand movements is the development of a learning procedure for the state dynamic mapping A and further developments of neurally motivated and compact representations.

Acknowledgments This study was partly supported by a center of excellence grant (8006/00) administered by the ISF, BMBF-DIP, by the U.S. Israel BSF and by the IST Programme of the Eu- ropean Community, under the PASCAL Network of Excellence, IST-2002-506778. L.S. is supported by a Horowitz fellowship.