{"title": "Spikernels: Embedding Spiking Neurons in Inner-Product Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 141, "page_last": 148, "abstract": null, "full_text": "Spikernels:\n\nEmbedding Spiking Neurons\n\nin Inner-Product Spaces\n\nLavi Shpigelman\u0002\u0001 Yoram Singer Rony Paz\u0001\u0002\u0003 Eilon Vaadia\u0001\u0005\u0004\n\n School of computer Science and Engineering\n\n\u0001 Interdisciplinary Center for Neural Computation\n\u0003 Dept. of Physiology, Hadassah Medical School\nThe Hebrew University Jerusalem, 91904, Israel\n\n{shpigi,singer}@cs.huji.ac.il\n{ronyp,eilon}@hbf.huji.ac.il\n\nAbstract\n\nInner-product operators, often referred to as kernels in statistical learning, de-\n\ufb01ne a mapping from some input space into a feature space. The focus of\nthis paper is the construction of biologically-motivated kernels for cortical ac-\ntivities. The kernels we derive, termed Spikernels, map spike count sequences\ninto an abstract vector space in which we can perform various prediction tasks.\nWe discuss in detail the derivation of Spikernels and describe an ef\ufb01cient al-\ngorithm for computing their value on any two sequences of neural population\nspike counts. We demonstrate the merits of our modeling approach using the\nSpikernel and various standard kernels for the task of predicting hand move-\nment velocities from cortical recordings. In all of our experiments all the ker-\nnels we tested outperform the standard scalar product used in regression with\nthe Spikernel consistently achieving the best performance.\n\n1\n\nIntroduction\n\nNeuronal activity in primary motor cortex (MI) during multi-joint arm reaching movements in 2-\nD and 3-D [1, 2] and drawing movements [3] has been used extensively as a test bed for gaining\nunderstanding of neural computations in the brain. Most approaches assume that information is\ncoded by \ufb01ring rates, measured on various time scales. 
The tuning curve approach models the average firing rate of a cortical unit as a function of some external variable, such as the frequency of an auditory stimulus or the direction of a planned movement. Many studies of motor cortical areas [4, 2, 5, 3, 6] showed that while single units are broadly tuned to movement direction, a relatively small population of cells (tens to hundreds) carries enough information to allow for accurate prediction. Such broad tuning can be found in many parts of the nervous system, suggesting that computation by distributed populations of cells is a general cortical feature. The population-vector method [4, 2] describes each cell's firing rate as the dot product between that cell's preferred direction and the direction of hand movement. The vector sum of preferred directions, weighted by the measured firing rates, is used both as a way of understanding what the cortical units encode and as a means for estimating the velocity vector.

Several recent studies [7, 8, 9] propose that neurons can represent or process multiple parameters simultaneously, suggesting that it is the dynamic organization of the activity in neuronal populations that may represent temporal properties of behavior, such as the computation of the transformation from 'desired action' in external coordinates to muscle activation patterns. Some studies [10, 11, 12] support the notion that neurons can associate and dissociate rapidly into functional groups in the process of performing a computational task. The concepts of simultaneous encoding of multiple parameters and dynamic representation in neuronal populations could together explain some of the conundrums in motor system physiology. These concepts also invite the use of increasingly complex models for relating neural activity to behavior. 
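The population-vector scheme described above can be sketched in a few lines. This is a minimal illustration with hypothetical cosine-tuned cells (the preferred directions, baseline, and rates are invented for the example, not data from the paper): each cell's rate above baseline weights that cell's preferred-direction vector, and the vector sum estimates the movement direction.

```python
import math

def population_vector(rates, preferred, baseline=0.0):
    """Estimate movement direction as the rate-weighted vector sum of
    the cells' preferred directions (angles given in radians)."""
    x = sum((r - baseline) * math.cos(p) for r, p in zip(rates, preferred))
    y = sum((r - baseline) * math.sin(p) for r, p in zip(rates, preferred))
    return math.atan2(y, x)  # direction of the summed vector

# Hypothetical example: four cosine-tuned cells with preferred directions
# at 0, 90, 180 and 270 degrees, and a movement at 45 degrees.
preferred = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
true_dir = math.radians(45)
rates = [10.0 + math.cos(p - true_dir) for p in preferred]  # baseline 10 Hz
estimate = population_vector(rates, preferred, baseline=10.0)
```

With preferred directions spread evenly, the estimate recovers the true direction exactly; with uneven coverage it is biased, which is one reason regression-based decoders are attractive.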
Advances in computing power and recent developments in physiological recording methods allow recording of ever growing numbers of cortical units that can be used for real-time analysis and modeling. These developments and new understandings have recently been used to reconstruct movements on the basis of neuronal activity in real time, in an effort to facilitate the development of hybrid brain-machine interfaces that allow interaction between living brain tissue and artificial electronic or mechanical devices to produce brain-controlled movements [13, 6, 14, 15, 11, 16, 17]. Current attempts at predicting movement from cortical activity rely on modeling techniques such as cosine-tuning estimation (population vector) [18], linear regression [15, 19] and artificial neural nets [15] (though this study reports getting better results by linear regression). A major deficiency of standard approaches is their poor ability to extract the relevant information from monitored brain activity efficiently enough to allow reducing the number of recorded channels and the recording time.

The paper is organized as follows. In Sec. 2 we describe the problem setting that this paper is concerned with. In Sec. 3 we introduce and explain the main mathematical tool that we use, namely, the kernel operator. In Sec. 4 we discuss the design and implementation of a biologically-motivated kernel for neural activities. We report experimental results in Sec. 5 and give conclusions in Sec. 6.

2 Problem setting

Consider the case where we monitor instantaneous spike rates from cortical units during physical motor behavior of a subject. Our goal is to learn a predictive model of some behavior parameter with the cortical activity as the input. Formally speaking, let $S$ be a sequence of instantaneous firing rates from $q$ cortical units, consisting of $m$ samples altogether. 
We use $S$ to denote sequences of firing rates and denote by $|S|$ the length of a sequence $S$. Let $S_t$ be the $t$-th sample (i.e., the vector of instantaneous firing rates) of a sequence $S$. We also use $S\,s$ to denote the concatenation of $S$ with one more sample $s$. We refer to the instantaneous firing rate of a unit $j$ by $s^j$. We also need to employ a notation for sub-sequences. The $k$-long prefix of $S$ is denoted $S^{1:k}$. Finally, throughout the work we need to examine substrings of sequences. We denote by $\mathbf{i}$ a vector of indices into the sequence $S$, where $\mathbf{i} = (i_1, \ldots, i_k)$ with $1 \le i_1 < i_2 < \cdots < i_k \le |S|$.

We also need to introduce some notation for the target variables we would like to predict. Let $Y$ denote some parameter of the movement that we would like to predict (e.g., the movement velocity in the $x$ direction, $v_x$). Our goal is to learn an approximation $f$ of the form $f : S \mapsto Y$ from neural firing rates to the movement parameter. In general, information about movement can be found in neural activity both before and after the time of the movement itself. Our plan, though, is to design a model that can be used for controlling a neural prosthesis. We therefore confine ourselves to causal predictors that use the prefix $S^{1:t}$ to predict $Y_t$. That is, we would like to make $f(S^{1:t})$ as close as possible (in a sense that is explained in the sequel) to $Y_t$.

3 Kernel methods for regression

A major mathematical notion employed in this paper is kernel operators. 
Kernel operators allow algorithms whose interface to the data is limited to scalar products to employ complicated premappings of the data into feature spaces by use of kernels. Formally, a kernel is an inner-product operator $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is some arbitrary vector space. An explicit way to describe $K$ is via a mapping $\phi : \mathcal{X} \to \mathcal{H}$ from the input space to a feature space $\mathcal{H}$ such that $K(x, x') = \phi(x) \cdot \phi(x')$.

Support Vector Regression minimizes Vapnik's [21] $\epsilon$-insensitive loss function, which defines a tube of width $\epsilon$ around the estimate. Examples that fall within its boundaries are considered well estimated and do not contribute to the error. Examples outside the tube contribute linearly to the loss. Say $\phi(x)$ is the feature vector implemented by the kernel $K(x, x')$. To estimate a linear (linear in feature space) regression $f(x) = w \cdot \phi(x) + b$ with precision $\epsilon$, one minimizes

$$\frac{1}{2}\|w\|^2 \;+\; C \sum_{i=1}^{\ell} \max\bigl(0,\; |y_i - f(x_i)| - \epsilon\bigr).$$

This can be written as a constrained minimization problem:

$$\begin{aligned} \text{minimize} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) \\ \text{subject to} \quad & y_i - w \cdot \phi(x_i) - b \le \epsilon + \xi_i, \\ & w \cdot \phi(x_i) + b - y_i \le \epsilon + \xi_i^*, \\ & \xi_i,\, \xi_i^* \ge 0 . \end{aligned}$$
By switching to the dual of this optimization problem, it is possible to incorporate the kernel function, achieving a mapping that may not be feasible by calculating (possibly infinite) feature vectors $\phi(x)$. For $i = 1, \ldots, \ell$, the dual problem is:

$$\begin{aligned} \text{maximize} \quad & -\frac{1}{2} \sum_{i,j} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j) \;-\; \epsilon \sum_i (\alpha_i + \alpha_i^*) \;+\; \sum_i y_i\, (\alpha_i - \alpha_i^*) \\ \text{subject to} \quad & \sum_i (\alpha_i - \alpha_i^*) = 0, \qquad \alpha_i,\, \alpha_i^* \in [0, C], \end{aligned}$$

and the resulting regression estimate is $f(x) = \sum_i (\alpha_i - \alpha_i^*)\, K(x_i, x) + b$.

In summary, SVM regression solves a quadratic optimization problem to find a hyperplane in the kernel-induced feature space that best estimates the data for an $\epsilon$-insensitive linear loss function.

4 Spikernels

The quality of SVM learning is highly dependent on how the data is embedded in the feature space via the kernel operator. For this reason, several studies have been devoted lately to developing new kernels [22, 23, 24]. In fact, for classification problems, a good kernel would render the work of the classification algorithm trivial. With this in mind, we develop a kernel for neural spiking activity.

4.1 Motivation

Our goal in developing a kernel for spike trains is to map similar patterns to nearby areas of the feature space. Current methods for predicting response variables from neural activities use standard linear regression techniques (see for instance [15]) or even replace the time pattern with mean firing rates. A notable example is the population vector method [18]. 
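To make concrete why collapsing a time pattern to a mean firing rate loses information, here is a tiny sketch (the spike-count sequences are hypothetical, chosen for illustration): two patterns with identical mean rates but opposite temporal profiles are indistinguishable to a mean-rate feature, while even a simple bin-by-bin comparison separates them.

```python
def mean_rate(pattern):
    """Collapse a binned spike-count sequence to its mean firing rate."""
    return sum(pattern) / len(pattern)

def binwise_dist(a, b):
    """Sum of bin-by-bin absolute differences between two sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))

rising  = [0, 1, 2, 3, 4, 5]   # rate ramps up toward the time of interest
falling = [5, 4, 3, 2, 1, 0]   # rate ramps down

same_mean = (mean_rate(rising) == mean_rate(falling))  # True: means identical
dist = binwise_dist(rising, falling)                   # large: 18
```

A mean-rate decoder sees these two patterns as the same input; a representation that keeps the temporal structure does not, which is the gap the Spikernel is designed to fill.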
Other approaches use off-the-shelf learning algorithms intended for general purposes. In the description of our kernel we attempt to capture some well-accepted notions of similarity between spike trains. We make the following assumptions regarding similarities between spike patterns:

Figure 1: Illustrative examples of pattern similarities. Left: bin-by-bin comparison yields small differences. Middle: patterns with large bin-by-bin differences that can be eliminated with some time warping. Right: patterns whose suffix (time of interest) is similar and prefix is different.

- The most commonly made assumption is that similar firing patterns may have small differences in a bin-by-bin comparison. This type of variation is due to the inherent noise of any physical system, but also to responses to external factors that were not recorded and are not directly related to the task performed. On the left-hand side of Fig. 1 we show an example of two patterns that are bin-wise similar though clearly not identical.
- A cortical population may display highly specific patterns to represent specific information. 
It is conceivable that some features of external stimuli are represented by population dynamics that would be best described as 'temporal' coding.
- Two patterns may be quite different in a simple bin-wise comparison, but if they are aligned by some non-linear time distortion or shifting, the similarity becomes apparent. An illustration of such patterns is given in the middle plots of Fig. 1. In comparing patterns we would like to induce a higher score when the time-shifts are small.
- Patterns that are associated with identical values of an external stimulus at time $t$ may be similar at that time but very different at $t - \Delta$, when the values of the external stimulus for these patterns are no longer similar (as illustrated on the right-hand side of Fig. 1).

4.2 Kernel definition

We describe the kernel by specifying the features that make up the feature space. Our construction of the feature space builds on the work of Lodhi et al. [24]. First, we need to introduce a few more notations. Let $S$ be a sequence of length $|S|$. The set of all possible $n$-long index vectors defining a sub-sequence of $S$ is $I_{n,|S|} = \{\mathbf{i} : 1 \le i_1 < \cdots < i_n \le |S|\}$. Also, let $d(s, u)$ denote a bin-wise distance over a pair of samples (firing rates). We also overload notation and denote by $d(S_{\mathbf{i}}, U)$ a distance between sequences; the sequence distance is the sum of $d$ over the samples constituting the two sequences. 
The component of our (infinite) feature vector $\phi(S)$ indexed by a sequence $U$ of length $n$ is defined as

$$\phi^n_U(S) \;=\; C \sum_{\mathbf{i} \in I_{n,|S|}} \lambda^{|S| - i_1 + 1}\; \mu^{d(S_{\mathbf{i}}, U)}, \qquad (1)$$

where $C$ is a normalization constant that simplifies the calculation, $0 < \lambda, \mu < 1$, and $i_1$ is the first index of $\mathbf{i}$. In words, $\phi^n_U(S)$ is a sum over all $n$-long sub-sequences of $S$. Each sub-sequence $S_{\mathbf{i}}$ is compared to $U$ (the feature coordinate) and is weighted up according to its similarity to $U$. In particular, part of the weight of each sub-sequence of $S$ reflects how concentrated the sub-sequence is toward the end of $S$. Put another way, the entry indexed by $U$ measures how close $U$ is to the time series $S$ near its end.

This definition seems to fit our assumptions on neural coding for the following reasons:

- It allows for complex patterns: small values of $\lambda$ and $\mu$ (or concentrated measures) mean that each coordinate $\phi^n_U(S)$ tends toward being either 1 or 0, depending on whether $U$ is almost identical to a suffix of $S$ or not.
- Patterns that are piece-wise similar to $U$ contribute to the $U$ feature coordinate with a weight that decays as the sample-by-sample comparison between the sequences grows large.
- We allow gaps in the indexes defining sub-sequences, thus allowing for time warping.
- Patterns that begin further from the required prediction time are penalized by an exponentially decaying weight.

4.3 Efficient kernel calculation

The definition of $\phi$ given 
by Eq. (1) requires the manipulation of an infinite feature space. Straightforward calculation of the feature values and performing the induced inner-product is clearly impossible. Based on ideas from [24] we developed an indirect method for evaluating the kernel through a recursion which can be performed efficiently using dynamic programming. We now describe the recursion.

Denote by $s$ the last entry in the sequence $S\,s$. We now describe two recursive equations for $\phi$ with respect to the length of the time series and the sub-sequence length. Due to the lack of space we skip some of the algebraic manipulations that are needed to derive the recursions. The first equation is

$$\phi^n_U(S\,s) \;=\; \lambda\, \phi^n_U(S) \;+\; \lambda\, \mu^{d(s,\, U_n)}\; \phi^{n-1}_{U^{1:n-1}}(S). \qquad (2)$$

Eq. (2) simply separates the sum over sub-sequences of $S\,s$ into two subsets: one where $s$ is not specified by the index vectors and the other where $i_n$ specifies $s$. The second recursive equation is, again, with respect to both the length of the sub-sequence $n$ and the length of the sequence $T$,

$$\phi^n_U(T) \;=\; \sum_{j=n}^{|T|} \lambda^{|T| - j + 1}\; \mu^{d(T_j,\, U_n)}\; \phi^{n-1}_{U^{1:n-1}}(T^{1:j-1}). \qquad (3)$$

The last equation simply states that the feature is a sum over all possible values of $i_n$. Note that the sum in Eq. (3) is empty when $|T| < n$. Eqs. (2) and (3) are now used for computing the recursion equation for the kernel $K_n$: we plug Eq. (2) into $\phi^n_U(S\,s)$ and plug Eq. (3) into $\phi^n_U(T)$. 
Using algebraic manipulations we replace integrals over scalar products of $\phi$ by the proper kernels and get the following recursive function,

$$K_n(S\,s,\, T) \;=\; \lambda\, K_n(S, T) \;+\; \lambda \sum_{j=n}^{|T|} \lambda^{|T| - j + 1} \left[ \int \mu^{d(s,\, u)}\, \mu^{d(T_j,\, u)}\, du \right] K_{n-1}(S,\, T^{1:j-1}). \qquad (4)$$

Assuming that the computation time of the integral in Eq. (4) is a constant, computing the entire kernel is efficient if we cache the term on the right-hand side of Eq. (4) as follows. Define,
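The dynamic-programming evaluation described in this section can be sketched directly. This is a reconstruction under stated assumptions, not the authors' code: the soft-match integral is replaced by an explicit bin similarity `g(s, t) = exp(-(s - t)**2)`, the normalization constant is dropped, and `LAM = 0.5` is an arbitrary choice of the decay parameter. A naive enumeration over all index-vector pairs is included as a correctness check of the recursion.

```python
import math
from functools import lru_cache
from itertools import combinations

LAM = 0.5  # temporal decay parameter, 0 < lambda < 1 (illustrative value)

def g(s, t):
    """Hypothetical bin similarity, standing in for the soft-match
    integral over feature coordinates in the kernel recursion."""
    return math.exp(-(s - t) ** 2)

def spikernel(S, T, n):
    """K_n(S, T) via dynamic programming over prefix lengths."""
    S, T = tuple(S), tuple(T)

    @lru_cache(maxsize=None)
    def K(m, ls, lt):
        if m == 0:
            return 1.0          # empty-subsequence convention
        if ls < m or lt < m:
            return 0.0          # not enough samples left to match
        # Split on whether the last sample of S's prefix is used (i_m = ls).
        total = LAM * K(m, ls - 1, lt)
        for j in range(m, lt + 1):           # 1-based position in T's prefix
            total += (LAM ** (lt - j + 2)) * g(S[ls - 1], T[j - 1]) \
                     * K(m - 1, ls - 1, j - 1)
        return total

    return K(n, len(S), len(T))

def spikernel_brute(S, T, n):
    """Direct enumeration of all n-long index-vector pairs, with weight
    lambda^(|S| - i_1 + 1) * lambda^(|T| - j_1 + 1) and bin-wise matches."""
    total = 0.0
    for i in combinations(range(len(S)), n):
        for j in combinations(range(len(T)), n):
            w = LAM ** (len(S) - i[0]) * LAM ** (len(T) - j[0])
            for r in range(n):
                w *= g(S[i[r]], T[j[r]])
            total += w
    return total
```

The memoized recursion runs in O(n |S| |T|^2); caching the inner sum over j, as the text suggests for Eq. (4), removes another factor of |T|.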