{"title": "On the Analysis of Multi-Channel Neural Spike Data", "book": "Advances in Neural Information Processing Systems", "page_first": 936, "page_last": 944, "abstract": "Nonparametric Bayesian methods are developed for analysis of multi-channel spike-train data, with the feature learning and spike sorting performed jointly. The feature learning and sorting are performed simultaneously across all channels. Dictionary learning is implemented via the beta-Bernoulli process, with spike sorting performed via the dynamic hierarchical Dirichlet process (dHDP), with these two models coupled. The dHDP is augmented to eliminate refractoryperiod violations, it allows the \u201cappearance\u201d and \u201cdisappearance\u201d of neurons over time, and it models smooth variation in the spike statistics.", "full_text": "On the Analysis of Multi-Channel Neural Spike Data\n\nDepartment of Electrical and Computer Engineering, Duke University, Durham, NC 27708\n\nBo Chen, David E. Carlson and Lawrence Carin\n\n{bc69, dec18, lcarin}@duke.edu\n\nAbstract\n\nNonparametric Bayesian methods are developed for analysis of multi-channel\nspike-train data, with the feature learning and spike sorting performed jointly.\nThe feature learning and sorting are performed simultaneously across all chan-\nnels. Dictionary learning is implemented via the beta-Bernoulli process, with\nspike sorting performed via the dynamic hierarchical Dirichlet process (dHDP),\nwith these two models coupled. The dHDP is augmented to eliminate refractory-\nperiod violations, it allows the \u201cappearance\u201d and \u201cdisappearance\u201d of neurons over\ntime, and it models smooth variation in the spike statistics.\n\n1\n\nIntroduction\n\nThe analysis of action potentials (\u201cspikes\u201d) from neural-recording devices is a problem of long-\nstanding interest (see [21, 1, 16, 22, 8, 4, 6] and the references therein). 
In such research one is typically interested in clustering (sorting) the spikes, with the goal of linking a given cluster to a particular neuron. Such technology is of interest for brain-machine interfaces and for gaining insight into the properties of neural circuits [14]. In such research one typically (i) filters the raw sensor readings, (ii) performs thresholding to "detect" the spikes, (iii) maps each detected spike to a feature vector, and (iv) then clusters the feature vectors [12]. Principal component analysis (PCA) is a popular choice [12] for feature mapping. After performing such sorting, one typically must (v) search for refractory-time violations [5], which occur when two or more spikes that are sufficiently proximate are improperly associated with the same cluster/neuron (which is impossible due to the refractory time delay required for the same neuron to re-emit a spike). Recent research has combined (iii) and (iv) within a single model [6], and methods have been developed recently to address (v) while performing (iv) [5].

Many of the early methods for spike sorting were based on classical clustering techniques [12] (e.g., K-means and GMMs, with a fixed number of mixtures), but recently Bayesian methods have been developed to account for more modeling sophistication. For example, in [5] the authors employed a modification to the Chinese restaurant formulation of the Dirichlet process (DP) [3] to automatically infer the number of clusters (neurons) present, allow statistical drift in the feature statistics, permit the "appearance"/"disappearance" of neurons with time, and automatically account for refractory-time requirements within the clustering (not as a post-clustering step).
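To make steps (ii)-(v) of this conventional pipeline concrete, the following is a minimal sketch: threshold detection of spike snippets, PCA feature mapping, and a refractory-violation check. All function names, thresholds, window lengths, and the sampling rate are illustrative assumptions, not the method developed in this paper.

```python
import numpy as np

def detect_spikes(signal, thresh, window=30, align=10):
    """Step (ii): negative threshold crossings -> fixed-length snippets near the peak."""
    idx = np.where(signal < -thresh)[0]
    spikes, times, last = [], [], -window
    for i in idx:
        if i - last < window:            # collapse samples from the same event
            continue
        seg = signal[max(i - align, 0): i - align + window]
        if len(seg) == window:
            spikes.append(seg)
            times.append(i)
            last = i
    return np.array(spikes), np.array(times)

def pca_features(spikes, n_pc=2):
    """Step (iii): project each snippet onto the first n_pc principal components."""
    X = spikes - spikes.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_pc].T

def refractory_violations(times, labels, fs, refractory_ms=2.0):
    """Step (v): count same-cluster spike pairs closer than the refractory period."""
    viol = 0
    for k in np.unique(labels):
        t = np.sort(times[labels == k]) / fs * 1000.0   # sample index -> ms
        viol += int(np.sum(np.diff(t) < refractory_ms))
    return viol
```

Step (iv) (clustering the PCA features, e.g., with K-means or a GMM) would sit between `pca_features` and `refractory_violations`; the point of the check is that any cluster containing inter-spike intervals shorter than the refractory period cannot correspond to a single neuron.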
However, [5] assumed that the spike features were provided via PCA in the first two or three principal components (PCs). In [6] feature learning and spike sorting were performed jointly via a mixture of factor analyzers (MFA) formulation. However, in [6] model selection was performed (for the number of features and number of neurons) and a maximum likelihood (ML) "point" estimate was constituted for the model parameters; since a fixed number of clusters are inferred in [6], the model does not directly allow for the "appearance"/"disappearance" of neurons, or for any temporal dependence to the spike statistics.

There has been an increasing interest in developing neural devices with C > 1 recording channels, each of which produces a separate electrical recording of neural activity. Recent research shows increased system performance with large C [18].

[Figure 1: four scatter plots of the spikes in the first two principal components (axes PC-1 vs. PC-2): (a) Ground Truth, with known and unknown neurons marked; (b) K-means; (c) GMM; (d) HDP-DL.]

Figure 1: Comparison of spike sorting on real data. (a) Ground truth; (b) K-means clustering on the first 2 principal components; (c) GMM clustering with the first 2 principal components; (d) proposed method.
Arrows mark example spikes that K-means and the GMM miss, but that the proposed method sorts properly.

Almost all of the above research on spike sorting has been performed on a single channel, or, when multiple channels are present, each is typically analyzed in isolation. In [5] C = 4 channels were considered, but it was assumed that a spike occurred at the same time (or nearly the same time) across all channels, and the features from the four channels were concatenated, effectively reducing this again to a single-channel analysis. When C ≫ 1, the assumption that a given neuron is observed simultaneously on all channels is typically inappropriate, and in fact the diversity of neuron sensing across the device is desired, to enhance functionality [18].

This paper addresses the multi-channel neural-recording problem, under conditions for which concatenation may be inappropriate; the proposed model generalizes the DP formulation of [5], with a hierarchical DP (HDP) formulation [20]. In this formulation statistical strength is shared across the channels, without assuming that a given neuron is simultaneously viewed across all channels. Further, the model generalizes the HDP, via a dynamic HDP (dHDP) [17], to allow the "appearance"/"disappearance" of neurons, while also allowing smooth changes in the statistics of the neurons. Further, we explicitly account for refractory times, as in [5].
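To illustrate how an HDP-style construction shares statistical strength across channels, the sketch below draws truncated stick-breaking weights for a global measure and then channel-level re-weightings of the same shared atoms, using a finite Dirichlet approximation to the channel-level DP draws. The truncation level, concentration parameters, and the Dirichlet approximation are choices made here for illustration; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking weights: pi_i = V_i * prod_{h<i} (1 - V_h)."""
    V = rng.beta(1.0, alpha, size=K)
    V[-1] = 1.0                   # close the stick at the truncation level
    rem = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * rem

K = 50
beta = stick_breaking(alpha=1.0, K=K, rng=rng)   # global weights over shared atoms
# Channel-level DPs re-weight the SAME global atoms; a finite approximation to
# pi^(c) ~ DP(alpha_c, beta) is a Dirichlet draw (small floor added for stability):
pis = [rng.dirichlet(5.0 * beta + 0.01) for _ in range(4)]
```

Because every channel's weights are defined over the same global atoms, a neuron (atom) prominent on one channel remains available, with its own weight, on every other channel; that is the sharing mechanism, without forcing a neuron to appear on all channels.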
We also perform joint feature learning and clustering, using a mixture of factor analyzers construction as in [6], but we do so in a fully Bayesian, multi-channel setting (additionally, [6] did not account for time-varying statistics). The learned factor loadings are found to be similar to wavelets, but they are matched to the properties of neuron spikes; this is in contrast to previous feature extraction on spikes [11] based on orthogonal wavelets, which are not necessarily matched to neuron properties.

To give a preview of the results, providing a sense of the importance of feature learning (relative to mapping data into PCA features learned offline), in Figure 1 we show a comparison of clustering results on the first channel of the d533101 data from hc-1 [7]. For all cases in Figure 1 the data are depicted in the first two PCs for visualization, but the proposed method in (d) learns the number of features and their composition, while simultaneously performing clustering. The results in (b) and (c) correspond respectively to the widely employed K-means and GMM analyses, based on using two PCs (in these cases the analyses are performed in PCA space, as have been many more-advanced approaches [5]). From Figures 1(b) and (c), we observe that both K-means and GMM work well, but due to the constrained feature space they incorrectly classify some spikes (marked by arrows). However, the proposed model, shown in Figure 1(d), which incorporates dictionary learning with spike sorting, infers an appropriate feature space (not shown) and more effectively clusters the neurons. The details of this model, including a multi-channel extension, are discussed in detail below.

2 Model Construction

2.1 Dictionary learning

We initially assume that spike detection has been performed on all channels. Spike n ∈ {1, . . . , N_c} on channel c ∈ {1, . . .
, C} is a vector x_n^(c) ∈ R^D, defined by D time samples for each spike, centered at the peak of the detected signal; there are N_c spikes on channel c.

Data from spike n on channel c, x_n^(c), is represented in terms of a dictionary D ∈ R^(D×K), where K is an upper bound on the number of needed dictionary elements (columns of D), and the model infers the subset of dictionary elements needed to represent the data. Each x_n^(c) is represented as

x_n^(c) = D Λ^(c) s_n^(c) + ε_n^(c)    (1)

where Λ^(c) = diag(λ_1^(c) b_1, λ_2^(c) b_2, . . . , λ_K^(c) b_K) is a diagonal matrix, with b = (b_1, . . . , b_K)^T ∈ {0,1}^K. Defining d_k as the kth column of D, and letting I_D represent the D × D identity matrix, the priors on the model parameters are

d_k ∼ N(0, (1/D) I_D),   λ_k^(c) ∼ TN+(0, 1/γ_c),   ε_n^(c) ∼ N(0, Σ_c^(-1))    (2)

where Σ_c = diag(η_1^(c), . . . , η_D^(c)), and TN+(·) represents the truncated (positive) normal distribution. Gamma priors (detailed when presenting results) are placed on γ_c and on each of the elements of (η_1^(c), . . . , η_D^(c)). For the binary vector b we impose the prior b_k ∼ Bernoulli(π_k), with π_k ∼ Beta(a/K, b(K−1)/K), implying that the number of non-zero components of b is drawn Binomial(K, a/(a + b(K−1))); this corresponds to Poisson(a/b) in the limit K → ∞. Parameters a and b are set to favor a sparse b.

This model imposes that each x_n^(c) is drawn from a linear subspace, defined by the columns of D with corresponding non-zero components in b; the same linear subspace is shared across all channels c ∈ {1, . . . , C}. However, the strength with which a column of D contributes toward x_n^(c) depends on the channel c, as defined by Λ^(c).
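A generative sketch of equations (1)-(2): sample the shared sparse support b from its beta-Bernoulli prior, a Gaussian dictionary, and positive channel weights (a half-normal draw, which is the same distribution as TN+(0, ·)). The dimensions and hyperparameter values are illustrative, and the factor scores s are drawn N(0, I) here purely to synthesize a spike; in the model itself the s_n^(c) are what the dHDP clusters.

```python
import numpy as np

rng = np.random.default_rng(1)
Ddim, K = 40, 30            # time samples per spike, dictionary-size upper bound
a, b = 1.0, 1.0             # beta-Bernoulli hyperparameters, set to favor sparse b

# Shared sparse support: pi_k ~ Beta(a/K, b(K-1)/K), b_k ~ Bernoulli(pi_k)
pi = rng.beta(a / K, b * (K - 1) / K, size=K)
bvec = rng.binomial(1, pi)

# Dictionary columns d_k ~ N(0, I/Ddim)
Dmat = rng.normal(0.0, 1.0 / np.sqrt(Ddim), size=(Ddim, K))

def draw_spike(noise_std=0.05):
    """Synthesize one spike via eq. (1): x = D Lambda^(c) s + eps."""
    lam = np.abs(rng.normal(0.0, 1.0, size=K))   # half-normal draw, i.e., TN+(0, 1)
    Lam = np.diag(lam * bvec)                    # Lambda^(c) = diag(lam_k * b_k)
    s = rng.normal(size=K)                       # illustrative stand-in for s_n^(c)
    eps = rng.normal(0.0, noise_std, size=Ddim)
    return Dmat @ Lam @ s + eps
```

Note that bvec is drawn once and reused for every channel and spike, which is exactly what makes the active linear subspace shared across channels, while Lambda^(c) differs per channel.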
Concerning Λ^(c), rather than explicitly imposing a sparse diagonal via b, we may also draw λ_k^(c) ∼ TN+(0, 1/γ_ck), with shrinkage priors employed on the γ_ck (i.e., with the γ_ck drawn from a gamma prior that favors large γ_ck, which encourages many of the diagonal elements of Λ^(c) to be small, but typically not exactly zero). In tests, the model performed similarly when shrinkage priors were used on Λ^(c) relative to explicit imposition of sparseness via b; all results below are based on the latter construction.

2.2 Multi-Channel Dynamic Hierarchical Dirichlet Process

We sort the spikes on the channels by clustering the {s_n^(c)}, and in this sense feature design (learning {D Λ^(c)}) and sorting are performed simultaneously. We first discuss how this may be performed via a hierarchical Dirichlet process (HDP) construction [20], and then extend this via a dynamic HDP (dHDP) [17] considering multiple channels. In an HDP construction, the {s_n^(c)} are modeled as being drawn

s_n^(c) ∼ f(θ_n^(c)),   θ_n^(c) ∼ G^(c),   G^(c) ∼ DP(α_c G),   G ∼ DP(α_0 G_0)    (3)

where a draw from, for example, DP(α_0 G_0) may be constructed [19] as G = Σ_{i=1}^∞ π_i δ_{θ_i*}, where π_i = V_i Π_{h<i} (1 − V_h)