{"title": "Bayesian Spike-Triggered Covariance Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1692, "page_last": 1700, "abstract": "Neurons typically respond to a restricted number of stimulus features within the high-dimensional space of natural stimuli. Here we describe an explicit model-based interpretation of traditional estimators for a neuron's multi-dimensional feature space, which allows for several important generalizations and extensions. First, we show that traditional estimators based on the spike-triggered average (STA) and spike-triggered covariance (STC) can be formalized in terms of the \"expected log-likelihood\" of a Linear-Nonlinear-Poisson (LNP) model with Gaussian stimuli. This model-based formulation allows us to define maximum-likelihood and Bayesian estimators that are statistically consistent and efficient in a wider variety of settings, such as with naturalistic (non-Gaussian) stimuli. It also allows us to employ Bayesian methods for regularization, smoothing, sparsification, and model comparison, and provides Bayesian confidence intervals on model parameters. We describe an empirical Bayes method for selecting the number of features, and extend the model to accommodate an arbitrary elliptical nonlinear response function, which results in a more powerful and more flexible model for feature space inference. We validate these methods using neural data recorded extracellularly from macaque primary visual cortex.", "full_text": "Bayesian Spike-Triggered Covariance Analysis\n\nIl Memming Park\n\nCenter for Perceptual Systems\nUniversity of Texas at Austin\n\nAustin, TX 78712, USA\n\nmemming@austin.utexas.edu\n\nJonathan W. Pillow\n\nCenter for Perceptual Systems\nUniversity of Texas at Austin\n\nAustin, TX 78712, USA\n\npillow@mail.utexas.edu\n\nAbstract\n\nNeurons typically respond to a restricted number of stimulus features within the\nhigh-dimensional space of natural stimuli. 
Here we describe an explicit model-\nbased interpretation of traditional estimators for a neuron\u2019s multi-dimensional\nfeature space, which allows for several important generalizations and extensions.\nFirst, we show that traditional estimators based on the spike-triggered average\n(STA) and spike-triggered covariance (STC) can be formalized in terms of the \u201cex-\npected log-likelihood\u201d of a Linear-Nonlinear-Poisson (LNP) model with Gaussian\nstimuli. This model-based formulation allows us to de\ufb01ne maximum-likelihood\nand Bayesian estimators that are statistically consistent and ef\ufb01cient in a wider\nvariety of settings, such as with naturalistic (non-Gaussian) stimuli. It also allows\nus to employ Bayesian methods for regularization, smoothing, sparsi\ufb01cation, and\nmodel comparison, and provides Bayesian con\ufb01dence intervals on model parame-\nters. We describe an empirical Bayes method for selecting the number of features,\nand extend the model to accommodate an arbitrary elliptical nonlinear response\nfunction, which results in a more powerful and more \ufb02exible model for feature\nspace inference. We validate these methods using neural data recorded extracel-\nlularly from macaque primary visual cortex.\n\n1\n\nIntroduction\n\nA central problem in systems neuroscience is to understand the probabilistic relationship between\nsensory stimuli and neural responses. Most neurons in the early sensory pathway are only sensitive\nto a low-dimensional space of stimulus features, and ignore the other axes in the high-dimensional\nspace of stimuli. Dimensionality reduction therefore plays an important role in neural characteriza-\ntion. The most popular dimensionality-reduction method for neural data uses the \ufb01rst two moments\nof the spike-triggered stimulus distribution: the spike-triggered average (STA) and the eigenvectors\nof the spike-triggered covariance (STC) [1\u20135]. 
These features are interpreted as \ufb01lters or \u201crecep-\ntive \ufb01elds\u201d that form the \ufb01rst stage in a linear-nonlinear-Poisson (LNP) cascade model [6,7]. In this\nmodel, stimuli are projected onto a bank of linear \ufb01lters, whose outputs are combined via a nonlinear\nfunction, which drives spiking as an inhomogeneous Poisson process (see Fig. 1).\nPrior work has established the conditions for statistical consistency and ef\ufb01ciency of the STA and\nSTC as feature space estimators [1, 2, 8, 9]. However, these moment-based estimators have not yet\nbeen interpreted in terms of an explicit probabilistic encoding model. We formalize that relationship\nhere, building on a recent information-theoretic treatment of spike-triggered average and covariance\nanalysis (iSTAC) [9]. Our general approach is inspired by probabilistic and Bayesian formulations\nof principal components analysis (PCA) and extreme components analysis (XCA), moment-based\nmethods for linear dimensionality reduction that are closely related to STC analysis, but which were\nonly more recently formulated in terms of an explicit probabilistic model [10\u201314].\n\n1\n\n\fFigure 1: Schematic of linear-nonlinear-Poisson (LNP) neural encoding model [6].\n\nHere we show, \ufb01rst of all, that STA and STC arise naturally from the expected log-likelihood of an\nLNP model with an \u201cexponentiated-quadratic\u201d nonlinearity, where expectation is taken with respect\nto a Gaussian stimulus distribution. This insight allows us to formulate exact maximum-likelihood\nestimators that apply to arbitrary stimulus distributions. We then introduce Bayesian methods for\nregularizing and smoothing receptive \ufb01eld estimates, and an approximate empirical Bayes method\nfor selecting the feature space dimensionality, which obviates nested hypothesis tests, bootstrapping,\nor cross-validation based methods [5]. 
Finally, we generalize these estimators to accommodate LNP models with arbitrary elliptically symmetric nonlinearities. The resulting model class provides a richer and more flexible model of neural responses but can still recover a high-dimensional feature space (unlike more general information-theoretic estimators [8, 15], which do not scale easily to more than 2 filters). We apply these methods to a variety of simulated datasets and to responses from neurons in macaque primary visual cortex stimulated with binary white noise stimuli [16].

2 Model-based STA and STC

In a typical neural characterization experiment, the experimenter presents a train of rapidly varying sensory stimuli and records a spike train response. Let x denote a D-dimensional vector containing the spatio-temporal stimulus affecting a neuron's scalar spike response y in a single time bin. A principal goal of neural characterization is to identify B, a low-dimensional projection matrix such that B^T x captures the neuron's dependence on the stimulus x. The columns of B can be regarded as linear receptive fields that provide a basis for the neural feature space.
The methods we consider here all assume that neural responses can be described by an LNP cascade model (Fig. 1). Under this model, the conditional probability of a response y|x is Poisson with rate f(B^T x), where f is a nonlinear function mapping feature space to instantaneous spike rate.¹

2.1 STA and STC analysis

The STA and the STC matrix are the (empirical) first and second moments, respectively, of the spike-triggered stimulus ensemble {x_i | y_i}_{i=1}^N. They are defined as:

STA: μ = (1/n_sp) Σ_{i=1}^N y_i x_i,   and   STC: Λ = (1/n_sp) Σ_{i=1}^N y_i (x_i − μ)(x_i − μ)^T,   (1)

where n_sp = Σ_i y_i is the number of spikes and N is the total number of time bins. 
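As a concrete illustration, the moments in eq. (1) can be computed in a few lines. This is a minimal sketch; the simulated neuron and its parameters are illustrative, not taken from the paper:

```python
import numpy as np

def sta_stc(X, y):
    """Empirical STA and STC of eq. (1): spike-count-weighted first and
    second moments of the spike-triggered stimulus ensemble."""
    nsp = y.sum()                            # n_sp = sum_i y_i
    mu = (y @ X) / nsp                       # STA (eq. 1, left)
    Xc = X - mu
    Lam = (Xc * y[:, None]).T @ Xc / nsp     # STC (eq. 1, right)
    return mu, Lam

# Toy LNP neuron with one symmetric (quadratic) feature along the first axis.
rng = np.random.default_rng(0)
N, D = 20000, 8
X = rng.standard_normal((N, D))              # white Gaussian stimuli
y = rng.poisson(0.1 * np.exp(0.25 * X[:, 0] ** 2))

mu, Lam = sta_stc(X, y)
evals, evecs = np.linalg.eigh(Lam)
# The largest STC eigenvalue is ~1/(1 - 0.5) = 2 along the true feature axis,
# while the STA is near zero because the nonlinearity is symmetric.
```

Because the toy nonlinearity is even in the stimulus, only the STC (not the STA) reveals the feature, matching the division of labor described in the text.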
Traditional STA/STC analysis provides an estimate for the feature space basis consisting of: (1) μ, if it is significantly different from zero; and (2) the eigenvectors of Λ whose eigenvalues are significantly smaller or larger than those of the prior stimulus covariance Σ = E[xx^T]. This estimate is provably consistent only in the case of stimuli drawn from a spherically symmetric (for STA) or independent Gaussian distribution (for STC) [17].²

¹Here f has units of spikes/bin, for some fixed bin size Δ. In the limit Δ → 0, the model output is an inhomogeneous Poisson process, but we use discrete time bins here for concreteness.
²For elliptically symmetric or colored Gaussian stimuli, a consistent estimate requires whitening the stimuli by Σ^{-1/2} and then multiplying the estimated features (STA and STC eigenvectors) again by Σ^{-1/2} (see [5]).

2.2 Equivalent model-based formulation

Motivated by [9], we consider an LNP model where the spike rate is defined by an exponentiated general quadratic function:

f(x) = exp(½ x^T C x + b^T x + a),   (2)

where C is a symmetric matrix, b is a vector, and a is a scalar. Then the log-likelihood per spike, the conditional log-probability of the data divided by the number of spikes, is

L = (1/n_sp) Σ_i log P(y_i | C, b, a, x_i) = (1/n_sp) Σ_i (y_i log f(x_i) − f(x_i))   (3)
  = ½ Tr[CΛ] + ½ μ^T C μ + b^T μ + a − (N/n_sp) e^a [ (1/N) Σ_i exp(½ x_i^T C x_i + b^T x_i) ].   (4)

If the stimuli are drawn from x ∼ N(0, Σ), a zero-mean Gaussian with covariance Σ, then the expression in square brackets (eq. 4) will converge to its expectation, given by:

E[ exp(½ x^T C x + b^T x) ] = |I − ΣC|^{-1/2} exp( ½ b^T (Σ^{-1} − C)^{-1} b ),   (5)

so long as (Σ^{-1} − C) is invertible and positive definite.³ Substituting this expectation (eq. 5) into the log-likelihood (eq. 
4) yields a quantity we call the expected log-likelihood L̃, which can be expressed in terms of the STA, STC, Σ, and the model parameters:

L̃ = ½ Tr[CΛ] + ½ μ^T C μ + b^T μ + a − (N/n_sp) |I − ΣC|^{-1/2} exp( ½ b^T (Σ^{-1} − C)^{-1} b + a ).   (6)

Maximizing this expression yields expected-ML estimates (see online supplement for derivation):

C̃_ml = Σ^{-1} − Λ^{-1},   b̃_ml = Λ^{-1} μ,   ã_ml = log( (n_sp/N) |ΣΛ^{-1}|^{1/2} ) − ½ μ^T Λ^{-1} μ.   (7)

Thus, for an LNP model with exponentiated-quadratic nonlinearity stimulated with Gaussian noise, the (expected) maximum likelihood estimates can be obtained in closed form from the STA, STC, stimulus covariance, and mean spike rate n_sp/N.
Several features of this solution are worth remarking. First, if the quadratic component C = 0, then b̃_ml = Σ^{-1} μ, the whitened STA (as in [17]). Second, if the stimuli are white, meaning Σ = I, then C̃_ml = I − Λ^{-1}, which has the same eigenvectors as the STC matrix. Third, if we plug the expected-ML estimates back into the log-likelihood, we get

L̃ = ½ ( Tr[Σ^{-1}Λ] + μ^T Σ^{-1} μ − log |Σ^{-1}Λ| ) + const,   (8)

which (for Σ = I) is the information-theoretic spike-triggered average and covariance (iSTAC) cost function [9]. The iSTAC estimator finds the subspace that maximizes the "single-spike information" [18] under a Gaussian model of the raw and spike-triggered stimulus distributions (that coincides with (eq. 
8)), but its precise relationship to maximum likelihood has not been shown previously.

2.3 Generalizing to non-Gaussian stimuli

The conditions for which the STA and STC provide asymptotically efficient estimators for a neural feature space are clear from the derivations above: if the stimuli are Gaussian (a condition which is rarely if ever met in practice), the STA is optimal when the nonlinearity is f(x) = exp(b^T x + a) (as shown in [8]); the STC is optimal when f(x) = exp(x^T C x + a) (as shown in [9]).
However, the maximum of the exact model log-likelihood (eq. 4) yields a consistent and asymptotically efficient estimator even when stimuli are not Gaussian. Numerically optimizing this loss function is computationally more expensive than computing the STA and STC eigendecomposition, but the log-likelihood is jointly concave in the model parameters (C, b, a), meaning ML estimates can be obtained rapidly by convex optimization [19].

³If it is not, then this expectation does not exist, and simulations of the corresponding model will produce impossibly high spike counts, with STA and STC dominated by the response to a single stimulus.

For cases where x is high-dimensional, it is easier to directly estimate a low-rank representation of C, rather than optimize the entire D × D matrix. We therefore define a rank-d representation for C:

C = Σ_{i=1}^d w_i s_i w_i^T = W S W^T,   (9)

where W is a matrix whose columns w_i are features, s_i ∈ {−1, +1} are constants that control the shape of the nonlinearity along each axis in feature space (−1 for suppressive, +1 for excitatory), and S is a diagonal matrix containing the s_i along the diagonal. (We will assume the s_i are fixed using the sign of the eigenvalues of the expected-ML estimate C̃_ml, and not varied thereafter.)
The feature space of the resulting model is spanned by b and the columns of W. 
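To make the exact-ML route concrete, here is a minimal sketch (simulation parameters, seeds, and variable names are ours, purely illustrative): fit (a, b, C) of the exponentiated-quadratic LNP by minimizing the exact negative Poisson log-likelihood, which is convex, then read a feature off the eigenvectors of the fitted C.

```python
import numpy as np
from scipy.optimize import minimize

def unpack(theta, D):
    """Unpack (a, b, upper triangle of symmetric C) from a flat vector."""
    a, b = theta[0], theta[1:1 + D]
    C = np.zeros((D, D))
    iu = np.triu_indices(D)
    C[iu] = theta[1 + D:]
    C = C + C.T - np.diag(np.diag(C))
    return a, b, C

def negloglik(theta, X, y):
    """Exact negative Poisson log-likelihood for f(x) = exp(0.5 x'Cx + b'x + a);
    this objective is convex in (a, b, C)."""
    a, b, C = unpack(theta, X.shape[1])
    z = 0.5 * np.einsum('ni,ij,nj->n', X, C, X) + X @ b + a
    z = np.minimum(z, 30.0)              # guard against overflow during search
    return np.exp(z).sum() - y @ z

# Toy neuron with a single excitatory feature (illustrative parameters).
rng = np.random.default_rng(0)
N, D = 5000, 6
X = rng.standard_normal((N, D))
w_true = np.zeros(D); w_true[0] = 0.8
y = rng.poisson(np.exp(0.5 * (X @ w_true) ** 2 - 1.0))

theta0 = np.zeros(1 + D + D * (D + 1) // 2)
res = minimize(negloglik, theta0, args=(X, y), method="L-BFGS-B")
a_hat, b_hat, C_hat = unpack(res.x, D)
evals, evecs = np.linalg.eigh(C_hat)     # top eigenvector ~ true feature
```

Because the stimuli here are Gaussian, the answer agrees with the closed-form expected-ML estimate; the same code applies unchanged to non-Gaussian stimulus matrices X.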
We refer to ML estimators for (b, W) as maximum-likelihood STA and STC (or exact ML, as opposed to expected-ML estimates from the moment-based formulas (eq. 7); see Figs. 2-3 for comparisons). These estimates will closely match the standard STA and STC-based feature space when stimuli are Gaussian, but (as maximum-likelihood estimates) are also consistent and asymptotically efficient for arbitrary stimuli.
An additional difference between maximum-likelihood and standard STA/STC analysis is that the parameters (b, W) have meaningful units of length: the vector norm of b determines the amplitude of the "linear" contribution to the neural response (via b^T x), while the norm of the columns of W determines the amplitude of "symmetric" excitatory or suppressive contributions to the response (via x^T W S W^T x). Shrinking these vectors (e.g., with a prior) has the effect of reducing their influence in the model, and they drop out of the model entirely if we shrink them to zero (a fact that we will exploit in the next section). By contrast, the standard STA and STC eigenvectors are usually taken as unit vectors, providing a basis for the neural feature space in which the nonlinearity ("N" stage) must still be estimated. We are free to normalize the ML estimates (b̂, Ŵ) and estimate an arbitrary nonlinearity in a similar manner, but it is noteworthy that the parameters (a, b, W) specify a complete encoding model in and of themselves.

3 Bayesian STC

Now that we have defined an explicit model and likelihood function underlying STA and STC analysis, we can straightforwardly apply Bayesian methods for estimation, prediction, error bars, model comparison, etc., by introducing a prior over the model parameters. 
Bayesian methods can be es-\npecially useful in cases where we have prior information (e.g., about smoothness or sparseness of\nneural features, [20\u201325]), and in general have attractive theoretical properties for high-dimensional\ninference problems [26\u201328].\nHere we consider two types of priors: (1) a smoothing prior, which holds the \ufb01lters to be smooth\nin space/time; and (2) a sparsifying prior, which we employ to directly estimate the feature space\ndimensionality (i.e., the number of signi\ufb01cant \ufb01lters). We apply these priors to b and the columns of\nW , in conjunction with either exact (for accuracy) or expected (for speed) log-likelihood functions\nde\ufb01ned above. We refer to the resulting estimators as Bayesian STC (or \u201cBSTC\u201d).\nWe perform BSTC estimation by maximizing the sum of log-likelihood and log-prior to obtain\nmaximum a posteriori (MAP) estimates of the \ufb01lters and constant a. It is worth noting that since\nthe derivatives of the expected likelihood (eq. 6) are also written in terms of STA/STC, optimization\nusing the expected log-likelihood can be carried out more ef\ufb01ciently\u2014it reduces the cost of each\niteration by a factor of N compared to optimizing the exact likelihood (eq. 3).\n\n3.1 Smoothing prior\n\nNeural receptive \ufb01elds are generally smooth, so a prior that encourages this tendency will tend\nto improve performance. Receptive \ufb01eld estimates under such a prior will be smooth unless the\nlikelihood provides suf\ufb01cient evidence for jaggedness. 
To encourage smoothness, we placed a zero-mean Gaussian prior on the second-order differences of each filter [29]:

L w ∼ N(0, λ^{-1} I),   (10)

Figure 2: Estimated filters and error rates for various estimators. An LNP model with 4 orthogonal 32-element filters (see text) was simulated with two types of stimuli (A-B: white Gaussian; C: sparse binary). Mean firing rate 0.16 spk/s. (A) Filters estimated from 10,000 samples (true filter, STA/STC, expected ML, exact ML, Bayesian smoothing). STA/STC filters are normalized to match the norm of the true filters. (B) Convergence to the true filter (reconstruction error vs. number of samples) under each method, Gaussian stimuli. (C) Convergence for sparse binary stimuli.

where L is the discrete Laplacian operator and λ is a hyperparameter controlling the smoothness of feature vectors. This is equivalent to imposing a penalty (given by (λ/2) w_i^T L^T L w_i) on the squared second derivatives of b and the columns of W in the optimization function. A larger λ implies a narrower Gaussian prior on these differences, hence a stronger preference for smooth filters. For simplicity, we assumed all filters came from the same prior, resulting in a single hyperparameter λ for all filters, and used cross-validation to choose an appropriate λ for each dataset.
To illustrate the effects of this prior, we simulated an example dataset from an LNP neuron with exponentiated-quadratic nonlinearity and four 32-element, 1-dimensional (temporal) filters. The filter shapes were given by orthogonalized randomly-placed Gaussians (Fig. 2). 
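The roughness penalty induced by this prior can be sketched as follows (a minimal illustration with a hypothetical 1-D temporal filter; the Gaussian-bump shape and noise level are our choices, not the paper's):

```python
import numpy as np

def second_diff_operator(D):
    """Discrete Laplacian L: (L w)_i = w_{i-1} - 2 w_i + w_{i+1}."""
    L = np.zeros((D - 2, D))
    for i in range(D - 2):
        L[i, i:i + 3] = [1.0, -2.0, 1.0]
    return L

def smoothing_penalty(w, lam, L):
    """Negative log of the prior L w ~ N(0, lam^{-1} I), up to a constant:
    (lam/2) * ||L w||^2, added to the negative log-likelihood for MAP."""
    return 0.5 * lam * np.sum((L @ w) ** 2)

# A smooth 32-element Gaussian-bump filter vs. a jagged version of it.
D = 32
L = second_diff_operator(D)
x = np.linspace(-3, 3, D)
smooth_w = np.exp(-x ** 2)
jagged_w = smooth_w + 0.3 * np.random.default_rng(1).standard_normal(D)
p_smooth = smoothing_penalty(smooth_w, 1.0, L)
p_jagged = smoothing_penalty(jagged_w, 1.0, L)
# The jagged filter pays a far larger penalty, so the MAP estimate is
# smooth unless the likelihood strongly supports jaggedness.
```

Raising λ tightens the prior on second differences, trading data fit for smoothness exactly as described in the text.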
We fixed the dimensionality of our feature space estimates to be the same as the true model, since our focus was the quality of each corresponding filter estimate.
For Gaussian stimuli, we found that classical STA/STC, expected-ML, and exact-ML estimates were indistinguishable (Fig. 2). However, for "sparse" binary stimuli (3 of the 32 pixels set randomly to ±1), for which STA/STC and expected-ML estimates are no longer consistent, we found significantly better performance from the exact-ML estimates (Fig. 2C). Most importantly, for both Gaussian and sparse stimuli alike, the smoothing prior provided a large improvement in the quality of feature space estimates, achieving similar error with 2 orders of magnitude fewer stimuli.

3.2 Automatic selection of feature space dimensionality

While smoothing regularizes receptive field estimates by penalizing filter roughness, a perhaps more critical aspect of the STA/STC model is its vast number of possible parameters due to uncertainty in the number of filters. Our approach to this problem was inspired by Bayesian PCA [10], a method for automatically choosing the number of meaningful principal components using a "feature-selection prior" designed to encourage sparsity. The basic idea behind this approach is that a zero-mean Gaussian prior on each filter w_i (separately controlled by a hyperparameter α_i) can be used to "shrink to zero" any components that do not contribute meaningfully to the evidence, just as in automatic relevance determination (ARD), also known as sparse Bayesian learning [27, 30]. Unlike PCA, we seek to preserve components of the STC matrix with both large and small eigenvalues, which correspond to excitatory and suppressive filters, respectively. 
One solution to this problem, Bayesian Extreme Components Analysis [14], preserves large and small eigenvalues of the covariance matrix, but does not incorporate additional priors on filter shape, and has not yet been formulated for our (Poisson) likelihood function. Instead, we address the problem by using the sign of the diagonal elements in S to determine whether a feature w produces a positive or negative eigenvalue in C (eq. 9). (Recall that the eigenvalues of C = Σ^{-1} − Λ^{-1} are positive and negative, while those of the STC matrix Λ are strictly positive.) Reparametrizing the STC in terms of C therefore allows us to apply a variant of the Bayesian PCA algorithm directly to b and the columns of W.

Figure 3: Goodness-of-fit of estimated models and the estimated dimension as a function of number of samples. The same simulation parameters as Fig. 2 were used. Left: Information per spike (goodness-of-fit, nats/spk; normalized difference in log-likelihoods) captured by different estimates (cross-validation, expected ARD, expected smooth+ARD, exact ARD, exact smooth+ARD), under both expected and exact likelihoods. Models were estimated from 10^3, 10^4, and 5 × 10^4 stimuli respectively. Right: Estimated number of dimensions as a function of the number of training samples. When both smoothing and ARD priors are used, the variability rapidly diminishes to near zero.

The details of our approach are as follows. We put the ARD prior on each column of W:

w_i ∼ N(0, α_i^{-1} I),   (11)

where α_i is a hyperparameter controlling the prior variance of w_i. 
We impose the same prior on b, with an additional hyperparameter α_0, resulting in (D + 1) hyperparameters for the complete model. We initialize b to its ML estimate and the w_i to the eigenvectors of C̃_ml, scaled by the square root of their eigenvalues. Then, we optimize the parameters and hyperparameters in a similar fashion to the Bayesian PCA algorithm [10]: we alternate between maximizing the posterior for the parameters (a, b, W) given hyperparameters α, and evidence optimization (arg max_α Pr[(x, y)|α]) to update α. Since a closed form for the evidence is not known, we use the approximate fixed-point update rule developed in [10]: α_i^new = D / ||w_i||². This update is valid when each element of the receptive field w_i is well defined (non-zero); otherwise it overestimates the corresponding α_i. The algorithm begins with all α_i set to zero (infinite prior variance), giving ML estimates for the parameters. Subsequent updates will cause some α_i to grow without bound, shrinking the prior variance of the corresponding feature vector w_i until it drops out of the model entirely as α_i → ∞. The remaining w_j, for which the α_j remain finite, define the feature space estimate. Note that these updates are fast (especially with the expected log-likelihood), providing a much less computationally intensive estimate of feature space dimensionality than bootstrap-based methods [5].
Figure 3 (left) shows that the ARD prior greatly increases the model goodness-of-fit (likelihood on test data), and is synergistic with the smoothing prior defined above. The improvement (relative to ML estimates) is greatest when the number of samples is small, and it enhances both expected and exact likelihood estimates. We compared this method for estimating feature space dimensionality with a more classical (non-Bayesian) approach based on cross-validation. 
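The pruning dynamic of these alternating updates can be sketched on a toy conjugate-Gaussian surrogate (our stand-in for the Poisson model, chosen so the MAP step has a closed form; all names and parameter values are illustrative):

```python
import numpy as np

def ard_select(W_obs, sigma2=1.0, n_iter=50, alpha_max=1e8):
    """ARD loop sketch: alternate a MAP shrinkage step (closed-form here
    because we pretend each column was observed with Gaussian noise) with
    the fixed-point update alpha_i = D / ||w_i||^2. Columns whose alpha_i
    diverges past alpha_max drop out of the model."""
    D = W_obs.shape[0]
    alpha = np.zeros(W_obs.shape[1])     # alpha = 0: start at the ML estimate
    for _ in range(n_iter):
        W_map = W_obs / (1.0 + alpha * sigma2)          # per-column shrinkage
        norms2 = np.maximum((W_map ** 2).sum(axis=0), 1e-30)
        alpha = np.minimum(D / norms2, alpha_max)
    keep = alpha < alpha_max
    return keep, alpha

rng = np.random.default_rng(0)
D = 32
w_strong = 3.0 * np.ones(D)              # high-SNR feature: alpha stays finite
w_null = np.zeros(D)                     # pure-noise column: alpha diverges
W_obs = np.stack([w_strong, w_null], axis=1) + rng.standard_normal((D, 2))
keep, alpha = ard_select(W_obs)          # the noise column is pruned
```

In BSTC the shrinkage step is a full MAP optimization of the Poisson model rather than this closed form, but the divergence of α_i plays the same role: pruning features that the evidence does not support.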
We first fit a full-rank model with exact likelihood, and built a sparse model by adding filters from this set greedily until the likelihood of test data began to decrease. The resulting estimate of dimension is underestimated when there is not enough data, and even with a large amount of data, it has high variance (Fig. 3, right). In comparison, our ARD-based estimate converged quickly to the correct dimension and exhibited smaller variability. When both smoothing and ARD priors were used, the variability decreased markedly and always achieved the correct dimension even for moderate amounts of data. One additional advantage of the Bayesian approach is that it can use all the available data; under cross-validation, some proportion of data is needed to form the test set (in this example we provided extra data for this method only).

4 Extension: the elliptical-LNP model

Finally, the model and inference procedures we have described above can be extended to a much more general class of response functions with zero additional computational cost. We can replace the exponential function which operates on the quadratic form in the model nonlinearity (eq. 2)

Figure 4: 1-D nonlinear functions g mapping z, the output of the quadratic stage, to spike rate (spk/bin) for a V1 complex cell [16]. The exact-ML filter estimates for W and b were obtained using the smoothing BSTC with an exponential nonlinearity. (Final filter estimates for this cell shown in Fig. 5). 
The quadratic projection (z) was computed using the filter estimates, and is plotted against the observed spike counts (gray circles), a histogram-based estimate of the nonlinearity (green diamonds), the exponential nonlinearity (black trace), a well-known alternative nonlinearity log(1 + e^z) (red), and a cubic spline estimated using 7 knots (green trace). We fixed the fitted cubic spline nonlinearity and then refit the filters, resulting in an estimate of the elliptical-LNP model.

with an arbitrary function g(·), resulting in a model class that includes any elliptically symmetric mapping of the stimulus to spike rate. We call this the elliptical-LNP model.
The elliptical-LNP model can be formalized by writing the nonlinearity f(x) (depicted in Fig. 1) as the composition of two nonlinear functions: a quadratic function that maps the high-dimensional stimulus to the real line, z(x) = ½ x^T C x + b^T x + a, and a 1-D nonlinearity g(z). The full nonlinearity is thus f(x) = g(z(x)).
Although the LNP model with exponential nonlinearity has been widely adopted in neuroscience for its simplicity, the actual nonlinearity of neural systems is often sub-exponential. Moreover, the effect of the nonlinearity is even more pronounced in the exponentiated-quadratic function, and hence it may be helpful to use a sub-exponential function g. Figure 4 shows the nonlinearity of an example neuron from V1 (see next section) compared to g(z) = e^z (the assumption implicit in STA/STC), a more linear function g(z) = log(1 + e^z), and a cubic spline fit by maximum likelihood.
The likelihood given by eq. 3 can be optimized efficiently as long as g and g′ can be computed efficiently. The log-likelihood is concave in (a, b, C) so long as g obeys the standard regularity conditions (convex and log-concave), but we did not impose those conditions here. 
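As a sketch of the two-stage nonlinearity f(x) = g(z(x)), with a softplus g standing in for a sub-exponential choice (all parameter values here are illustrative, not fits from the paper):

```python
import numpy as np

def softplus(z):
    """Sub-exponential nonlinearity g(z) = log(1 + e^z), computed stably."""
    return np.logaddexp(0.0, z)

def elliptical_lnp_rate(X, a, b, W, S, g=softplus):
    """Rate of the elliptical-LNP model: f(x) = g(z(x)), where the quadratic
    stage is z(x) = 0.5 x'(W S W')x + b'x + a (eq. 9 parameterization)."""
    proj = X @ W                         # filter outputs, shape (N, d)
    z = 0.5 * np.einsum('nd,d,nd->n', proj, S, proj) + X @ b + a
    return g(z)

rng = np.random.default_rng(0)
N, D, d = 4, 16, 2
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, d))
S = np.array([1.0, -1.0])                # one excitatory, one suppressive axis
rates = elliptical_lnp_rate(X, a=-1.0, b=np.zeros(D), W=W, S=S)
# Rates are nonnegative and grow only linearly (not exponentially) in z.
```

Swapping `g` for `np.exp` recovers the exponentiated-quadratic model of eq. (2), so the same fitting code serves both cases, as the text notes.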
For fast optimization, we first used the exponentiated-quadratic nonlinearity as an initialization (expected then exact ML), then we refined the model with a spline nonlinearity.

5 Application to neural data

We applied BSTC to data from a V1 complex cell (data published in [16]). The stimulus consisted of oriented binary white noise ("flickering bars") aligned with the cell's preferred orientation. We selected a cell (544l029.p21) that was reported to have a large set of filters, to illustrate the power of our technique. The size of the receptive field was chosen to be 16 bars × 10 time bins, yielding a 160-dimensional stimulus space. Three features of these data make BSTC appropriate: (1) the stimulus is non-Gaussian; (2) the nonlinearity is not exponential (Fig. 4); (3) the filters are smooth in space and time (Fig. 5).
We estimated the nonlinearity using a cubic spline, and applied a smoothing BSTC to 10^4 samples presented at 100 Hz (Fig. 5, top). The ARD-prior BSTC estimate trained on 2 × 10^5 stimuli preserved 14 filters (Fig. 5, bottom). The quality of the filters is qualitatively close to that obtained by STA/STC. However, the resulting model has better overall goodness-of-fit, as well as significant improvement over the exact-ML model for each reduced-dimension model (Fig. 6). To achieve the same level of fit as BSTC with 2 filters, the exact-ML-based sparse model required 6 additional filters (dotted line).
We also compared BSTC to a generalized linear model (GLM) with the same number of linear and quadratic filters fit by STA/STC (a method described previously by [7]). This approach places a prior over the weights on squared filter outputs, but not on the filters themselves. On a test set,

Figure 5: Estimating visual receptive fields from a complex cell. 
Each image corresponds to a normalized filter of 16 spatial pixels (horizontal) by 10 time bins (vertical). (top) The smoothing prior recovers better filters: Bayesian STC (BSTC) with smoothing prior and fixed spline nonlinearity applied to a fixed number of filters (excitatory and suppressive, STA/STC vs. BSTC). (bottom) Sparsification determines the number of filters: BSTC with ARD, smoothing, and spline nonlinearity recovers 14 receptive fields out of 160 (STA/STC vs. BSTC+ARD).

Figure 6: Goodness of model fits (nats/spk, train and test) from the exact-ML solution with exponential nonlinearity compared to BSTC with a fixed spline nonlinearity and smoothing prior (2 × 10^5 samples). Filters are added in the order that increases the likelihood on the training set the most. The corresponding filters are visualized in Fig. 5.

BSTC outperformed the GLM on all cells in the dataset, achieving 34% more bits/spike (normalized log-likelihood) over a population of 50 cells.

6 Conclusion

We have provided an explicit, probabilistic, model-based framework that formalizes the classical moment-based estimators (STA, STC) and a more recent information-theoretic estimator (iSTAC) for neural feature spaces. The maximum of the "expected log-likelihood" under this model, where the expectation is taken with respect to a Gaussian stimulus distribution, corresponds precisely to the moment-based estimators for uncorrelated stimuli. 
A model-based formulation allows us to compute exact maximum-likelihood estimates when stimuli are non-Gaussian, and we have incorporated priors in conjunction with both expected and exact likelihoods to achieve Bayesian methods for smoothing and feature selection (estimation of the number of filters).
The elliptical-LNP model extends BSTC analysis to a richer class of nonlinear response models. Although the assumption of elliptical symmetry makes it less general than information-theoretic estimators such as maximally informative dimensions (MID) [8, 15], it has significant advantages in computational efficiency, number of local optima, and suitability for high-dimensional feature spaces. The elliptical-LNP model may also be easily extended to incorporate spike-history effects by adding linear projections of the neuron's spike history as inputs, as in the generalized linear model (GLM) [9, 17, 25, 31]. We feel the synthesis of multi-dimensional nonlinear stimulus sensitivity (as described here) and non-Poisson, history-dependent spiking presents a promising tool for unlocking the statistical structure of the neural code.

References
[1] J. Bussgang. Crosscorrelation functions of amplitude-distorted Gaussian signals. RLE Technical Reports, 216, 1952.
[2] E. J. Chichilnisky. A simple white noise analysis of neuronal light responses. Network: Comput. Neural Syst., 12:199–213, 2001.
[3] R. de Ruyter and W. Bialek. Real-time performance of a movement-sensitive neuron in the blowfly visual system. Proc. R. Soc. Lond. B, 234:379–414, 1988.
[4] O. Schwartz, E. J. Chichilnisky, and E. P. Simoncelli. Characterizing neural gain control using spike-triggered covariance. Adv. Neural Information Processing Systems, pages 269–276, 2002.
[5] O. Schwartz, J. W. Pillow, N. C. Rust, and E. P. Simoncelli. Spike-triggered neural characterization. J. Vision, 6(4):484–507, 7 2006.
[6] E. P. 
Simoncelli, J. Pillow, L. Paninski, and O. Schwartz. Characterization of neural responses with stochastic stimuli. The Cognitive Neurosciences, III, chapter 23, pages 327–338. MIT Press, 2004.
[7] S. Gerwinn, J. Macke, M. Seeger, and M. Bethge. Bayesian inference for spiking neuron models with a sparsity prior. Adv. in Neural Information Processing Systems 20, pages 529–536. MIT Press, 2008.
[8] L. Paninski. Convergence properties of some spike-triggered analysis techniques. Network: Comput. Neural Syst., 14:437–464, 2003.
[9] J. W. Pillow and E. P. Simoncelli. Dimensionality reduction in neural models: An information-theoretic generalization of spike-triggered average and covariance analysis. J. Vision, 6(4):414–428, 2006.
[10] C. M. Bishop. Bayesian PCA. Adv. in Neural Information Processing Systems, pages 382–388, 1999.
[11] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. J. Royal Statistical Society, Series B, pages 611–622, 1999.
[12] T. P. Minka. Automatic choice of dimensionality for PCA. NIPS, pages 598–604, 2001.
[13] M. Welling, F. Agakov, and C. K. I. Williams. Extreme components analysis. Adv. in Neural Information Processing Systems 16. MIT Press, 2004.
[14] Y. Chen and M. Welling. Bayesian extreme components analysis. IJCAI, 2009.
[15] T. Sharpee, N. C. Rust, and W. Bialek. Analyzing neural responses to natural signals: maximally informative dimensions. Neural Comput, 16(2):223–250, Feb 2004.
[16] N. C. Rust, O. Schwartz, J. A. Movshon, and E. P. Simoncelli. Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6):945–956, Jun 2005.
[17] L. Paninski. Maximum likelihood estimation of cascade point-process neural encoding models. Network: Comput. Neural Syst., 15(4):243–262, November 2004.
[18] N. Brenner, S. P. Strong, R. Koberle, W. Bialek, and R. R. 
de Ruyter van Steveninck. Synergy in a neural code. Neural Comput, 12(7):1531–1552, Jul 2000.
[19] L. Paninski. Maximum likelihood estimation of cascade point-process neural encoding models. Network: Computation in Neural Systems, 15:243–262, 2004.
[20] F. Theunissen, S. David, N. Singh, A. Hsu, W. Vinje, and J. Gallant. Estimating spatio-temporal receptive fields of auditory and visual neurons from their responses to natural stimuli. Network: Comput. Neural Syst., 12:289–316, 2001.
[21] M. Sahani and J. Linden. Evidence optimization techniques for estimating stimulus-response functions. NIPS, 15, 2003.
[22] S. V. David, N. Mesgarani, and S. A. Shamma. Estimating sparse spectro-temporal receptive fields with natural stimuli. Network: Comput. Neural Syst., 18(3):191–212, 2007.
[23] I. H. Stevenson, J. M. Rebesco, N. G. Hatsopoulos, Z. Haga, L. E. Miller, and K. P. Körding. Bayesian inference of functional connectivity and network structure from spikes. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 17(3):203–213, 2009.
[24] S. Gerwinn, J. H. Macke, and M. Bethge. Bayesian inference for generalized linear models for spiking neurons. Frontiers in Computational Neuroscience, 2010.
[25] A. Calabrese, J. W. Schumacher, D. M. Schneider, L. Paninski, and S. M. N. Woolley. A generalized linear model for estimating spectrotemporal receptive fields from responses to natural sounds. PLoS One, 6(1):e16104, 2011.
[26] W. James and C. Stein. Estimation with quadratic loss. 4th Berkeley Symposium on Mathematical Statistics and Probability, 1:361–379, 1960.
[27] M. Tipping. Sparse Bayesian learning and the relevance vector machine. JMLR, 1:211–244, 2001.
[28] D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. PNAS, 100:2197–2202, 2003.
[29] K. R. Rad and L. Paninski. 
Efficient, adaptive estimation of two-dimensional firing rate surfaces via Gaussian process methods. Network: Comput. Neural Syst., 21(3-4):142–168, 2010.
[30] D. Wipf and S. Nagarajan. A new view of automatic relevance determination. Adv. in Neural Information Processing Systems 20, pages 1625–1632. MIT Press, 2008.
[31] W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown. A point process framework for relating neural spiking activity to spiking history, neural ensemble and extrinsic covariate effects. J. Neurophysiol, 93(2):1074–1089, 2005.