{"title": "Convolutional spike-triggered covariance analysis for neural subunit models", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 801, "abstract": "Subunit models provide a powerful yet parsimonious description of neural spike responses to complex stimuli. They can be expressed by a cascade of two linear-nonlinear (LN) stages, with the first linear stage defined by convolution with one or more filters. Recent interest in such models has surged due to their biological plausibility and accuracy for characterizing early sensory responses. However, fitting subunit models poses a difficult computational challenge due to the expense of evaluating the log-likelihood and the ubiquity of local optima. Here we address this problem by forging a theoretical connection between spike-triggered covariance analysis and nonlinear subunit models. Specifically, we show that a ''convolutional'' decomposition of the spike-triggered average (STA) and covariance (STC) provides an asymptotically efficient estimator for the subunit model under certain technical conditions. We also prove the identifiability of such convolutional decomposition under mild assumptions. Our moment-based methods outperform highly regularized versions of the GQM on neural data from macaque primary visual cortex, and achieves nearly the same prediction performance as the full maximum-likelihood estimator, yet with substantially lower cost.", "full_text": "Convolutional Spike-triggered Covariance Analysis\n\nfor Neural Subunit Models\n\nAnqi Wu1\n\nIl Memming Park2\n\nJonathan W. Pillow1\n\n1 Princeton Neuroscience Institute, Princeton University\n\n2 Department of Neurobiology and Behavior, Stony Brook University\n\n{anqiw, pillow}@princeton.edu\nmemming.park@stonybrook.edu\n\nAbstract\n\nSubunit models provide a powerful yet parsimonious description of neural re-\nsponses to complex stimuli. 
They are defined by a cascade of two linear-nonlinear (LN) stages, with the first stage defined by a linear convolution with one or more filters and a common point nonlinearity, and the second by pooling weights and an output nonlinearity. Recent interest in such models has surged due to their biological plausibility and accuracy for characterizing early sensory responses. However, fitting such models poses a difficult computational challenge due to the expense of evaluating the log-likelihood and the ubiquity of local optima. Here we address this problem by providing a theoretical connection between spike-triggered covariance analysis and nonlinear subunit models. Specifically, we show that a "convolutional" decomposition of the spike-triggered average (STA) and covariance (STC) matrix provides an asymptotically efficient estimator for a class of quadratic subunit models. We establish theoretical conditions for identifiability of the subunit and pooling weights, and show that our estimator performs well even in cases of model mismatch. Finally, we analyze neural data from macaque primary visual cortex and show that our moment-based estimator outperforms a highly regularized generalized quadratic model (GQM), and achieves nearly the same prediction performance as the full maximum-likelihood estimator, yet at substantially lower cost.\n\n1 Introduction\n\nA central problem in systems neuroscience is to build flexible and accurate models of the sensory encoding process. Neurons are often characterized as responding to a small number of features in the high-dimensional space of natural stimuli. This motivates the idea of using dimensionality reduction methods to identify the features that affect the neural response [1-9]. 
However, many neurons in the early visual pathway pool signals from a small population of upstream neurons, each of which integrates and nonlinearly transforms the light from a small region of visual space. For such neurons, stimulus selectivity is often not accurately described with a small number of filters [10]. A more accurate description can be obtained by assuming that such neurons pool inputs from an earlier stage of shifted, identical nonlinear "subunits" [11-13].\nRecent interest in subunit models has surged due to their biological plausibility and accuracy for characterizing early sensory responses. In the visual system, linear pooling of shifted rectified linear filters was first proposed to describe sensory processing in the cat retina [14, 15], and more recent work has proposed similar models for responses in other early sensory areas [16-18]. Moreover, recent research in machine learning and computer vision has focused on hierarchical stacks of such subunit models, often referred to as Convolutional Neural Networks (CNNs) [19-21].\nThe subunit models we consider here describe neural responses in terms of an LN-LN cascade, that is, a cascade of two linear-nonlinear (LN) processing stages, each of which involves linear projection and a nonlinear transformation. The first LN stage is convolutional, meaning it is formed from one or more banks of identical, spatially shifted subunit filters, with outputs transformed by a shared subunit nonlinearity. 
The second LN stage consists of a set of weights for linearly pooling the nonlinear subunits, an output nonlinearity for mapping the output into the neuron's response range, and finally, a noise source for capturing the stochasticity of neural responses (typically assumed to be Gaussian, Bernoulli or Poisson). Vintch et al. proposed one variant of this type of subunit model, and showed that it could account parsimoniously for the multi-dimensional input-output properties revealed by spike-triggered analysis of V1 responses [12, 13].\nHowever, fitting such models remains a challenging problem. Simple LN models with Gaussian or Poisson noise can be fit very efficiently with spike-triggered-moment based estimators [6-8], but there is no equivalent theory for LN-LN or subunit models. This paper aims to fill that gap. We show that a convolutional decomposition of the spike-triggered average (STA) and covariance (STC) provides an asymptotically efficient estimator for a Poisson subunit model under certain technical conditions: the stimulus is Gaussian, the subunit nonlinearity is well described by a second-order polynomial, and the final nonlinearity is exponential. In this case, the subunit model represents a special case of a canonical Poisson generalized quadratic model (GQM), which allows us to apply the expected log-likelihood trick [7, 8] to reduce the log-likelihood to a form involving only the moments of the spike-triggered stimulus distribution. Estimating the subunit model from these moments, an approach we refer to as convolutional STC, has fixed computational cost that does not scale with the dataset size after a single pass through the data to compute sufficient statistics. We also establish theoretical conditions under which the model parameters are identifiable. 
Finally, we show that convolutional STC is robust to modest degrees of model mismatch, and is nearly as accurate as the full maximum likelihood estimator when applied to neural data from V1 simple and complex cells.\n\n2 Subunit Model\n\nWe begin with a general definition of the Poisson convolutional subunit model (Fig. 1). The model is specified by:\n\nsubunit outputs: s_{mi} = f(k_m \cdot x_i)   (1)\nspike rate: \lambda = g( \sum_m \sum_i w_{mi} s_{mi} )   (2)\nspike count: y | \lambda \sim \mathrm{Poiss}(\lambda)   (3)\n\nwhere k_m is the filter for the m'th type of subunit, x_i is the vectorized stimulus segment in the i'th position of the shifted filter during convolution, and f is the nonlinearity governing subunit outputs. For the second stage, w_{mi} is a linear pooling weight from the m'th subunit at position i, and g is the neuron's output nonlinearity. Spike count y is conditionally Poisson with rate \lambda.\n\n[Figure 1: Schematic of subunit LN-LNP cascade model: the stimulus passes through the first LN stage (subunit filters and shared subunit nonlinearity), then the second LN stage (pooling weights and output nonlinearity), followed by Poisson spiking. For simplicity, we show only 1 subunit type.]\n\nFitting subunit models with arbitrary g and f poses significant computational challenges. However, if we set g to exponential and let f take the form of a second-order polynomial, the model reduces to\n\n\lambda = \exp( \frac{1}{2} \sum_{m,i} w_{mi} (k_m \cdot x_i)^2 + \sum_{m,i} w_{mi} (k_m \cdot x_i) + a )   (4)\n  = \exp( \frac{1}{2} x^\top C_{[w,k]} x + b_{[w,k]}^\top x + a )   (5)\n\nwhere\n\nC_{[w,k]} = \sum_m K_m^\top \mathrm{diag}(w_m) K_m,   b_{[w,k]} = \sum_m K_m^\top w_m,   (6)\n\nand K_m is a Toeplitz matrix consisting of shifted copies of k_m satisfying K_m x = [x_1, x_2, x_3, \ldots]^\top k_m.\nIn essence, these restrictions on the two nonlinearities reduce the subunit model to a (canonical-form) Poisson generalized quadratic model (GQM) [7, 8, 22], that is, a model in which the Poisson spike rate takes the form of an exponentiated quadratic function of the stimulus. We will pursue the implications of this mapping below. 
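As an illustrative sketch (not part of the original paper), the generative model of eqs. (1)-(3) with a quadratic f and exponential g can be simulated in a few lines of numpy; the particular filter shape, pooling profile, offset, and sizes below are arbitrary choices for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_k, d_w = 8, 33        # subunit filter length and number of pooling weights
d = d_k + d_w - 1       # stimulus dimension per time bin (here 40)
N = 5000                # number of time bins

k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)                                             # subunit filter (unit norm)
w = 0.1 * np.exp(-0.5 * ((np.arange(d_w) - d_w / 2) / 5.0) ** 2)   # smooth pooling weights
a = -4.0                                                           # offset controlling mean rate

X = rng.standard_normal((N, d))                                    # Gaussian stimuli

# First LN stage (eq. 1): dot k with each shifted stimulus segment x_i,
# then apply a quadratic subunit nonlinearity f(z) = z^2/2 + z.
proj = np.stack([X[:, i:i + d_k] @ k for i in range(d_w)], axis=1)  # (N, d_w)
s = 0.5 * proj ** 2 + proj

# Second LN stage (eqs. 2-3): pool with w, exponential output nonlinearity,
# conditionally Poisson spike counts.
rate = np.exp(s @ w + a)
y = rng.poisson(rate)
```

Because f is quadratic and g exponential, the same rate equals exp((1/2) x^T C x + b^T x + a) with C = K^T diag(w) K and b = K^T w as in eq. (6).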
We assume that k is a spatial filter vector without time expansion. If we have a spatio-temporal stimulus-response, k should be a spatio-temporal filter, but the subunit convolution (across filter position i) involves only the spatial dimension(s). From (eqs. 4 and 5) it can be seen that the subunit model contains fewer parameters than a full GLM, making it a more parsimonious description for neurons with multi-dimensional stimulus selectivity.\n\n3 Estimators for Subunit Model\n\nWith the above definitions and formulations, we now present three estimators for the model parameters {w, k}. To simplify the notation, we omit the subscript in C_{[w,k]} and b_{[w,k]}, but their dependence on the model parameters is assumed throughout.\n\nMaximum Log-Likelihood Estimator\n\nThe maximum log-likelihood estimator (MLE) has excellent asymptotic properties, though it comes with a high computational cost. The log-likelihood function can be written:\n\nL_{MLE}(\theta) = \sum_i y_i \log \lambda_i - \sum_i \lambda_i   (7)\n  = \sum_i y_i ( \frac{1}{2} x_i^\top C x_i + b^\top x_i + a ) - \sum_i \exp( \frac{1}{2} x_i^\top C x_i + b^\top x_i + a )   (8)\n  = \frac{1}{2} \mathrm{Tr}[C \Lambda] + b^\top \mu + a \, n_{sp} - [ \sum_i \exp( \frac{1}{2} x_i^\top C x_i + b^\top x_i + a ) ]   (9)\n\nwhere \mu = \sum_i y_i x_i is the spike-triggered average (STA), \Lambda = \sum_i y_i x_i x_i^\top is the spike-triggered covariance (STC), and n_{sp} = \sum_i y_i is the total number of spikes. We denote the MLE as \theta_{MLE}.\n\nMoment-based Estimator with Expected Log-Likelihood Fitting\n\nIf the stimuli are drawn from x \sim N(0, \Sigma), a zero-mean Gaussian with covariance \Sigma, then the expression in square brackets divided by N in (eq. 9) will converge to its expectation, given by\n\nE[ \exp( \frac{1}{2} x^\top C x + b^\top x + a ) ] = |I - C\Sigma|^{-1/2} \exp( \frac{1}{2} b^\top (\Sigma^{-1} - C)^{-1} b + a ).   (10)\n\nSubstituting this expectation into (9) yields a quantity called the expected log-likelihood, with the objective function\n\nL_{ELL}(\theta) = \frac{1}{2} \mathrm{Tr}[C \Lambda] + b^\top \mu + a \, n_{sp} - N |I - C\Sigma|^{-1/2} \exp( \frac{1}{2} b^\top (\Sigma^{-1} - C)^{-1} b + a )   (11)\n\nwhere N is the number of time bins. 
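As a sketch of how eqs. (9)-(11) translate to code (the function names are ours, not the paper's), the sufficient statistics can be accumulated in one pass over the data, after which the expected log-likelihood is evaluated at a cost independent of the dataset size:

```python
import numpy as np

def suff_stats(X, y):
    """Single pass over the data: STA mu, STC Lambda, and spike count n_sp (eq. 9)."""
    mu = X.T @ y
    Lam = (X * y[:, None]).T @ X
    return mu, Lam, y.sum()

def expected_loglik(C, b, a, mu, Lam, n_sp, N, Sigma):
    """Expected log-likelihood of eq. (11) for stimuli x ~ N(0, Sigma).

    Requires Sigma^{-1} - C to be positive definite so that the Gaussian
    integral of eq. (10) exists.
    """
    d = Sigma.shape[0]
    A = np.linalg.inv(Sigma) - C
    gauss_int = (np.linalg.det(np.eye(d) - C @ Sigma) ** -0.5
                 * np.exp(0.5 * b @ np.linalg.solve(A, b) + a))
    return 0.5 * np.trace(C @ Lam) + b @ mu + a * n_sp - N * gauss_int
```

For large N this matches the exact Poisson log-likelihood up to the Gaussian-expectation substitution of eq. (10).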
We refer to \theta_{MELE} = \arg\max_\theta L_{ELL}(\theta) as the MELE (maximum expected log-likelihood estimator) [7, 8, 22].\n\nMoment-based Estimator with Least Squares Fitting\n\nMaximizing (11) w.r.t. {C, b, a} yields analytical expected maximum likelihood estimates [7]:\n\nC_{mele} = \Sigma^{-1} - \Lambda^{-1},   b_{mele} = \Lambda^{-1}\mu,   a_{mele} = \log( n_{sp} / ( N |\Lambda\Sigma^{-1}|^{1/2} ) ) - \frac{1}{2} \mu^\top \Lambda^{-1} \mu.   (12)\n\nWith these analytical estimates, it is straightforward to optimize w and k by directly minimizing the squared error\n\nL_{LS}(\theta) = || C_{mele} - K^\top \mathrm{diag}(w) K ||_2^2 + || b_{mele} - K^\top w ||_2^2,   (13)\n\nwhich corresponds to an optimal "convolutional" decomposition of the moment-based estimates. This formulation shows that the eigenvectors of C_{mele} are spanned by shifted copies of k. We denote this estimate \theta_{LS}.\nAll three estimators, \theta_{MLE}, \theta_{MELE} and \theta_{LS}, should provide consistent estimates for the subunit model parameters due to the consistency of the ML and MELE estimates. However, the moment-based estimates (MELE and LS) are computationally much simpler, and scale much better to large datasets, due to the fact that they depend on the data only via the spike-triggered moments. In fact their only dependence on the dataset size is the cost of computing the STA and STC in one pass through the data. As for efficiency, \theta_{LS} has the drawback of being sensitive to noise in the C_{mele} estimate, which has far more free parameters than the two vectors w and k (for a 1-subunit model). Therefore, accurate estimation of C_{mele} should be a precondition for good performance of \theta_{LS}, and we expect \theta_{MELE} to perform better for small datasets.\n\n4 Identifiability\n\nThe equality C = C_{[w,k]} = K^\top \mathrm{diag}(w) K is the core assumption bridging the theoretical connection between the subunit model and the spike-triggered moments (STA & STC). 
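A minimal sketch of the resulting moment-based pipeline (helper names are ours): the closed-form estimates of eq. (12), followed by the least-squares objective of eq. (13) over the convolutional decomposition:

```python
import numpy as np

def mele_params(mu, Lam, n_sp, N, Sigma):
    """Closed-form expected-ML estimates of eq. (12)."""
    Lam_inv = np.linalg.inv(Lam)
    C = np.linalg.inv(Sigma) - Lam_inv
    b = Lam_inv @ mu
    a = (np.log(n_sp / (N * np.sqrt(np.linalg.det(Lam @ np.linalg.inv(Sigma)))))
         - 0.5 * mu @ Lam_inv @ mu)
    return C, b, a

def conv_matrix(k, d_w):
    """Toeplitz K whose rows are shifted copies of k, so (K x)_i = k . x_i."""
    return np.stack([np.concatenate([np.zeros(i), k, np.zeros(d_w - 1 - i)])
                     for i in range(d_w)])

def ls_loss(w, k, C_mele, b_mele):
    """Squared-error objective of eq. (13) for the convolutional decomposition."""
    K = conv_matrix(k, len(w))
    return (np.linalg.norm(C_mele - K.T @ (w[:, None] * K)) ** 2
            + np.linalg.norm(b_mele - K.T @ w) ** 2)
```

Note that ls_loss is invariant to jointly shifting k and w in opposite directions (and to a compensating rescaling), which is precisely the shift/scale ambiguity characterized in Section 4.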
In case we care about recovering the underlying biological structure, we may be interested in knowing when the solution is unique and directly interpretable. Here we address the identifiability of the convolutional decomposition of C for k and w estimation. Specifically, we briefly study the uniqueness of the form C = K^\top \mathrm{diag}(w) K for a single subunit and for multiple subunits respectively. We provide the proof for the single subunit case in the main text, and the proof for multiple subunits sharing the same pooling weight w in the supplement.\nNote that failure of identifiability only indicates that there are possible symmetries in the solution space, so that there are multiple equivalent optima; this is a question of theoretical interest, but it holds no implications for practical performance.\n\n4.1 Identifiability for Single Subunit Model\n\nWe will frequently make use of the frequency domain representation. Let B \in R^{d \times d} denote the discrete Fourier transform (DFT) matrix whose j-th column is\n\nb_j = [ 1, e^{-\frac{2\pi i}{d}(j-1)}, e^{-\frac{2\pi i}{d} 2(j-1)}, e^{-\frac{2\pi i}{d} 3(j-1)}, \ldots, e^{-\frac{2\pi i}{d}(d-1)(j-1)} ]^\top.   (14)\n\nWe assume that k and w have full support in the frequency domain. Let \tilde{k} be the d-dimensional vector resulting from a discrete Fourier transform, that is, \tilde{k} = B_k k where B_k is a d \times d_k DFT matrix, and similarly let \tilde{w} \in R^d be the Fourier representation of w.\n\nAssumption 1. No element in \tilde{k} or \tilde{w} is zero.\n\nTheorem. Suppose Assumption 1 holds. Then the convolutional decomposition C = K^\top \mathrm{diag}(w) K is uniquely identifiable up to shift and scale, where C \in R^{d \times d} and d = d_k + d_w - 1.\n\nProof. We fix k (and thus \tilde{k}) to be a unit vector to deal with the obvious scale invariance. First note that we can rewrite the convolution operator K using DFT matrices as\n\nK = B^H \mathrm{diag}(B_k k) B_w   (15)\n\nwhere B \in R^{d \times d} is the DFT matrix and (\cdot)^H denotes the conjugate transpose operation. Thus,\n\nC = B^H \mathrm{diag}(\tilde{k})^H B_w \mathrm{diag}(w) B_w^H \mathrm{diag}(\tilde{k}) B.   (16)\n\nNote that \tilde{W} := B_w \mathrm{diag}(w) B_w^H is a circulant matrix,\n\n\tilde{W} := \mathrm{circulant}(\tilde{w}) = ( \tilde{w}_1, \tilde{w}_d, \tilde{w}_{d-1}, \ldots, \tilde{w}_2 ;\; \tilde{w}_2, \tilde{w}_1, \tilde{w}_d, \ldots, \tilde{w}_3 ;\; \ldots ;\; \tilde{w}_d, \tilde{w}_{d-1}, \tilde{w}_{d-2}, \ldots, \tilde{w}_1 )   (17)\n\n(rows separated by semicolons). Hence, we can rewrite (16) in the frequency domain as\n\n\tilde{C} = B C B^H = \mathrm{diag}(\tilde{k})^H \tilde{W} \mathrm{diag}(\tilde{k}) = \tilde{W} \circ (\tilde{k}\tilde{k}^H)^\top   (18)\n\nwhere \circ denotes the Hadamard (element-wise) product. Since B is invertible, the uniqueness of the original decomposition of C is equivalent to the uniqueness of the decomposition of \tilde{C}. The newly defined decomposition is\n\n\tilde{C} = \tilde{W} \circ (\tilde{k}\tilde{k}^H)^\top.   (19)\n\nSuppose there are two distinct decompositions {\tilde{W}, \tilde{k}} and {\tilde{V}, \tilde{g}}, where both {k, \tilde{k}} and {g, \tilde{g}} are unit vectors, such that \tilde{C} = \tilde{W} \circ (\tilde{k}\tilde{k}^H)^\top = \tilde{V} \circ (\tilde{g}\tilde{g}^H)^\top. Since both \tilde{W} and \tilde{V} have no zero elements, define the element-wise ratio R := (\tilde{W} ./ \tilde{V})^\top \in R^{d \times d}; then we have\n\nR \circ \tilde{k}\tilde{k}^H = \tilde{g}\tilde{g}^H.   (20)\n\nNote that \mathrm{rank}(R \circ \tilde{k}\tilde{k}^H) = \mathrm{rank}(\tilde{g}\tilde{g}^H) = 1. R is also a circulant matrix, which can be diagonalized by the DFT [23]: R = B \, \mathrm{diag}(r_1, \ldots, r_d) B^H, so we can express R as R = \sum_{i=1}^d r_i b_i b_i^H. Using the identity for the Hadamard product that for any vectors a and b, (a a^H) \circ (b b^H) = (a \circ b)(a \circ b)^H, we get\n\nR \circ \tilde{k}\tilde{k}^H = \sum_{i=1}^d r_i (b_i b_i^H) \circ (\tilde{k}\tilde{k}^H) = \sum_{i=1}^d r_i (b_i \circ \tilde{k})(b_i \circ \tilde{k})^H.   (21)\n\nBy Lemma 1 (in the appendix), {b_1 \circ \tilde{k}, b_2 \circ \tilde{k}, \ldots, b_d \circ \tilde{k}} is a linearly independent set. Therefore, to satisfy the rank constraint \mathrm{rank}(R \circ \tilde{k}\tilde{k}^H) = 1, r_i can be non-zero for at most a single i. Without loss of generality, let r_i \neq 0 and all other r be zero; then we have\n\nr_i (b_i b_i^H) \circ \tilde{k}\tilde{k}^H = \tilde{g}\tilde{g}^H \implies r_i \, \mathrm{diag}(b_i) \tilde{k}\tilde{k}^H \mathrm{diag}(b_i)^H = \tilde{g}\tilde{g}^H.   (22)\n\nBecause b_i, \tilde{k} and \tilde{g} are unit vectors, r_i = 1. By recognizing that \mathrm{diag}(b_i)\tilde{k} is the Fourier transform of k shifted by i-1 positions, denoted k^{i-1}, we have k^{i-1}(k^{i-1})^\top = g g^\top. Therefore, g = k^{i-1}. Moreover, from (20) and (22), (b_i b_i^H) \circ \tilde{V} = \tilde{W}, and thus v^{i-1} = w; that is, v must also be a shifted version of w. If we restrict k and g to be unit vectors, then any solution v and g must satisfy w = v^{i-1} and g = k^{i-1}. Therefore, the two decompositions are identical up to scale and shift.\n\n4.2 Identifiability for Multiple Subunits Model\n\nThe multiple subunits model (with m > 1 subunits) is far more complicated to analyze due to the large degree of hidden invariances. In this study, we only provide the analysis under a specific condition, namely when all subunits share a common pooling weight w.\n\nAssumption 2. All subunits share a common w.\n\nWe make a few additional assumptions. We would like to consider a tight parameterization where no combination of subunits can take over another subunit's task.\n\nAssumption 3. K := [k_1, k_2, k_3, \ldots, k_m] spans an m-dimensional subspace, where k_i is the subunit filter for the i-th subunit and K \in R^{d_k \times m}. In addition, K has orthogonal columns.\n\nWe denote K with p positions shifted along the columns as K^p := [k_1^p, k_2^p, \ldots, k_m^p]. Also, note that trivially m \le d_k < d_k + d_w - 1 = d, since d_w > 1. To allow an arbitrary scale corresponding to each unit vector k_i, we introduce a coefficient \alpha_i for the i-th subunit, thus extending (19) to\n\n\tilde{C} = \sum_{i=1}^m \tilde{W} \circ (\alpha_i \tilde{k}_i \tilde{k}_i^H)^\top = \tilde{W} \circ ( \sum_{i=1}^m \alpha_i \tilde{k}_i \tilde{k}_i^H )^\top = \tilde{W} \circ (\tilde{K} A \tilde{K}^H)^\top   (23)\n\nwhere A \in R^{m \times m} is a diagonal matrix of the \alpha_i and \tilde{K} \in R^{d \times m} is the DFT of K.\n\nAssumption 4. There exists no \Omega \in R^{m \times m} such that K^i \Omega = P K^i, \forall i, where P \in R^{d_k \times d_k} is the permutation matrix from K^i to K^j by shifting rows, namely K^j = P K^i, \forall i, j, and \Omega is a linear projection coefficient matrix satisfying K^j = K^i \Omega.\n\nAssumption 5. 
A has all positive or all negative values on the diagonal.\n\nGiven these assumptions, we establish the following proposition for the multiple subunits model.\n\nProposition. Under Assumptions 1-5, the convolutional decomposition \tilde{C} = \tilde{W} \circ (\tilde{K} A \tilde{K}^H)^\top is uniquely identifiable up to shift and scale.\n\nThe proof of the proposition and illustrations of Assumptions 4-5 are in the supplement.\n\nFigure 2: a) True parameters and the MELE and smoothMELE estimates. b) Speed performance for smoothLS, smoothMELE and smoothMLE. The slightly decreasing running time at larger sample sizes results from an increasingly fully supported subspace, which makes the optimization require fewer iterations. c) Accuracy performance (MSE) for all combinations of subunit nonlinearities (quadratic, sigmoid) and output nonlinearities (exponential, soft-rectifier) for smoothLS, smoothMELE and smoothMLE. Top left is the subunit model matching the data; the others are model mismatch.\n\n5 Experiments\n\n5.1 Initialization\n\nAll three estimators are non-convex and contain many local optima, so the choice of model initialization affects the optimization substantially. Similar to [12], which uses "convolutional STC" for initialization, we also use a simple moment-based method with some assumptions. 
For simplicity, we assume that all subunit models share the same w with different scaling factors, as in eq. (23). Our initializer is generated from a shallow bilinear regression. First, we initialize w with a wide Gaussian profile, and then estimate \tilde{K} A \tilde{K}^H from the element-wise division of C_{mele} by \tilde{W}. Second, we use the SVD to decompose \tilde{K} A \tilde{K}^H into an orthogonal basis set \tilde{K} and a positive diagonal matrix A, where \tilde{K} and A contain information about the k_i's and \alpha_i's respectively, hypothesizing that the k's are orthogonal to each other and the \alpha's are all positive (Assumptions 3 and 5). Based on the k_i's and \alpha_i's estimated from the rough Gaussian profile of w, we then fix those and re-estimate w with the same element-wise division for \tilde{W}. This bilinear iterative procedure proceeds for only a few iterations in order to avoid overfitting to C_{mele}, which is a coarse estimate of C.\n\n5.2 Smoothing prior\n\nNeural receptive fields are generally smooth, so a prior smoothing out high-frequency fluctuations should improve the performance of the estimators, unless the data likelihood provides sufficient evidence for jaggedness. We apply automatic smoothness determination (ASD [24]) to both w and k, each with an associated balancing hyperparameter \lambda_w and \lambda_k. We assume w \sim N(0, C_w) with\n\nC_w = \exp( -\rho_w - \frac{\|\Delta\|^2}{2\sigma_w^2} )   (24)\n\nwhere \Delta is the vector of differences between neighboring locations in w, and \rho_w and \sigma_w^2 are the variance and length scale of C_w that belong to the hyperparameter set. k also has the same ASD prior, with hyperparameters \rho_k and \sigma_k^2. 
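The ASD covariance of eq. (24), written elementwise over pairs of coefficient positions as in [24], can be sketched as follows (a hypothetical helper with our own names; in practice the hyperparameters are selected by evidence optimization or cross-validation):

```python
import numpy as np

def asd_covariance(n, rho, len_scale):
    """ASD prior covariance in the spirit of eq. (24):
    C_ij = exp(-rho - (i - j)^2 / (2 * len_scale^2)).

    Nearby coefficients are strongly correlated, so draws w ~ N(0, C) are
    smooth; rho sets the overall (log) scale and len_scale the correlation
    length over positions.
    """
    idx = np.arange(n)
    sqdist = (idx[:, None] - idx[None, :]) ** 2
    return np.exp(-rho - sqdist / (2.0 * len_scale ** 2))
```

A larger len_scale yields a more slowly decaying covariance and hence smoother samples, which is the behavior the smoothing prior is meant to encourage.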
For multiple subunits, each w_i and k_i has its own ASD prior.\n\nFigure 3: Goodness-of-model fits from various estimators and their running speeds (without GQM comparisons). Black curves are regularized GQM (with and without the expected log-likelihood trick); blue is smoothLS; green is smoothMELE; red is smoothMLE. All the subunit estimators have results for 1 subunit and 2 subunits. The inset figure in the performance panel is an enlarged view for large goodness-of-fit values. The right figure shows the speed results: the MLE-based methods require exponentially increasing running time as the training size increases, but our moment-based ones have quite consistent speed.\n\nFig. 2a shows the true w and k and the estimates from MELE and smoothMELE (MELE with smoothing prior). From now on, we use the smoothing prior by default.\n\n5.3 Simulations\n\nTo illustrate the performance of our moment-based estimators, we generated Gaussian stimuli and simulated responses from an LNP neuron with exponentiated-quadratic nonlinearity and a 1-subunit model with an 8-element filter k and 33-element pooling weights w. The mean firing rate is 0.91 spk/s. In our estimation, each time-bin stimulus with 40 dimensions is treated as one sample to generate the spike response. Fig. 2b and c show the speed and accuracy performance of the three estimators LS, MELE and MLE (with smoothing prior). 
LS and MELE are comparable to the baseline MLE in terms of accuracy but are substantially faster.\nAlthough the LNP model with exponential nonlinearity has been widely adopted in neuroscience for its simplicity, the actual nonlinearity of neural systems is often sub-exponential, such as a soft-rectifier nonlinearity. But the exponential is favored as a convenient approximation of the soft-rectifier within a small regime around the origin. Also, an LNP neuron generally leans towards a sigmoid subunit nonlinearity rather than a quadratic one; a quadratic can well approximate a sigmoid within the small nonlinear regime preceding the linear regime of the sigmoid. Therefore, in order to check the generalization performance of LS and MELE on mismatched models, we simulated data from a neuron with a sigmoid subunit nonlinearity or a soft-rectifier output nonlinearity, as shown in Fig. 2c. The full MLEs, formulated with no model mismatch, provide a baseline for inspecting the performance of the ELL methods. Despite the model mismatch, our estimators (LS and MELE) are on par with the MLE when the subunit nonlinearity is quadratic, but the performance is notably worse for the sigmoid nonlinearity. Even so, in real applications we will explore fits with different subunit nonlinearities using the full MLE, where the exponential and quadratic assumption is thus primarily useful as a reasonable and extremely fast initializer. Moreover, the running time for the moment-based estimators is always far shorter. 
The size of the receptive field was chosen to be (# of bars d) x 10 time bins, yielding a 10d-dimensional stimulus space. The time bin size is 10 ms and the number of bars (d) is 16 in our experiment.\nWe compared the moment-based estimators and the MLE with a smoothed low-rank expected GQM and a smoothed low-rank GQM [7, 8]. Models are trained on stimuli with size varying from 6.25 x 10^3 to 10^5 samples and tested on 5 x 10^4 samples. Each subunit filter has a length of 5. All hyperparameters are chosen by cross validation. Fig. 3 shows that the GQM is weakly better than LS, but its running time is far greater than that of LS (data not shown). Both MELE and MLE (but not LS) outperform the GQM and expected GQM with both 1 subunit and 2 subunits. The improvement is especially great with 1 subunit, which results from averaging over all simple and complex cells; generally, the more "complex" the cell is, the higher the probability that multiple subunits will fit better. Notably, MELE outperforms the others, with the best goodness-of-fit and a flat speed curve.\n\nFigure 4: Estimating visual receptive fields from a complex cell (544l029.p21). a) k and w obtained by fitting smoothMELE(#2). Subunit #1 is suppressive (negative w) and #2 is excitatory (positive w). From the y-axis of w we can tell that, for both subunits, the middle subunits contribute more than the ends. b) Qualitative analysis. Each image corresponds to a normalized filter of 24 spatial pixels (horizontal) by 10 time bins (vertical). Top row: STA and excitatory/suppressive STC filters from the true data; bottom row: the same subspace analysis applied to responses simulated from the 2-subunit MELE model given the true stimuli. 
The goodness-of-fit is defined to be the log-likelihood on the test set divided by the spike count.\nFor qualitative analysis, we ran smoothMELE(#2) for a complex cell and learned the optimal subunit filters and pooling weights (Fig. 4a), and then simulated V1 responses from the 2-subunit MELE generative model given the optimal parameters. STA/STC analysis was applied to both the neural data and the simulated V1 response data. The filters trained on 10^5 stimuli are qualitatively close to those obtained by STA/STC (Fig. 4b). Subunit models can recover the STA, the first six excitatory STC filters and the last four suppressive ones, but with a considerably more parsimonious parameter space.\n\n6 Conclusion\n\nWe proposed an asymptotically efficient estimator for quadratic convolutional subunit models, which forges an important theoretical link between spike-triggered covariance analysis and nonlinear subunit models. We have shown that the proposed method works well even when the assumptions about the model specification (nonlinearity and input distribution) were violated. Our approach reduces the difficulty of fitting subunit models because the computational cost does not depend on the dataset size (beyond the cost of a single pass through the data to compute the spike-triggered moments). We also proved conditions for identifiability of the convolutional decomposition, which reveal that in most cases the parameters are indeed identifiable. We applied our estimators to neural data from macaque primary visual cortex, and showed that they outperform a highly regularized form of the GQM and achieve similar performance to the subunit model MLE at substantially lower computational cost.\n\nAcknowledgments\n\nThis work was supported by the Sloan Foundation (JP), McKnight Foundation (JP), Simons Global Brain Award (JP), NSF CAREER Award IIS-1150186 (JP), and a grant from the NIH (NIMH grant MH099611 to JP). We thank N. 
Rust and T. Movshon for V1 data.\n\nReferences\n\n[1] R. R. de Ruyter van Steveninck and W. Bialek. Real-time performance of a movement-sensitive neuron in the blowfly visual system: coding and information transmission in short spike sequences. Proc. R. Soc. Lond. B, 234:379-414, 1988.\n[2] J. Touryan, B. Lau, and Y. Dan. Isolation of relevant visual features from random stimuli for cortical complex cells. Journal of Neuroscience, 22:10811-10818, 2002.\n[3] B. Aguera y Arcas and A. L. Fairhall. What causes a neuron to spike? Neural Computation, 15(8):1789-1807, 2003.\n[4] Tatyana Sharpee, Nicole C. Rust, and William Bialek. Analyzing neural responses to natural signals: maximally informative dimensions. Neural Computation, 16(2):223-250, 2004.\n[5] O. Schwartz, J. W. Pillow, N. C. Rust, and E. P. Simoncelli. Spike-triggered neural characterization. Journal of Vision, 6(4):484-507, 2006.\n[6] J. W. Pillow and E. P. Simoncelli. Dimensionality reduction in neural models: An information-theoretic generalization of spike-triggered average and covariance analysis. Journal of Vision, 6(4):414-428, 2006.\n[7] Il Memming Park and Jonathan W. Pillow. Bayesian spike-triggered covariance analysis. In Advances in Neural Information Processing Systems 24, pages 1692-1700, 2011.\n[8] Il M. Park, Evan W. Archer, Nicholas Priebe, and Jonathan W. Pillow. Spectral methods for neural characterization using generalized quadratic models. In Advances in Neural Information Processing Systems 26, pages 2454-2462, 2013.\n[9] Ross S. Williamson, Maneesh Sahani, and Jonathan W. Pillow. The equivalence of information-theoretic and likelihood-based methods for neural dimensionality reduction. PLoS Comput Biol, 11(4):e1004141, 2015.\n[10] Kanaka Rajan, Olivier Marre, and Gašper Tkačik. 
Learning quadratic receptive fields from neural responses to natural stimuli. Neural Computation, 25(7):1661-1692, 2013.\n[11] Nicole C. Rust, Odelia Schwartz, J. Anthony Movshon, and Eero P. Simoncelli. Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6):945-956, 2005.\n[12] B. Vintch, A. Zaharia, J. A. Movshon, and E. P. Simoncelli. Efficient and direct estimation of a neural subunit model for sensory coding. In Advances in Neural Information Processing Systems 25, Cambridge, MA, 2012. MIT Press.\n[13] Brett Vintch, Andrew Zaharia, J. Anthony Movshon, and Eero P. Simoncelli. A convolutional subunit model for neuronal responses in macaque V1. J. Neurosci, in press, 2015.\n[14] H. B. Barlow and W. R. Levick. The mechanism of directionally selective units in rabbit's retina. The Journal of Physiology, 178(3):477, 1965.\n[15] S. Hochstein and R. Shapley. Linear and nonlinear spatial subunits in Y cat retinal ganglion cells. J. Physiol., 262:265-284, 1976.\n[16] Jonathan B. Demb, Kareem Zaghloul, Loren Haarsma, and Peter Sterling. Bipolar cells contribute to nonlinear spatial summation in the brisk-transient (Y) ganglion cell in mammalian retina. The Journal of Neuroscience, 21(19):7447-7454, 2001.\n[17] Joanna D. Crook, Beth B. Peterson, Orin S. Packer, Farrel R. Robinson, John B. Troy, and Dennis M. Dacey. Y-cell receptive field and collicular projection of parasol ganglion cells in macaque monkey retina. The Journal of Neuroscience, 28(44):11277-11291, 2008.\n[18] P. X. Joris, C. E. Schreiner, and A. Rees. Neural processing of amplitude-modulated sounds. Physiological Reviews, 84(2):541-577, 2004.\n[19] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. 
Biological Cybernetics, 36(4):193-202, 1980.\n[20] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):411-426, 2007.\n[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.\n[22] Alexandro D. Ramirez and Liam Paninski. Fast inference in generalized linear models via expected log-likelihoods. Journal of Computational Neuroscience, pages 1-20, 2013.\n[23] Philip J. Davis. Circulant Matrices. American Mathematical Society, 1979.\n[24] M. Sahani and J. Linden. Evidence optimization techniques for estimating stimulus-response functions. NIPS, 15, 2003.", "award": [], "sourceid": 520, "authors": [{"given_name": "Anqi", "family_name": "Wu", "institution": "Princeton University"}, {"given_name": "Il Memming", "family_name": "Park", "institution": "Stony Brook University"}, {"given_name": "Jonathan", "family_name": "Pillow", "institution": "Princeton University"}]}