{"title": "Bayesian inference for low rank spatiotemporal neural receptive fields", "book": "Advances in Neural Information Processing Systems", "page_first": 2688, "page_last": 2696, "abstract": "The receptive field (RF) of a sensory neuron describes how the neuron integrates sensory stimuli over time and space. In typical experiments with naturalistic or flickering spatiotemporal stimuli, RFs are very high-dimensional, due to the large number of coefficients needed to specify an integration profile across time and space. Estimating these coefficients from small amounts of data poses a variety of challenging statistical and computational problems. Here we address these challenges by developing Bayesian reduced rank regression methods for RF estimation. This corresponds to modeling the RF as a sum of several space-time separable (i.e., rank-1) filters, which proves accurate even for neurons with strongly oriented space-time RFs. This approach substantially reduces the number of parameters needed to specify the RF, from 1K-100K down to mere 100s in the examples we consider, and confers substantial benefits in statistical power and computational efficiency. In particular, we introduce a novel prior over low-rank RFs using the restriction of a matrix normal prior to the manifold of low-rank matrices. We then use a \u201clocalized\u201d prior over row and column covariances to obtain sparse, smooth, localized estimates of the spatial and temporal RF components. We develop two methods for inference in the resulting hierarchical model: (1) a fully Bayesian method using blocked-Gibbs sampling; and (2) a fast, approximate method that employs alternating coordinate ascent of the conditional marginal likelihood. 
We develop these methods under Gaussian and Poisson noise models, and show that low-rank estimates substantially outperform full rank estimates in accuracy and speed using neural data from retina and V1.", "full_text": "Bayesian inference for low rank spatiotemporal neural receptive fields\n\nMijung Park\n\nElectrical and Computer Engineering\n\nThe University of Texas at Austin\nmjpark@mail.utexas.edu\n\nJonathan W. Pillow\n\nCenter for Perceptual Systems\n\nThe University of Texas at Austin\npillow@mail.utexas.edu\n\nAbstract\n\nThe receptive field (RF) of a sensory neuron describes how the neuron integrates\nsensory stimuli over time and space. In typical experiments with naturalistic or\nflickering spatiotemporal stimuli, RFs are very high-dimensional, due to the large\nnumber of coefficients needed to specify an integration profile across time and\nspace. Estimating these coefficients from small amounts of data poses a variety\nof challenging statistical and computational problems. Here we address these\nchallenges by developing Bayesian reduced rank regression methods for RF estimation.\nThis corresponds to modeling the RF as a sum of space-time separable\n(i.e., rank-1) filters. This approach substantially reduces the number of parameters\nneeded to specify the RF, from 1K-10K down to mere 100s in the examples we\nconsider, and confers substantial benefits in statistical power and computational\nefficiency. We introduce a novel prior over low-rank RFs using the restriction of\na matrix normal prior to the manifold of low-rank matrices, and use \u201clocalized\u201d\nrow and column covariances to obtain sparse, smooth, localized estimates of the\nspatial and temporal RF components. 
We develop two methods for inference in\nthe resulting hierarchical model: (1) a fully Bayesian method using blocked-Gibbs\nsampling; and (2) a fast, approximate method that employs alternating ascent of\nconditional marginal likelihoods. We develop these methods for Gaussian and\nPoisson noise models, and show that low-rank estimates substantially outperform\nfull rank estimates using neural data from retina and V1.\n\n1 Introduction\n\nA neuron\u2019s linear receptive field (RF) is a filter that maps high-dimensional sensory stimuli to a\none-dimensional variable underlying the neuron\u2019s spike rate. In white noise or reverse-correlation\nexperiments, the dimensionality of the RF is determined by the number of stimulus elements in\nthe spatiotemporal window influencing a neuron\u2019s probability of spiking. For a stimulus movie with\n$n_x \\times n_y$ pixels per frame, the RF has $n_x n_y n_t$ coefficients, where $n_t$ is the (experimenter-determined)\nnumber of movie frames in the neuron\u2019s temporal integration window. In typical neurophysiology\nexperiments, this can result in RFs with hundreds to thousands of parameters, meaning we can think\nof the RF as a vector in a very high dimensional space.\nIn high dimensional settings, traditional RF estimators like the whitened spike-triggered average\n(STA) exhibit large errors, particularly with naturalistic or correlated stimuli. A substantial literature\nhas therefore focused on methods for regularizing RF estimates to improve accuracy in the\nface of limited experimental data. The Bayesian approach to regularization involves specifying a\nprior distribution that assigns higher probability to RFs with particular kinds of structure. 
Popular\nmethods have involved priors to impose smallness, sparsity, smoothness, and localized structure in\nRF coefficients [1, 2, 3, 4, 5].\n\nHere we develop a novel regularization method to exploit the fact that neural RFs can be modeled\nas low-rank matrices (or tensors). This approach is justified by the observation that RFs can be\nwell described by summing a small number of space-time separable filters [6, 7, 8, 9]. Moreover,\nit can substantially reduce the number of RF parameters: a rank $p$ receptive field in $n_x n_y n_t$ dimensions\nrequires only $p(n_x n_y + n_t - 1)$ parameters, since a single space-time separable filter has\n$n_x n_y$ spatial coefficients and $n_t - 1$ temporal coefficients (i.e., for a temporal unit vector). When\n$p \\ll \\min(n_x n_y, n_t)$, as commonly occurs in experimental settings, this parametrization yields considerable\nsavings.\nIn the statistics literature, the problem of estimating a low-rank matrix of regression coefficients is\nknown as reduced rank regression [10, 11]. This problem has received considerable attention in\nthe econometrics literature, but Bayesian formulations have tended to focus on non-informative or\nminimally informative priors [12]. Here we formulate a novel prior for reduced rank regression using\na restriction of the matrix normal distribution [13] to the manifold of low-rank matrices. This results\nin a marginally Gaussian prior over RF coefficients, which puts it on equal footing with \u201cridge\u201d,\nAR1, and other Gaussian priors. Moreover, under a linear-Gaussian response model, the posteriors\nover RF rows and columns are conditionally Gaussian, leading to fast and efficient sampling-based\ninference methods. We use a \u201clocalized\u201d form for the row and column covariances in the matrix\nnormal prior, which have hyperparameters governing smoothness and locality of RF components\nin space and time [5]. 
In addition to fully Bayesian sampling-based inference, we develop a fast\napproximate inference method using coordinate ascent of the conditional marginal likelihoods for\ntemporal (column) and spatial (row) hyperparameters. We apply this method under linear-Gaussian\nand linear-nonlinear-Poisson encoding models, and show that the latter gives the best performance\non neural data.\nThe paper is organized as follows. In Sec. 2, we describe the low-rank RF model with localized\npriors. In Sec. 3, we describe a fully Bayesian inference method using blocked-Gibbs sampling\nwith interleaved Metropolis-Hastings steps. In Sec. 4, we introduce a fast method for approximate\ninference using conditional empirical Bayesian hyperparameter estimates. In Sec. 5, we extend our\nestimator to the linear-nonlinear Poisson encoding model. Finally, in Sec. 6, we show applications\nto simulated and real neural datasets from retina and V1.\n\n2 Hierarchical low-rank receptive field model\n\n2.1 Response model (likelihood)\n\nWe begin by defining two probabilistic encoding models that will provide likelihood functions for\nRF inference. Let $y_i$ denote the number of spikes that occur in response to a $(d_t \\times d_x)$ matrix stimulus\n$X_i$, where $d_t$ and $d_x$ denote the number of temporal and spatial elements in the RF, respectively.\nLet $K$ denote the neuron\u2019s $(d_t \\times d_x)$ matrix receptive field.\nWe will consider, first, a linear Gaussian encoding model:\n\n$$y_i \\mid X_i \\sim \\mathcal{N}(x_i^\\top k + b, \\gamma), \\tag{1}$$\n\nwhere $x_i = \\mathrm{vec}(X_i)$ and $k = \\mathrm{vec}(K)$ denote the vectorized stimulus and vectorized RF, respectively,\n$\\gamma$ is the variance of the response noise, and $b$ is a bias term. Second, we will consider a\nlinear-nonlinear-Poisson (LNP) encoding model\n\n$$y_i \\mid X_i \\sim \\mathrm{Poiss}(g(x_i^\\top k + b)), \\tag{2}$$\n\nwhere $g$ denotes the nonlinearity. Examples of $g$ include the exponential and the soft-rectifying function,\n$\\log(\\exp(\\cdot) + 1)$, both of which give rise to a concave log-likelihood [14].\n\n2.2 Prior for low rank receptive field\n\nWe can represent an RF of rank $p$ using the factorization\n\n$$K = K_t K_x^\\top, \\tag{3}$$\n\nwhere the columns of the matrix $K_t \\in \\mathbb{R}^{d_t \\times p}$ contain temporal filters and the columns of the matrix\n$K_x \\in \\mathbb{R}^{d_x \\times p}$ contain spatial filters.\n\nWe define a prior over rank-$p$ matrices using a restriction of the matrix normal distribution\n$\\mathcal{MN}(0, C_x, C_t)$. The prior can be written:\n\n$$p(K \\mid C_t, C_x) = \\frac{1}{Z} \\exp\\left(-\\tfrac{1}{2} \\mathrm{Tr}[C_x^{-1} K^\\top C_t^{-1} K]\\right), \\tag{4}$$\n\nwhere the normalizer $Z$ involves integration over the space of rank-$p$ matrices, which has no known\nclosed-form expression. The prior is controlled by a \u201ccolumn\u201d covariance matrix $C_t \\in \\mathbb{R}^{d_t \\times d_t}$ and\n\u201crow\u201d covariance matrix $C_x \\in \\mathbb{R}^{d_x \\times d_x}$, which govern the temporal and spatial RF components,\nrespectively.\nIf we express $K$ in factorized form (eq. 3), we can rewrite the prior\n\n$$p(K \\mid C_t, C_x) = \\frac{1}{Z} \\exp\\left(-\\tfrac{1}{2} \\mathrm{Tr}\\left[(K_x^\\top C_x^{-1} K_x)(K_t^\\top C_t^{-1} K_t)\\right]\\right). \\tag{5}$$\n\nThis formulation makes it clear that we have conditionally Gaussian priors on $K_t$ and $K_x$, that is:\n\n$$k_t \\mid k_x, C_x, C_t \\sim \\mathcal{N}(0, A_x^{-1} \\otimes C_t), \\qquad k_x \\mid k_t, C_t, C_x \\sim \\mathcal{N}(0, A_t^{-1} \\otimes C_x), \\tag{6}$$\n\nwhere $\\otimes$ denotes Kronecker product, $k_t = \\mathrm{vec}(K_t) \\in \\mathbb{R}^{p d_t \\times 1}$, $k_x = \\mathrm{vec}(K_x) \\in \\mathbb{R}^{p d_x \\times 1}$, and\nwhere we define $A_x = K_x^\\top C_x^{-1} K_x$ and $A_t = K_t^\\top C_t^{-1} K_t$.\nWe define $C_t$ and $C_x$ to have a parametric form controlled by hyperparameters $\\theta_t$ and $\\theta_x$, respectively.\nThis form is adopted from the \u201cautomatic locality determination\u201d (ALD) prior introduced in [5]. 
In\nthe ALD prior, the covariance matrix encodes the tendency for RFs to be localized in both space-time\nand spatiotemporal frequency.\nFor the spatial covariance matrix $C_x$, the hyperparameters are $\\theta_x = \\{\\rho, \\mu_s, \\mu_f, \\Phi_s, \\Phi_f\\}$, where $\\rho$ is\na scalar determining the overall scale of the covariance; $\\mu_s$ and $\\mu_f$ are length-$D$ vectors specifying\nthe center location of the RF support in space and spatial frequency, respectively (where $D$ is the\nnumber of spatial dimensions, e.g., \u201cD=2\u201d for standard 2D visual pixel stimuli). The positive definite\n$D \\times D$ matrices $\\Phi_s$ and $\\Phi_f$ determine the size of the local region of RF support in space and\nspatial frequency, respectively [15]. In the temporal covariance matrix $C_t$, the hyperparameters $\\theta_t$,\nwhich are directly analogous to $\\theta_x$, determine the localized RF structure in time and temporal\nfrequency.\nFinally, we place a zero-mean Gaussian prior on the (scalar) bias term: $b \\sim \\mathcal{N}(0, \\sigma_b^2)$.\n\n3 Posterior inference using Markov Chain Monte Carlo\n\nFor a complete dataset $\\mathcal{D} = \\{X, y\\}$, where $X \\in \\mathbb{R}^{n \\times (d_t d_x)}$ is a design matrix, and $y$ is a vector of\nresponses, our goal is to infer the joint posterior over $K$ and $b$,\n\n$$p(K, b \\mid \\mathcal{D}) \\propto \\iint p(\\mathcal{D} \\mid K, b)\\, p(K \\mid \\theta_t, \\theta_x)\\, p(b \\mid \\sigma_b^2)\\, p(\\theta_t, \\theta_x, \\sigma_b^2)\\, d\\sigma_b^2\\, d\\theta_t\\, d\\theta_x. \\tag{7}$$\n\nWe develop an efficient Markov chain Monte Carlo (MCMC) sampling method using blocked-Gibbs\nsampling. Blocked-Gibbs sampling is possible since the closed-form conditional priors in eq. 6\nand the Gaussian likelihood yield closed-form \u201cconditional marginal likelihoods\u201d for $\\theta_t \\mid (k_x, \\theta_x, \\mathcal{D})$\nand $\\theta_x \\mid (k_t, \\theta_t, \\mathcal{D})$, respectively (footnote 1). The blocked-Gibbs sampler first samples $(\\sigma_b^2, \\theta_t, \\gamma)$ from the conditional\nevidence and simultaneously samples $k_t$ from the conditional posterior. Given the samples\nof $(\\sigma_b^2, \\theta_t, \\gamma, b, k_t)$, we then sample $\\theta_x$ and $k_x$ similarly.\nFor sampling from the conditional evidence, we use the Metropolis-Hastings (MH) algorithm to\nsample the low dimensional space of hyperparameters. For sampling $(b, k_t)$ and $k_x$, we use the\nclosed-form formula (introduced shortly) for the mean of the conditional posterior. The\ndetails of our algorithm are as follows.\n\nFootnote 1: In this section and Sec. 4, we fix the likelihood to Gaussian (eq. 1). An extension to the Poisson likelihood\nmodel (eq. 2) will be described in Sec. 5.\n\nStep 1 Given the $(i-1)$th samples of $(k_x, \\theta_x)$, we draw the $i$th samples of $(b, k_t, \\theta_t, \\sigma_b^2, \\gamma)$ from\n\n$$p(b^{(i)}, k_t^{(i)}, \\theta_t^{(i)}, \\sigma_b^{2(i)}, \\gamma^{(i)} \\mid k_x^{(i-1)}, \\theta_x^{(i-1)}, \\mathcal{D}) = p(\\theta_t^{(i)}, \\sigma_b^{2(i)}, \\gamma^{(i)} \\mid k_x^{(i-1)}, \\theta_x^{(i-1)}, \\mathcal{D})\\, p(b^{(i)}, k_t^{(i)} \\mid \\theta_t^{(i)}, \\sigma_b^{2(i)}, \\gamma^{(i)}, k_x^{(i-1)}, \\theta_x^{(i-1)}, \\mathcal{D}),$$\n\nwhich is divided into two parts (footnote 2):\n\n\u2022 We sample $(\\theta_t, \\sigma_b^2, \\gamma)$ from the conditional posterior given by\n\n$$p(\\theta_t, \\sigma_b^2, \\gamma \\mid k_x, \\theta_x, \\mathcal{D}) \\propto p(\\theta_t, \\sigma_b^2, \\gamma) \\int p(\\mathcal{D} \\mid b, k_t, k_x, \\gamma)\\, p(b, k_t \\mid k_x, \\theta_x, \\theta_t)\\, db\\, dk_t \\propto p(\\theta_t, \\sigma_b^2, \\gamma) \\int \\mathcal{N}(\\mathcal{D} \\mid M'_x w_t, \\gamma I)\\, \\mathcal{N}(w_t \\mid 0, C_{w_t})\\, dw_t, \\tag{8}$$\n\nwhere $w_t = [b \\;\\; k_t^\\top]^\\top$, $M'_x$ is the concatenation of a vector of ones and the matrix\n$M_x$, which is generated by projecting each stimulus $X_i$ onto $K_x$ and then stacking it in\neach row, meaning that the $i$-th row of $M_x$ is $[\\mathrm{vec}(X_i K_x)]^\\top$, and $C_{w_t}$ is a block diagonal\nmatrix whose diagonal blocks are $\\sigma_b^2$ and $A_x^{-1} \\otimes C_t$. Using the standard formula for a product of\ntwo Gaussians, we obtain the closed form conditional evidence:\n\n$$p(\\mathcal{D} \\mid \\theta_t, \\sigma_b^2, \\gamma, k_x, \\theta_x) \\approx \\frac{|2\\pi\\Lambda_t|^{1/2}}{|2\\pi\\gamma I|^{1/2}\\, |2\\pi C_{w_t}|^{1/2}} \\exp\\left[\\tfrac{1}{2} \\mu_t^\\top \\Lambda_t^{-1} \\mu_t - \\tfrac{1}{2\\gamma} y^\\top y\\right], \\tag{9}$$\n\nwhere the mean and covariance of the conditional posterior over $w_t$ given $k_x$ are given by\n\n$$\\mu_t = \\tfrac{1}{\\gamma} \\Lambda_t {M'_x}^\\top y, \\quad \\text{and} \\quad \\Lambda_t = (C_{w_t}^{-1} + \\tfrac{1}{\\gamma} {M'_x}^\\top M'_x)^{-1}. \\tag{10}$$\n\nWe use the MH algorithm to search over the low dimensional hyperparameter space, with\nthe conditional evidence (eq. 9) as the target distribution, under a uniform hyperprior on\n$(\\theta_t, \\sigma_b^2, \\gamma)$.\n\n\u2022 We sample $(b, k_t)$ from the conditional posterior given in eq. 10.\n\nStep 2 Given the $i$th samples of $(b, k_t, \\theta_t, \\sigma_b^2, \\gamma)$, we draw the $i$th samples of $(k_x, \\theta_x)$ from\n\n$$p(k_x^{(i)}, \\theta_x^{(i)} \\mid b^{(i)}, k_t^{(i)}, \\sigma_b^{2(i)}, \\theta_t^{(i)}, \\gamma^{(i)}, \\mathcal{D}) = p(\\theta_x^{(i)} \\mid b^{(i)}, k_t^{(i)}, \\sigma_b^{2(i)}, \\theta_t^{(i)}, \\gamma^{(i)}, \\mathcal{D})\\, p(k_x^{(i)} \\mid \\theta_x^{(i)}, b^{(i)}, k_t^{(i)}, \\sigma_b^{2(i)}, \\theta_t^{(i)}, \\gamma^{(i)}, \\mathcal{D}),$$\n\nwhich is divided into two parts:\n\n\u2022 We sample $\\theta_x$ from the conditional posterior given by\n\n$$p(\\theta_x \\mid b, k_t, \\theta_t, \\sigma_b^2, \\gamma, \\mathcal{D}) \\propto p(\\theta_x) \\int p(\\mathcal{D} \\mid b, k_t, k_x, \\gamma)\\, p(k_x \\mid k_t, \\theta_t, \\theta_x)\\, dk_x \\propto p(\\theta_x) \\int \\mathcal{N}(\\mathcal{D} \\mid M_t k_x + b\\mathbf{1}, \\gamma I)\\, \\mathcal{N}(k_x \\mid 0, A_t^{-1} \\otimes C_x)\\, dk_x, \\tag{11}$$\n\nwhere the matrix $M_t$ is generated by projecting each stimulus $X_i$ onto $K_t$ and then stacking it in\neach row, meaning that the $i$-th row of $M_t$ is $[\\mathrm{vec}(X_i^\\top K_t)]^\\top$. 
Using the standard\nformula for a product of two Gaussians, we obtain the closed form conditional evidence:\n\n$$p(\\mathcal{D} \\mid \\theta_x, k_t, b) = \\frac{|2\\pi\\Lambda_x|^{1/2}}{|2\\pi\\gamma I|^{1/2}\\, |2\\pi(A_t^{-1} \\otimes C_x)|^{1/2}} \\exp\\left[\\tfrac{1}{2} \\mu_x^\\top \\Lambda_x^{-1} \\mu_x - \\tfrac{1}{2\\gamma} (y - b\\mathbf{1})^\\top (y - b\\mathbf{1})\\right],$$\n\nwhere the mean and covariance of the conditional posterior over $k_x$ given $(b, k_t)$ are given by\n\n$$\\mu_x = \\tfrac{1}{\\gamma} \\Lambda_x M_t^\\top (y - b\\mathbf{1}), \\quad \\text{and} \\quad \\Lambda_x = (A_t \\otimes C_x^{-1} + \\tfrac{1}{\\gamma} M_t^\\top M_t)^{-1}. \\tag{12}$$\n\nAs in Step 1, with a uniform hyperprior on $\\theta_x$, the conditional evidence is the target distribution\nin the MH algorithm.\n\n\u2022 We sample $k_x$ from the conditional posterior given in eq. 12.\n\nA summary of this algorithm is given in Algorithm 1.\n\nFootnote 2: We omit the sample index, the superscripts $i$ and $(i-1)$, for notational cleanness.\n\nAlgorithm 1 Fully Bayesian low-rank RF inference using blocked-Gibbs sampling\nGiven data $\\mathcal{D}$, conditioned on samples for other variables, iterate the following:\n\n1. Sample $(b, k_t, \\sigma_b^2, \\theta_t, \\gamma)$ from the conditional evidence for $(\\theta_t, \\sigma_b^2, \\gamma)$ (in eq. 8) and the\nconditional posterior over $(b, k_t)$ (in eq. 10).\n\n2. Sample $(k_x, \\theta_x)$ from the conditional evidence for $\\theta_x$ (in eq. 11) and the conditional\nposterior over $k_x$ (in eq. 12).\n\nUntil convergence.\n\n4 Approximate algorithm for fast posterior inference\n\nHere we develop an alternative, approximate algorithm for fast posterior inference. Instead of integrating\nover hyperparameters, we attempt to find point estimates that maximize the conditional\nmarginal likelihood. This resembles empirical Bayesian inference, where the hyperparameters are\nset by maximizing the full marginal likelihood. 
In our model, the evidence has no closed form; however,\nthe conditional evidence for $(\\theta_t, \\sigma_b^2, \\gamma)$ given $(k_x, \\theta_x)$ and the conditional evidence for $\\theta_x$ given\n$(b, k_t, \\theta_t, \\sigma_b^2, \\gamma)$ are given in closed form (in eq. 8 and eq. 11). Thus, we alternate (1) maximizing the\nconditional evidence to set $(\\theta_t, \\sigma_b^2, \\gamma)$ and finding the MAP estimates of $(b, k_t)$, and (2) maximizing\nthe conditional evidence to set $\\theta_x$ and finding the MAP estimates of $k_x$, that is,\n\n$$\\hat{\\theta}_t, \\hat{\\gamma}, \\hat{\\sigma}_b^2 = \\arg\\max_{\\theta_t, \\sigma_b^2, \\gamma} p(\\mathcal{D} \\mid \\theta_t, \\sigma_b^2, \\gamma, \\hat{k}_x, \\hat{\\theta}_x), \\tag{13}$$\n\n$$\\hat{b}, \\hat{k}_t = \\arg\\max_{b, k_t} p(b, k_t \\mid \\hat{\\theta}_t, \\hat{\\gamma}, \\hat{\\sigma}_b^2, \\hat{k}_x, \\hat{\\theta}_x, \\mathcal{D}), \\tag{14}$$\n\n$$\\hat{\\theta}_x = \\arg\\max_{\\theta_x} p(\\mathcal{D} \\mid \\theta_x, \\hat{b}, \\hat{k}_t, \\hat{\\theta}_t, \\hat{\\gamma}, \\hat{\\sigma}_b^2), \\tag{15}$$\n\n$$\\hat{k}_x = \\arg\\max_{k_x} p(k_x \\mid \\hat{\\theta}_x, \\hat{b}, \\hat{k}_t, \\hat{\\theta}_t, \\hat{\\gamma}, \\hat{\\sigma}_b^2, \\mathcal{D}). \\tag{16}$$\n\nThe approximate algorithm works well if the conditional evidence is tightly concentrated around its\nmaximum. Note that if the hyperparameters are fixed, the iterative updates of $(b, k_t)$ and $k_x$ given\nabove amount to alternating coordinate ascent of the posterior over $(b, K)$.\n\n5 Extension to Poisson likelihood\n\nWhen the likelihood is non-Gaussian, blocked-Gibbs sampling is not tractable, because we do not\nhave a closed form expression for conditional evidence. Here, we introduce a fast, approximate\ninference algorithm for the low-rank RF model under the LNP likelihood. The basic steps are the\nsame as those in the approximate algorithm (Sec. 4). However, we make a Gaussian approximation to\nthe conditional posterior over $(b, k_t)$ given $k_x$ via the Laplace approximation. We then approximate\nthe conditional evidence for $(\\theta_t, \\sigma_b^2)$ given $k_x$ at the posterior mode of $(b, k_t)$ given $k_x$. 
The details\nare as follows.\nThe conditional evidence for $\\theta_t$ given $k_x$ is\n\n$$p(\\mathcal{D} \\mid \\theta_t, \\sigma_b^2, k_x, \\theta_x) \\propto \\int \\mathrm{Poiss}(y \\mid g(M'_x w_t))\\, \\mathcal{N}(w_t \\mid 0, C_{w_t})\\, dw_t. \\tag{17}$$\n\nThe integrand is proportional to the conditional posterior over $w_t$ given $k_x$, which we approximate\nby a Gaussian distribution via the Laplace approximation\n\n$$p(w_t \\mid \\theta_t, \\sigma_b^2, k_x, \\mathcal{D}) \\approx \\mathcal{N}(\\hat{w}_t, \\Sigma_t), \\tag{18}$$\n\nwhere $\\hat{w}_t$ is the conditional MAP estimate of $w_t$ obtained by numerically maximizing the log-conditional\nposterior for $w_t$ (e.g., using Newton\u2019s method; see Appendix A),\n\n$$\\log p(w_t \\mid \\theta_t, \\sigma_b^2, k_x, \\mathcal{D}) = y^\\top \\log(g(M'_x w_t)) - g(M'_x w_t) - \\tfrac{1}{2} w_t^\\top C_{w_t}^{-1} w_t + c, \\tag{19}$$\n\nand $\\Sigma_t$ is the covariance of the conditional posterior obtained by the second derivative of the log-conditional\nposterior around its mode, $\\Sigma_t^{-1} = H_t + C_{w_t}^{-1}$, where the Hessian of the negative log-likelihood\nis denoted by $H_t = -\\frac{\\partial^2}{\\partial w_t^2} \\log p(\\mathcal{D} \\mid w_t, M'_x)$.\n\nFigure 1: Simulated data. Data generated from the linear Gaussian response model with a rank-2 RF\n(16 by 64 pixels: 1024 parameters for full-rank model; 160 for rank-2 model). A. True rank-2 RF\n(left). Estimates obtained by ML, full-rank ALD, low-rank approximate method, and blocked-Gibbs\nsampling, using 250 samples (top), and using 2000 samples (bottom), respectively. B. Average mean\nsquared error of the RF estimate by each method (average over 10 independent repetitions).\n\nUnder the Gaussian posterior (eq. 18), the log conditional evidence (log of eq. 
17) at the posterior\nmode $w_t = \\hat{w}_t$ is simply\n\n$$\\log p(\\mathcal{D} \\mid \\theta_t, \\sigma_b^2, k_x) \\approx \\log p(\\mathcal{D} \\mid \\hat{w}_t, M'_x) - \\tfrac{1}{2} \\hat{w}_t^\\top C_{w_t}^{-1} \\hat{w}_t - \\tfrac{1}{2} \\log |C_{w_t} \\Sigma_t^{-1}|,$$\n\nwhich we maximize to set $\\theta_t$ and $\\sigma_b^2$. Due to space limits, we omit the derivations for the conditional\nposterior for $k_x$ and the conditional evidence for $\\theta_x$ given $(b, k_t)$ (see Appendix B).\n\n6 Results\n\n6.1 Simulations\n\nWe first tested the performance of the blocked-Gibbs sampling and the fast approximate algorithm\non a simulated Gaussian neuron with a rank-2 RF of 16 temporal bins and 64 spatial pixels shown in\nFig. 1A. We compared these methods with the maximum likelihood estimate and the full-rank ALD\nestimate. Fig. 1 shows that the low-rank RF estimates obtained by the blocked-Gibbs sampling\nand the approximate algorithm perform similarly, and achieve lower mean squared error than the\nfull-rank RF estimates.\n\nFigure 2: Simulated data. Data generated from the linear-nonlinear Poisson (LNP) response model\nwith a rank-2 RF (shown in Fig. 1A) and \u201csoftrect\u201d nonlinearity. A. Estimates obtained by ML, full-rank\nALD, low-rank approximate method under the linear Gaussian model, and the methods under\nthe LNP model, using 250 (top) and 2000 (bottom) samples, respectively. B. Average mean squared\nerror of the RF estimate (from 10 independent repetitions). The low-rank RF estimates under the\nLNP model perform better than those under the linear Gaussian model.\n\nWe then tested the performance of the above methods on a simulated linear-nonlinear Poisson (LNP)\nneuron with the same RF and the softrect nonlinearity. We estimated the RF using each method\nunder the linear Gaussian model as well as under the LNP model. Fig. 
2 shows that the low-rank RF\nestimates perform better than full-rank estimates regardless of the model, and that the low-rank RF\nestimates under the LNP model achieved the lowest MSE.\n\nFigure 3: Comparison of low-rank RF estimates for V1 simple cells (using white noise flickering\nbars stimuli [16]). A: Relative likelihood per test stimulus (left) and low-rank RF estimates for\nthree different ranks (right). Relative likelihood is the ratio of the test likelihood of rank-1 STA to\nthat of other estimates. Using 1 minute of training data, the rank-2 RF estimates obtained by the\nblocked-Gibbs sampling and the approximate method achieve the highest test likelihood (estimates\nare shown in the top row), while among the low-rank STA estimates the rank-1 STA achieves the highest\ntest likelihood, since more noise is added to the low-rank STA as the rank increases (estimates are\nshown in the bottom row). Relative likelihood under full rank ALD is 2.25. B: Similar plot for another\nV1 simple cell. The rank-4 estimates obtained by the blocked-Gibbs sampling and the approximate\nmethod achieve the highest test likelihood for this cell. Relative likelihood under full rank ALD is 2.17.\n\n6.2 Application to neural data\n\nWe applied our methods to estimate the RFs of V1 simple cells and retinal ganglion cells (RGCs).\nThe details of data collection are described in [16, 9]. We performed 10-fold cross-validation using\n1 minute of training and 2 minutes of test data. In Fig. 3 and Fig. 4, we show the average test\nlikelihood as a function of RF rank under the linear Gaussian model. 
We also show the low-rank\nRF estimates obtained by our methods as well as the low-rank STA. The low-rank STA (rank-$p$) is\ncomputed as $\\hat{K}_{\\mathrm{STA},p} = \\sum_{i=1}^{p} d_i u_i v_i^\\top$, where $d_i$ is the $i$-th singular value, and $u_i$ and $v_i$ are the $i$-th left\nand right singular vectors, respectively. If the stimulus distribution is non-Gaussian, the low-rank\nSTA will have larger bias than the low-rank ALD estimate.\n\nFigure 4: Comparison of low-rank RF estimates for retinal data (using binary white noise stimuli [9]). The\nRF consists of 10 by 10 spatial pixels and 25 temporal bins (2500 RF coefficients). A: Relative likelihood per\ntest stimulus (left), top three left singular vectors (middle) and right singular vectors (right) of estimated RF\nfor an off-RGC cell. The sampling-based RF estimate benefits from a rank-3 representation, making use\nof three distinct spatial and temporal components, whereas the performance of the low-rank STA degrades\nabove rank 1. Relative likelihood under full rank ALD is 1.0146. B: Similar plot for on-RGC cell. Relative\nlikelihood under full rank ALD is 1.006. Both estimates perform best with rank 1.\n\nFigure 5: RF estimates for a V1 simple cell. (Data from [16]). 
A: RF estimates obtained by ML\n(left) and low-rank blocked-Gibbs sampling under the linear Gaussian model (middle), and low-rank\napproximate algorithm under the LNP model (right), for two different amounts of training data (30\nsec. and 2 min.). The RF consists of 16 temporal and 16 spatial dimensions (256 RF coefficients).\nB: Average prediction (on spike count) error across 10 subsets of available data. The low-rank RF\nestimates under the LNP model achieved the lowest prediction error among all methods. C:\nRuntime of each method. The low-rank approximate algorithms took less than 10 sec., while the\nfull-rank inference methods took 10 to 100 times longer.\n\nFinally, we applied our methods to estimate the RF of a V1 simple cell with four different amounts\nof training data (0.25, 0.5, 1, and 2 minutes) and computed the prediction error of each estimate\nunder the linear Gaussian and the LNP models. In Fig. 5, we show the estimates using 30 sec. and 2\nmin. of training data. We computed the test likelihood of each estimate to set the RF rank and found\nthat the rank-2 RF estimates achieved the highest test likelihood. In terms of the average prediction\nerror, the low-rank RF estimates obtained by our fast approximate algorithm achieved the lowest\nerror, while the runtime of the algorithm was significantly lower than that of full-rank inference methods.\n\n7 Conclusion\n\nWe have described a new hierarchical model for low-rank RFs. We introduced a novel prior for\nlow-rank matrices based on a restricted matrix normal distribution, which has the feature of preserving\na marginally Gaussian prior over the regression coefficients. We used a \u201clocalized\u201d form to\ndefine row and column covariance matrices in the matrix normal prior, which allows the model to\nflexibly learn smooth and sparse structure in RF spatial and temporal components. 
We developed\ntwo inference methods: an exact one based on MCMC with blocked-Gibbs sampling and an approximate\none based on alternating evidence optimization. We applied the model to neural data using\nboth Gaussian and Poisson noise models, and found that the Poisson (or LNP) model performed\nbest despite the increased reliance on approximate inference. Overall, we found that low-rank estimates\nachieved higher prediction accuracy with significantly lower computation time compared to\nfull-rank estimates.\nWe believe our localized, low-rank RF model will be especially useful in high-dimensional settings,\nparticularly in cases where the stimulus covariance matrix does not fit in memory. In future work, we\nwill develop fully Bayesian inference methods for low-rank RFs under the LNP noise model, which\nwill allow us to quantify the accuracy of our approximate method. Secondly, we will examine\nmethods for inferring the RF rank, so that the number of space-time separable components can be\ndetermined automatically from the data.\n\nAcknowledgments\n\nWe thank N. C. Rust and J. A. Movshon for V1 data, and E. J. Chichilnisky, J. Shlens, A. M. Litke,\nand A. Sher for retinal data. This work was supported by a Sloan Research Fellowship, McKnight\nScholar\u2019s Award, and NSF CAREER Award IIS-1150186.\n\nReferences\n[1] F. Theunissen, S. David, N. Singh, A. Hsu, W. Vinje, and J. Gallant. Estimating spatio-temporal receptive\nfields of auditory and visual neurons from their responses to natural stimuli. Network: Computation in\nNeural Systems, 12:289\u2013316, 2001.\n\n[2] D. Smyth, B. Willmore, G. Baker, I. Thompson, and D. Tolhurst. 
The receptive-field organization of\nsimple cells in primary visual cortex of ferrets under natural scene stimulation. Journal of Neuroscience,\n23:4746\u20134759, 2003.\n\n[3] M. Sahani and J. Linden. Evidence optimization techniques for estimating stimulus-response functions.\nNIPS, 15, 2003.\n\n[4] S.V. David and J.L. Gallant. Predicting neuronal responses during natural vision. Network: Computation\nin Neural Systems, 16(2):239\u2013260, 2005.\n\n[5] M. Park and J. W. Pillow. Receptive field inference with localized priors. PLoS Comput Biol,\n7(10):e1002219, 2011.\n\n[6] Jennifer F. Linden, Robert C. Liu, Maneesh Sahani, Christoph E. Schreiner, and Michael M. Merzenich.\nSpectrotemporal structure of receptive fields in areas AI and AAF of mouse auditory cortex. Journal of\nNeurophysiology, 90(4):2660\u20132675, 2003.\n\n[7] Anqi Qiu, Christoph E. Schreiner, and Monty A. Escab\u00ed. Gabor analysis of auditory midbrain receptive\nfields: Spectro-temporal and binaural composition. Journal of Neurophysiology, 90(1):456\u2013476, 2003.\n\n[8] J. W. Pillow and E. P. Simoncelli. Dimensionality reduction in neural models: An information-theoretic\ngeneralization of spike-triggered average and covariance analysis. Journal of Vision, 6(4):414\u2013428,\n2006.\n\n[9] J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli. Spatio-temporal\ncorrelations and visual signaling in a complete neuronal population. Nature, 454:995\u2013999,\n2008.\n\n[10] A.J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis,\n5(2):248\u2013264, 1975.\n\n[11] Gregory C Reinsel and Rajabather Palani Velu. Multivariate reduced-rank regression: theory and applications.\nSpringer New York, 1998.\n\n[12] John Geweke. Bayesian reduced rank regression in econometrics. Journal of Econometrics, 75(1):121\u2013146,\n1996.\n\n[13] A.P. Dawid. 
Some matrix-variate distribution theory: notational considerations and a Bayesian application.\nBiometrika, 68(1):265, 1981.\n\n[14] L. Paninski. Maximum likelihood estimation of cascade point-process neural encoding models. Network:\nComputation in Neural Systems, 15:243\u2013262, 2004.\n\n[15] M. Park and J. W. Pillow. Bayesian active learning with localized priors for fast receptive field characterization.\nIn NIPS, pages 2357\u20132365, 2012.\n\n[16] N. C. Rust, O. Schwartz, J. A. Movshon, and E. P. Simoncelli. Spatiotemporal elements of macaque V1\nreceptive fields. Neuron, 46(6):945\u2013956, 2005.", "award": [], "sourceid": 1259, "authors": [{"given_name": "Mijung", "family_name": "Park", "institution": "University of Texas"}, {"given_name": "Jonathan", "family_name": "Pillow", "institution": "UT Austin"}]}