{"title": "A Tractable Approximation to Optimal Point Process Filtering: Application to Neural Encoding", "book": "Advances in Neural Information Processing Systems", "page_first": 1603, "page_last": 1611, "abstract": "The process of dynamic state estimation (filtering) based on point process observations is in general intractable. Numerical sampling techniques are often practically useful, but lead to limited conceptual insight about optimal encoding/decoding strategies, which are of significant relevance to Computational Neuroscience. We develop an analytically tractable Bayesian approximation to optimal filtering based on point process observations, which allows us to introduce distributional assumptions about sensory cell properties, that greatly facilitates the analysis of optimal encoding in situations deviating from common assumptions of uniform coding. The analytic framework leads to insights which are difficult to obtain from numerical algorithms, and is consistent with experiments about the distribution of tuning curve centers. Interestingly, we find that the information gained from the absence of spikes may be crucial to performance.", "full_text": "A Tractable Approximation to Optimal Point Process\n\nFiltering: Application to Neural Encoding\n\nYuval Harel, Ron Meir\n\nDepartment of Electrical Engineering\n\nTechnion \u2013 Israel Institute of Technology\n\nTechnion City, Haifa, Israel\n\n{yharel@tx,rmeir@ee}.technion.ac.il\n\nManfred Opper\n\nDepartment of Arti\ufb01cial Intelligence\n\nTechnical University Berlin\n\nBerlin 10587, Germany\n\nopperm@cs.tu-berlin.de\n\nAbstract\n\nThe process of dynamic state estimation (\ufb01ltering) based on point process ob-\nservations is in general intractable. Numerical sampling techniques are often\npractically useful, but lead to limited conceptual insight about optimal encod-\ning/decoding strategies, which are of signi\ufb01cant relevance to Computational Neu-\nroscience. 
We develop an analytically tractable Bayesian approximation to opti-\nmal \ufb01ltering based on point process observations, which allows us to introduce\ndistributional assumptions about sensory cell properties, that greatly facilitate the\nanalysis of optimal encoding in situations deviating from common assumptions of\nuniform coding. The analytic framework leads to insights which are dif\ufb01cult to\nobtain from numerical algorithms, and is consistent with experiments about the\ndistribution of tuning curve centers. Interestingly, we \ufb01nd that the information\ngained from the absence of spikes may be crucial to performance.\n\n1\n\nIntroduction\n\nThe task of inferring a hidden dynamic state based on partial noisy observations plays an important\nrole within both applied and natural domains. A widely studied problem is that of online inference\nof the hidden state at a given time based on observations up to to that time, referred to as \ufb01ltering\n[1]. For the linear setting with Gaussian noise and quadratic cost, the solution is well known since\nthe early 1960s both for discrete and continuous times, leading to the celebrated Kalman and the\nKalman-Bucy \ufb01lters [2, 3], respectively. In these cases the exact posterior distribution is Gaussian,\nresulting in closed form recursive update equations for the mean and variance of this distribution,\nimplying \ufb01nite-dimensional \ufb01lters. However, beyond some very speci\ufb01c settings [4], the optimal\n\ufb01lter is in\ufb01nite-dimensional and impossible to compute in closed form, requiring either approximate\nanalytic techniques (e.g., the extended Kalman \ufb01lter (e.g., [1]), the unscented \ufb01lter [5]) or numerical\nprocedures (e.g., particle \ufb01lters [6]). The latter usually require time discretization and a \ufb01nite number\nof particles, resulting in loss of precision . 
For many practical tasks (e.g., queuing [7] and optical\ncommunication [8]) and biologically motivated problems (e.g., [9]) a natural observation process is\ngiven by a point process observer, leading to a nonlinear in\ufb01nite-dimensional optimal \ufb01lter (except\nin speci\ufb01c settings, e.g., \ufb01nite state spaces, [7, 10]).\nWe consider a continuous-state and continuous-time multivariate hidden Markov process observed\nthrough a set of sensory neuron-like elements characterized by multi-dimensional unimodal tuning\nfunctions, representing the elements\u2019 average \ufb01ring rate. The tuning function parameters are char-\nacterized by a distribution allowing much \ufb02exibility. The actual \ufb01ring of each cell is random and is\ngiven by a Poisson process with rate determined by the input and by the cell\u2019s tuning function. In-\nferring the hidden state under such circumstances has been widely studied within the Computational\nNeuroscience literature, mostly for static stimuli. In the more challenging and practically important\ndynamic setting, much work has been devoted to the development of numerical sampling techniques\n\n1\n\n\ffor fast and effective approximation of the posterior distribution (e.g., [11]). In this work we are less\nconcerned with algorithmic issues, and more with establishing closed-form analytic expressions for\nan approximately optimal \ufb01lter (see [10, 12, 13] for previous work in related, but more restrictive\nsettings), and using these to characterize the nature of near-optimal encoders, namely determining\nthe structure of the tuning functions for optimal state inference. A signi\ufb01cant advantage of the closed\nform expressions over purely numerical techniques is the insight and intuition that is gained from\nthem about qualitative aspects of the system. Moreover, the leverage gained by the analytic compu-\ntation contributes to reducing the variance inherent to Monte Carlo approaches. 
Technically, given\nthe intractable in\ufb01nite-dimensional nature of the posterior distribution, we use a projection method\nreplacing the full posterior at each point in time by a projection onto a simple family of distributions\n(Gaussian in our case). This approach, originally developed in the Filtering literature [14, 15], and\ntermed Assumed Density Filtering (ADF), has been successfully used more recently in Machine\nLearning [16, 17]. As far as we are aware, this is the \ufb01rst application of this methodology to point\nprocess \ufb01ltering.\nThe main contributions of the paper are the following: (i) Derivation of closed form recursive expres-\nsions for the continuous time posterior mean and variance within the ADF approximation, allowing\nfor the incorporation of distributional assumptions over sensory variables. (ii) Characterization of\nthe optimal tuning curves (encoders) for sensory cells in a more general setting than hitherto con-\nsidered. Speci\ufb01cally, we study the optimal shift of tuning curve centers, providing an explanation\nfor observed experimental phenomena [18]. (iii) Demonstration that absence of spikes is informa-\ntive, and that, depending on the relationship between the tuning curve distribution and the dynamic\nprocess (the \u2018prior\u2019), may signi\ufb01cantly improve the inference. This issue has not been emphasized\nin previous studies focusing on homogeneous populations.\nWe note that most previous work in the \ufb01eld of neural encoding/decoding has dealt with static\nobservations and was based on the Fisher information, which often leads to misleading qualitative\nresults (e.g., [19, 20]). Our results address the full dynamic setting in continuous time, and provide\nresults for the posterior variance, which is shown to yield an excellent approximation of the posterior\nMean Square Error (MSE). 
Previous work addressing non-uniform distributions over tuning curve parameters [21] used static univariate observations and was based on Fisher information rather than the MSE itself.

2 Problem formulation

2.1 Dense Gaussian neural code

We consider a dynamical system with state Xt ∈ R^n, observed through an observation process N describing the firing patterns of sensory neurons in response to the process X. The state process is a diffusion obeying the Stochastic Differential Equation (SDE)

$$dX_t = A(X_t)\,dt + D(X_t)\,dW_t \qquad (t \ge 0),$$

where A(·), D(·) are arbitrary functions and Wt is standard Brownian motion. The initial condition X0 is assumed to have a continuous distribution with a known density. The observation process N is a marked point process [8] defined on [0,∞) × R^m, meaning that each point, representing the firing of a neuron, is identified by its time t ∈ [0,∞), and a mark θ ∈ R^m. In this work the mark is interpreted as a parameter of the firing neuron, which we refer to as the neuron's preferred stimulus. Specifically, a neuron with parameter θ is taken to have firing rate

$$\lambda(x;\theta) = h \exp\left(-\tfrac{1}{2}\|Hx - \theta\|^2_{\Sigma_{tc}^{-1}}\right)$$

in response to state x, where H ∈ R^{m×n} and Σtc ∈ R^{m×m}, m ≤ n, are fixed matrices, and the notation $\|y\|^2_M$ denotes $y^T M y$. The choice of Gaussian form for λ facilitates analytic tractability. The inclusion of the matrix H allows using high-dimensional models where only some dimensions are observed, for example when the full state includes velocities but only locations are directly observable.
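To make the model concrete, the following sketch simulates a scalar state path from the SDE by Euler-Maruyama and evaluates the Gaussian tuning curve along it. This is our own illustrative code (scalar case, H = 1), not the authors' implementation; function names and parameter values are arbitrary choices.

```python
import numpy as np

def tuning_rate(x, theta, h=10.0, H=1.0, s2_tc=0.2):
    """Gaussian tuning curve (scalar case):
    lambda(x; theta) = h * exp(-(H*x - theta)^2 / (2 * s2_tc))."""
    return h * np.exp(-0.5 * (H * x - theta) ** 2 / s2_tc)

def simulate_state(a=-0.1, d=2.0, x0=0.0, T=10.0, dt=1e-3, seed=0):
    """Euler-Maruyama path of the linear SDE dX = a*X dt + d dW."""
    rng = np.random.default_rng(seed)
    n = int(round(T / dt))
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(n):
        x[k + 1] = x[k] + a * x[k] * dt + d * np.sqrt(dt) * rng.standard_normal()
    return x

path = simulate_state()
rates = tuning_rate(path, theta=0.0)  # instantaneous rate of the theta = 0 neuron
```

Feeding `rates` into an inhomogeneous Poisson sampler would then produce the spike trains that the observation process N formalizes.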
We also define $N_t \triangleq N([0,t) \times \mathbb{R}^m)$, i.e., Nt is the total number of points up to time t, regardless of their location θ, and denote by N_t the sequence of points up to time t — formally, the process N restricted to [0,t) × R^m. Following [8], we use the notation

$$\int_a^b \int_U f(t,\theta)\, N(dt \times d\theta) \;\triangleq\; \sum_i 1\{t_i \in [a,b],\, \theta_i \in U\}\, f(t_i, \theta_i), \qquad (1)$$

for U ⊆ R^m and any function f, where (t_i, θ_i) are respectively the time and mark of the i-th point of the process N.

Consider a network with M sensory neurons, having random preferred stimuli θ = {θ_i}, i = 1, …, M, that are drawn independently from a common distribution with probability density f(θ), which we refer to as the population density. Positing a distribution for the preferred stimuli allows us to obtain simple closed form solutions, and to optimize over distribution parameters rather than over the higher-dimensional space of all the θ_i. The total rate of spikes with preferred stimuli in a set A ⊂ R^m, given Xt = x, is then $\lambda_A(x;\theta) = h \sum_i 1\{\theta_i \in A\} \exp(-\tfrac{1}{2}\|Hx - \theta_i\|^2_{\Sigma_{tc}^{-1}})$. Averaging over f(θ), we have the expected rate $\lambda_A(x) \triangleq E\,\lambda_A(x;\theta) = hM \int_A f(\theta) \exp(-\tfrac{1}{2}\|Hx - \theta\|^2_{\Sigma_{tc}^{-1}})\, d\theta$. We now obtain an infinite neural network by considering the limit M → ∞ while holding λ0 = hM fixed.
In the limit we have λ_A(x; θ) → λ_A(x), so that the process N has density

$$\lambda_t(\theta, X_t) = \lambda_0 f(\theta) \exp\left(-\tfrac{1}{2}\|HX_t - \theta\|^2_{\Sigma_{tc}^{-1}}\right), \qquad (2)$$

meaning that the expected number of points in a small rectangle $[t, t+dt] \times \prod_i [\theta_i, \theta_i + d\theta_i]$, conditioned on the history X_[0,t], N_t, is $\lambda_t(\theta, X_t)\, dt \prod_i d\theta_i + o(dt, |d\theta|)$. A finite network can be obtained as a special case by taking f to be a sum of delta functions.

For analytic tractability, we assume that f(θ) is Gaussian with center c and covariance Σpop, namely f(θ) = N(θ; c, Σpop). We refer to c as the population center. Previous work [22, 20, 23] considered the case where neurons' preferred stimuli uniformly cover the space, obtained by removing the factor f(θ) from (2). Then, the total firing rate $\int \lambda_t(\theta, x)\, d\theta$ is independent of x, which simplifies the analysis, and leads to a Gaussian posterior (see [22]). We refer to the assumption that $\int \lambda_t(\theta, x)\, d\theta$ is independent of x as uniform coding. The uniform coding case may be obtained from our model by taking the limit $\Sigma_{pop}^{-1} \to 0$ with $\lambda_0/\sqrt{\det \Sigma_{pop}}$ held constant.

2.2 Optimal encoding and decoding

We consider the question of optimal encoding and decoding under the above model. The process of neural decoding is assumed to compute (exactly or approximately) the full posterior distribution of Xt given N_t.
The problem of neural encoding is then to choose the parameters φ = (c, Σpop, Σtc), which govern the statistics of the observation process N, given a specific decoding scheme.

To quantify the performance of the encoding-decoding system, we summarize the result of decoding using a single estimator $\hat X_t = \hat X_t(N_t)$, and define the Mean Square Error (MSE) as $\epsilon_t \triangleq \mathrm{trace}[(X_t - \hat X_t)(X_t - \hat X_t)^T]$. We seek $\hat X_t$ and φ that solve

$$\min_\phi \lim_{t\to\infty} \min_{\hat X_t} E[\epsilon_t] = \min_\phi \lim_{t\to\infty} E\big[\min_{\hat X_t} E[\epsilon_t \,|\, N_t]\big].$$

The inner minimization problem in this equation is solved by the MSE-optimal decoder, which is the posterior mean $\hat X_t = \mu_t \triangleq E[X_t | N_t]$. The posterior mean may be computed from the full posterior obtained by decoding. The outer minimization problem is solved by the optimal encoder. In principle, the encoding/decoding problem can be solved for any value of t. In order to assess performance it is convenient to consider the steady-state limit t → ∞ for the encoding problem.

Below, we find a closed form approximate solution to the decoding problem for any t using ADF. We then explore the problem of choosing the steady-state optimal encoding parameters φ using Monte Carlo simulations. Note that if decoding is exact, the problem of optimal encoding becomes that of minimizing the expected posterior variance.

3 Neural decoding

3.1 Exact filtering equations

Let P(·, t) denote the posterior density of Xt given N_t, and $E^t_P[\cdot]$ the posterior expectation given N_t. The prior density P(·, 0) is assumed to be known. The problem of filtering a diffusion process X from a doubly stochastic Poisson process driven by X is formally solved in [24]. The result is extended to marked point processes in [22], where the authors derive a stochastic PDE for the posterior density¹,

$$dP(x,t) = L^* P(x,t)\,dt + P(x,t) \int_{\theta\in\mathbb{R}^m} \frac{\lambda_t(\theta,x) - \hat\lambda_t(\theta)}{\hat\lambda_t(\theta)} \left[ N(dt\times d\theta) - \hat\lambda_t(\theta)\, d\theta\, dt \right], \qquad (3)$$

where the integral with respect to N is interpreted as in (1), L is the state's infinitesimal generator (Kolmogorov's backward operator), defined as $Lf(x) = \lim_{\Delta t\to 0^+} \left( E[f(X_{t+\Delta t}) | X_t = x] - f(x) \right)/\Delta t$, $L^*$ is L's adjoint operator (Kolmogorov's forward operator), and $\hat\lambda_t(\theta) \triangleq E^t_P[\lambda_t(\theta, X_t)] = \int P(x,t)\, \lambda_t(\theta, x)\, dx$.

The stochastic PDE (3) is usually intractable. In [22, 23] the authors consider linear dynamics with uniform coding and Gaussian prior. In this case, the posterior is Gaussian, and (3) leads to closed form ODEs for its moments. When the uniform coding assumption is violated, the posterior is no longer Gaussian. Still, we can obtain exact equations for the posterior moments, as follows. Let $\mu_t = E^t_P[X_t]$, $\tilde X_t = X_t - \mu_t$, $\Sigma_t = E^t_P[\tilde X_t \tilde X_t^T]$. Using (3), and the known results for L for diffusion processes (see supplementary material), the first two posterior moments can be shown to obey the following equations between spikes (see [23] for the finite population case):

$$\frac{d\mu_t}{dt} = E^t_P[A(X_t)] + E^t_P\left[\tilde X_t \int \left(\hat\lambda_t(\theta) - \lambda_t(\theta, X_t)\right) d\theta\right],$$
$$\frac{d\Sigma_t}{dt} = E^t_P\left[A(X_t)\tilde X_t^T\right] + E^t_P\left[\tilde X_t A(X_t)^T\right] + E^t_P\left[D(X_t)D(X_t)^T\right] + E^t_P\left[\tilde X_t \tilde X_t^T \int \left(\hat\lambda_t(\theta) - \lambda_t(\theta, X_t)\right) d\theta\right]. \qquad (4)$$

3.2 ADF approximation

While equations (4) are exact, they are not practical, since they require computation of $E^t_P[\cdot]$. We now proceed to find an approximate closed form for (4). Here we present the main ideas of the derivation. The formulation presented here assumes, for simplicity, an open-loop setting where the system is passively observed. It can be readily extended to a closed-loop control-based setting, and is presented in this more general framework in the supplementary material, including full details.

To bring (4) to a closed form, we use ADF with an assumed Gaussian density (see [16] for details). Conceptually, this may be envisioned as integrating (4) while replacing the distribution P by its approximating Gaussian "at each time step". Assuming the moments are known exactly, the Gaussian is obtained by matching the first two moments of P [16].
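The moment-matching step can be illustrated numerically. The sketch below is our own construction, not the paper's derivation: it performs one assumed-density step for a "no spikes observed during dt" likelihood in the static scalar case, where the exact posterior is non-Gaussian, and replaces it by the Gaussian matching its first two moments, computed here by brute force on a grid. All parameter values are illustrative.

```python
import numpy as np

def adf_project(prior_mean, prior_var, dt=0.5, lam0=10.0, c=0.0, s2_eff=0.7):
    """One assumed-density step for the observation 'no spikes in [t, t+dt]'
    (static scalar state). The likelihood is exp(-dt * lam_tot(x)), with a
    Gaussian-shaped total rate lam_tot(x) = lam0 * exp(-(x - c)^2 / (2*s2_eff)),
    s2_eff standing in for the combined tuning-curve/population width.
    The exact posterior is non-Gaussian; we project it back onto a Gaussian
    by matching its first two moments on a dense grid."""
    xs = np.linspace(prior_mean - 10.0, prior_mean + 10.0, 20001)
    lam_tot = lam0 * np.exp(-0.5 * (xs - c) ** 2 / s2_eff)
    log_post = -0.5 * (xs - prior_mean) ** 2 / prior_var - dt * lam_tot
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    mean = np.sum(w * xs)                # matched first moment
    var = np.sum(w * (xs - mean) ** 2)   # matched second (central) moment
    return mean, var
```

Starting at the population center, the absence of spikes leaves the matched mean in place but inflates the matched variance, while a prior mean away from c is pushed further away: the qualitative effects that the closed-form equations derived next capture analytically.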
Note that the solution of the resulting equations does not in general match the first two moments of the exact solution, though it may approximate it.

Abusing notation, in the sequel we use µt, Σt to refer to the ADF approximation rather than to the exact values. Substituting the normal distribution N(x; µt, Σt) for P(x, t) to compute the expectations involving λt in (4), and using (2) and the Gaussian form of f(θ), results in computable Gaussian integrals. Other terms may also be computed in closed form if the functions A, D can be expanded as power series. This computation yields approximate equations for µt, Σt between spikes. The updates at spike times can similarly be computed in closed form either from (3) or directly from a Bayesian update of the posterior (see supplementary material, or e.g., [13]).

¹The model considered in [22] assumes linear dynamics and uniform coding, meaning that the total rate of N_t, namely $\int_\theta \lambda_t(\theta, X_t)\, d\theta$, is independent of Xt. However, these assumptions are only used to establish other propositions in that paper. The proof of equation (3) still holds as is in our more general setting.

Figure 1: Left Changes to the posterior moments between spikes as a function of the current posterior mean estimate, for a static 1-d state. The parameters are a = d = 0, H = 1, σ²pop = 1, σ²tc = 0.2, c = 0, λ0 = 10, σt = 1. The bottom plot shows the density of preferred stimuli f(θ) and the tuning curve for a neuron with preferred stimulus θ = 0. Right An example of filtering a linear one-dimensional process. Each dot corresponds to a spike, with the vertical location indicating the preferred stimulus θ. The curves to the right of the graph show the preferred stimulus density (black), and a tuning curve centered at θ = 0 (gray).
The tuning curve and preferred stimulus density are normalized to the same height for visualization. The bottom graph shows the posterior variance, with the vertical lines showing spike times. Parameters are: a = −0.1, d = 2, H = 1, σ²pop = 2, σ²tc = 0.2, c = 0, λ0 = 10, µ0 = 0, σ²0 = 1. Note the decrease of the posterior variance following t = 4 even though no spikes are observed.

For simplicity, we assume that the dynamics are linear, dXt = AXt dt + D dWt, resulting in the filtering equations

$$d\mu_t = A\mu_t\,dt + g_t \Sigma_t H^T S_t (H\mu_t - c)\,dt + \int_{\theta\in\mathbb{R}^m} \Sigma_{t^-} H^T S^{tc}_{t^-} (\theta - H\mu_{t^-})\, N(dt\times d\theta), \qquad (5)$$

$$d\Sigma_t = \left(A\Sigma_t + \Sigma_t A^T + DD^T\right) dt + g_t \Sigma_t H^T \left[ S_t - S_t (H\mu_t - c)(H\mu_t - c)^T S_t \right] H\Sigma_t\,dt - \Sigma_{t^-} H^T S^{tc}_{t^-} H \Sigma_{t^-}\, dN_t, \qquad (6)$$

where $S^{tc}_t \triangleq (\Sigma_{tc} + H\Sigma_t H^T)^{-1}$, $S_t \triangleq (\Sigma_{tc} + \Sigma_{pop} + H\Sigma_t H^T)^{-1}$, and

$$g_t \triangleq \int \hat\lambda(\theta)\, d\theta = \int E^t_P[\lambda(\theta, X_t)]\, d\theta = \lambda_0 \sqrt{\det(\Sigma_{tc} S_t)} \exp\left(-\tfrac{1}{2}\|H\mu_t - c\|^2_{S_t}\right)$$

is the posterior expected total firing rate. Expressions including t− are to be interpreted as left limits, $f(t^-) = \lim_{s\to t^-} f(s)$, which are necessary since the solution is discontinuous at spike times.

The last term in (5) is to be interpreted as in (1). It contributes an instantaneous jump in µt at the time of a spike with preferred stimulus θ, moving Hµt closer to θ. Similarly, the last term in (6) contributes an instantaneous jump in Σt at each spike time, which is the same regardless of spike location.
All other terms describe the evolution of the posterior between spikes: the first few terms in (5)-(6) are the same as in the dynamics of the prior, as in [13, 23], whereas the terms involving gt correspond to information from the absence of spikes. Note that the latter scale with gt, the expected total firing rate, i.e., lack of spikes becomes "more informative" the higher the expected rate of spikes.

It is illustrative to consider these equations in the scalar case m = n = 1, with H = 1. Letting σ²t = Σt, σ²tc = Σtc, σ²pop = Σpop, a = A, d = D yields

$$d\mu_t = a\mu_t\,dt + g_t \frac{\sigma_t^2}{\sigma_t^2 + \sigma_{tc}^2 + \sigma_{pop}^2} (\mu_t - c)\,dt + \int_{\theta\in\mathbb{R}} \frac{\sigma_{t^-}^2}{\sigma_{t^-}^2 + \sigma_{tc}^2} (\theta - \mu_{t^-})\, N(dt\times d\theta), \qquad (7)$$

$$d\sigma_t^2 = \left[ 2a\sigma_t^2 + d^2 + g_t \frac{(\sigma_t^2)^2}{\sigma_t^2 + \sigma_{tc}^2 + \sigma_{pop}^2} \left( 1 - \frac{(\mu_t - c)^2}{\sigma_t^2 + \sigma_{tc}^2 + \sigma_{pop}^2} \right) \right] dt - \frac{(\sigma_{t^-}^2)^2}{\sigma_{t^-}^2 + \sigma_{tc}^2}\, dN_t, \qquad (8)$$

where $g_t = \lambda_0 \sqrt{2\pi\sigma_{tc}^2}\; \mathcal{N}\!\left(\mu_t;\, c,\, \sigma_t^2 + \sigma_{tc}^2 + \sigma_{pop}^2\right)$.

Figure 1 (left) shows how µt, σ²t change between spikes for a static 1-dimensional state (a = d = 0). In this case, all terms in the filtering equations drop out except those involving gt. The term involving gt in dµt pushes µt away from c in the absence of spikes. This effect weakens as |µt − c| grows due to the factor gt, consistent with the idea that far from c, the lack of spikes is less surprising, hence less informative. The term involving gt in dσ²t increases the variance when µt is near c, and otherwise decreases it.

Figure 2: Left Illustration of information gained between spikes. A static state Xt = 0.5, shown in a dotted line, is observed and filtered twice: with the correct value σ²pop = 0.5 ("ADF", solid blue line), and with σ²pop = ∞ ("Uniform coding filter", dashed line). The curves to the right of the graph show the preferred stimulus density (black), and a tuning curve centered at θ = 0 (gray). Both filters are initialized with µ0 = 0, σ²0 = 1. Right Comparison of MSE for the ADF filter and the uniform coding filter. The vertical axis shows the square error integrated over the time interval [5, 10], averaged over 1000 trials. Shaded areas indicate estimated errors, computed as the sample standard deviation divided by the square root of the number of trials. Parameters in both plots are a = d = 0, c = 0, σ²pop = 0.5, σ²tc = 0.1, H = 1, λ0 = 10.

3.3 Information from lack of spikes

An interesting aspect of the filtering equations (5)-(6) is that the dynamics of the posterior density between spikes differ from the prior dynamics. This is in contrast to previous models which assumed uniform coding: the (exact) filtering equations appearing in [22] and [23] have the same form as (5)-(6), except that they do not include the correction terms involving gt, so that between spikes the dynamics are identical to the prior dynamics.
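For readers who wish to experiment, here is a minimal Euler discretization of the scalar equations (7)-(8), including the jump updates at spike times. The parameter defaults mirror the Figure 2 setting (a = d = 0, c = 0, σ²pop = 0.5, σ²tc = 0.1, λ0 = 10); the code itself is our own sketch rather than the authors' implementation.

```python
import numpy as np

def adf_filter_step(mu, var, spikes, dt, a=0.0, d=0.0, c=0.0,
                    lam0=10.0, s2_tc=0.1, s2_pop=0.5):
    """One Euler step of the scalar ADF filtering equations (7)-(8).
    `spikes` is the list of marks (preferred stimuli) observed in this bin."""
    s2_tot = var + s2_tc + s2_pop
    # posterior expected total firing rate g_t
    g = lam0 * np.sqrt(s2_tc / s2_tot) * np.exp(-0.5 * (mu - c) ** 2 / s2_tot)
    # continuous part: prior dynamics plus absence-of-spikes corrections
    mu_new = mu + (a * mu + g * var / s2_tot * (mu - c)) * dt
    var_new = var + (2 * a * var + d ** 2
                     + g * var ** 2 / s2_tot * (1 - (mu - c) ** 2 / s2_tot)) * dt
    # jump updates at spike times: move mu toward the mark, shrink the variance
    for theta in spikes:
        gain = var_new / (var_new + s2_tc)
        mu_new = mu_new + gain * (theta - mu_new)
        var_new = var_new - gain * var_new
    return mu_new, var_new
```

Iterating this step over a simulated spike train reproduces the qualitative behavior described above: between spikes the estimate drifts away from c and the variance grows near c, while each spike pulls the estimate toward its mark and shrinks the variance.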
This reflects the fact that lack of spikes in a time interval is an indication that the total firing rate is low; in the uniform coding case, this is not informative, since the total firing rate is independent of the state.

Figure 2 (left) illustrates the information gained from lack of spikes. A static scalar state is observed by a process with rate (2), and filtered twice: once with the correct value of σpop, and once with σpop → ∞, as in the uniform coding filter of [23]. Between spikes, the ADF estimate moves away from the population center c = 0, whereas the uniform coding estimate remains fixed. The size of this effect decreases with time, as the posterior variance estimate (not shown) decreases. The reduction in filtering errors gained from the additional terms in (5)-(6) is illustrated in Figure 2 (right). Despite the approximation involved, the full filter significantly outperforms the uniform coding filter. The difference disappears as σpop increases and the population becomes uniform.

Special cases To gain additional insight into the filtering equations, we consider their behavior in several limits. (i) As σ²pop → ∞, spikes become rare as the density f(θ) approaches 0 for any θ. The total expected rate of spikes gt also approaches 0, and the terms corresponding to information from lack of spikes vanish. Other terms in the equations are unaffected. (ii) In the limit σ²tc → ∞, each neuron fires as a Poisson process with a constant rate independent of the observed state. The total expected firing rate gt saturates at its maximum, λ0. Therefore the preferred stimuli of spiking neurons provide no information, nor does the presence or absence of spikes.
Accordingly, all terms other than those related to the prior dynamics vanish. (iii) The uniform coding case [22, 23] is obtained as a special case in the limit σ²pop → ∞ with λ0/σpop held constant. In this limit the terms involving gt drop out, recovering the (exact) filtering equations in [22].

4 Optimal neural encoding

We model the problem of optimal neural encoding as choosing the parameters c, Σpop, Σtc of the population and tuning curves, so as to minimize the steady-state MSE. As noted above, when the estimate is exactly the posterior mean, this is equivalent to minimizing the steady-state expected posterior variance. The posterior variance has the advantage of being less noisy than the square error itself, since by definition it is the mean of the square error (of the posterior mean) under conditioning by N_t. We explore the question of optimal neural encoding by measuring the steady-state variance through Monte Carlo simulations of the system dynamics and the filtering equations (5)-(6). Since the posterior mean and variance computed by ADF are approximate, we verified numerically that the variance closely matches the MSE in the steady state when averaged across many trials (see supplementary material), suggesting that asymptotically the error in estimating µt and Σt is small.

4.1 Optimal population center

We now consider the question of the optimal value for the population center c. Intuitively, if the prior distribution of the process X is unimodal with mode x0, the optimal population center is at Hx0, to produce the most spikes. On the other hand, the terms involving gt in the filtering equations (5)-(6) suggest that the lack of spikes is also informative.
Moreover, as seen in Figure 1 (left), the posterior variance is reduced between spikes only when the current estimate is far enough from c. These considerations suggest that there is a trade-off between maximizing the frequency of spikes and maximizing the information obtained from lack of spikes, yielding an optimal value for c that differs from Hx0.

We simulated a simple one-dimensional process to determine the optimal value of c which minimizes the approximate posterior variance Σt. Figure 3 (left) shows the posterior variance for varying values of the population center c and base firing rate λ0. For each firing rate, we note the value of c minimizing the posterior variance (the optimal population center), as well as the value of $c_m = \mathrm{argmin}_c \left( d\sigma_t^2/dt \,\big|_{\mu_t = 0} \right)$, which maximizes the reduction in the posterior variance when the current state estimate µt is at the process equilibrium x0 = 0. Consistent with the discussion above, the optimal value lies between 0 (where spikes are most abundant) and cm (where lack of spikes is most informative). As could be expected, the optimal center is closer to 0 the higher the base firing rate. Similarly, wide tuning curves, which render the spikes less informative, lead to an optimal center farther from 0 (Figure 3, right).

A shift of the population center relative to the prior mode has been observed physiologically in encoding of inter-aural time differences for localization of sound sources [25]. In [18], this phenomenon was explained in a finite population model based on maximization of Fisher information. This is in contrast to the results of [21], which consider a heterogeneous population where the tuning curve width scales roughly inversely with neuron density. In this case, the population density maximizing the Fisher information is shown to be monotonic with the prior, i.e., more neurons should be assigned to more probable states.
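The quantity cm above can be evaluated directly from the drift term of (8). The sketch below is our own: it uses the parameter values quoted for Figure 3 (a = −1, d = 0.5, σ²pop = 0.1, σ²tc = 0.01, λ0 = 50), evaluates the drift at an illustrative stand-in for the posterior variance (the prior steady-state variance d²/2|a| = 0.125), and sweeps c to locate the minimizer of dσ²t/dt at µt = 0.

```python
import numpy as np

def var_drift(c, mu=0.0, var=0.125, a=-1.0, d=0.5,
              lam0=50.0, s2_tc=0.01, s2_pop=0.1):
    """dt-part of the scalar variance equation (8), evaluated at mu.
    var defaults to the prior steady-state variance d^2/(2|a|) = 0.125,
    an illustrative stand-in for the posterior variance."""
    s2_tot = var + s2_tc + s2_pop
    g = lam0 * np.sqrt(s2_tc / s2_tot) * np.exp(-0.5 * (mu - c) ** 2 / s2_tot)
    return (2 * a * var + d ** 2
            + g * var ** 2 / s2_tot * (1 - (mu - c) ** 2 / s2_tot))

cs = np.linspace(0.0, 2.0, 401)
c_m = cs[np.argmin([var_drift(c) for c in cs])]  # most negative drift at mu = 0
```

With these values the minimizer comes out well away from the prior mode at 0; per the trade-off above, the variance-minimizing population center then lies between 0 and cm.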
This apparent discrepancy may be due to the scaling of tuning\ncurve widths in [21], which produces roughly constant total \ufb01ring rate, i.e., uniform coding. This\ndemonstrates that a non-constant total \ufb01ring rate, which renders lack of spikes informative, may be\nnecessary to explain the physiologically observed shift phenomenon.\n\n4.2 Optimization of population distribution\n\nNext, we consider the optimization of the population distribution, namely, the simultaneous opti-\nmization of the population center c and the population variance \u03a3pop in the case of a static scalar\nstate. Previous work using a \ufb01nite neuron population and a Fisher information-based criterion [18]\nhas shown that the optimal distribution of preferred stimuli depends on the prior variance. When\nit is small relative to the tuning curve width, optimal encoding is achieved by placing all preferred\nstimuli at a \ufb01xed distance from the prior mean. On the other hand, when the prior variance is large\nrelative to the tuning curve width, optimal encoding is uniform (see \ufb01gure 2 in [18]).\nSimilar results are obtained with our model, as shown in Figure 4. Here, a static scalar state drawn\nfrom N (0, \u03c32\np) is \ufb01ltered by a population with tuning curve width \u03c3tc = 1 and preferred stimulus\ndensity N (c, \u03c32\npop). In Figure 4 (left), the prior distribution is narrow relative to the tuning curve\nwidth, leading to an optimal population with a narrow population distribution far from the origin. In\n\n7\n\n\fFigure 3: Optimal population center location for \ufb01ltering a linear one-dimensional process. Both\ngraphs show the ratio of posterior standard deviation to the prior steady-state standard devia-\ntion of the process, along with the value of c minimizing the posterior variance (blue line), and\nminimizing the reduction of posterior variance when \u00b5t = 0 (yellow line). The process is ini-\ntialized from its steady-state distribution. 
The posterior variance is estimated by averaging over the time interval [5, 10] and across 1000 trials for each data point. Parameters for both graphs: a = −1, d = 0.5, σ²_pop = 0.1. In the graph on the left, σ²_tc = 0.01; on the right, λ0 = 50.

Figure 4: The optimal population distribution depends on the prior variance relative to the tuning curve width. A static scalar state drawn from N(0, σ²_p) is filtered with tuning curve width σ_tc = 1 and preferred stimulus density N(c, σ²_pop). Both graphs show the posterior standard deviation relative to the prior standard deviation σ_p. In the left graph, the prior distribution is narrow, σ²_p = 0.1, whereas on the right it is wide, σ²_p = 10. In both cases the filter is initialized with the correct prior, and the squared error is averaged over the time interval [5, 10] and across 100 trials for each data point.

Figure 4 (right), the prior is wide relative to the tuning curve width, leading to an optimal population with variance that roughly matches the prior variance. When both the tuning curves and the population density are narrow relative to the prior, so that spikes are rare (low values of σ_pop in Figure 4 (right)), the ADF approximation becomes poor, resulting in MSEs larger than the prior variance.

5 Conclusions

We have introduced an analytically tractable Bayesian approximation to point process filtering, allowing us to gain insight into the generally intractable infinite-dimensional filtering problem. The approach enables the derivation of near-optimal encoding schemes going beyond previously studied uniform coding assumptions. The framework is presented in continuous time, circumventing temporal discretization errors and numerical imprecision in sampling-based methods, applies to fully dynamic setups, and directly estimates the MSE rather than lower bounds to it.
It successfully explains observed experimental results, and opens the door to many future predictions. Future work will include the development of previously successful mean-field approaches [13] within our more general framework, leading to further analytic insight. Moreover, the proposed strategy may lead to practically useful decoding of spike trains.

References

[1] B.D. Anderson and J.B. Moore. Optimal Filtering. Dover, 2005.
[2] R.E. Kalman and R.S. Bucy. New results in linear filtering and prediction theory. J. of Basic Eng., Trans. ASME, Series D, 83(1):95–108, 1961.
[3] R.E. Kalman. A new approach to linear filtering and prediction problems. J. Basic Eng., Trans. ASME, Series D, 82(1):35–45, 1960.
[4] F. Daum. Nonlinear filters: beyond the Kalman filter. Aerospace and Electronic Systems Magazine, IEEE, 20(8):57–69, 2005.
[5] S. Julier, J. Uhlmann, and H. Durrant-Whyte. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Trans. Autom. Control, 45(3):477–482, 2000.
[6] A. Doucet and A.M. Johansen. A tutorial on particle filtering and smoothing: fifteen years later. In D. Crisan and B. Rozovskii, editors, Handbook of Nonlinear Filtering, pages 656–704.
Oxford, UK: Oxford University Press, 2009.
[7] P. Brémaud. Point Processes and Queues: Martingale Dynamics. Springer, New York, 1981.
[8] D.L. Snyder and M.I. Miller. Random Point Processes in Time and Space. Springer, second edition, 1991.
[9] P. Dayan and L.F. Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, 2005.
[10] O. Bobrowski, R. Meir, and Y.C. Eldar. Bayesian filtering in spiking neural networks: noise, adaptation, and multisensory integration. Neural Comput, 21(5):1277–1320, May 2009.
[11] Y. Ahmadian, J.W. Pillow, and L. Paninski. Efficient Markov chain Monte Carlo methods for decoding neural spike trains. Neural Comput, 23(1):46–96, Jan 2011.
[12] A.K. Susemihl, R. Meir, and M. Opper. Analytical results for the error in filtering of Gaussian processes. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2303–2311. 2011.
[13] A.K. Susemihl, R. Meir, and M. Opper. Dynamic state estimation based on Poisson spike trains—towards a theory of optimal encoding. Journal of Statistical Mechanics: Theory and Experiment, 2013(03):P03009, 2013.
[14] P.S. Maybeck. Stochastic Models, Estimation, and Control. Academic Press, 1979.
[15] D. Brigo, B. Hanzon, and F. LeGland. A differential geometric approach to nonlinear filtering: the projection filter. Automatic Control, IEEE Transactions on, 43:247–252, 1998.
[16] M. Opper. A Bayesian approach to online learning. In D. Saad, editor, Online Learning in Neural Networks, pages 363–378. Cambridge University Press, 1998.
[17] T.P. Minka. Expectation propagation for approximate Bayesian inference.
In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
[18] N.S. Harper and D. McAlpine. Optimal neural population coding of an auditory spatial cue. Nature, 430(7000):682–686, Aug 2004.
[19] M. Bethge, D. Rotermund, and K. Pawelzik. Optimal short-term population coding: when Fisher information fails. Neural Comput, 14(10):2317–2351, Oct 2002.
[20] S. Yaeli and R. Meir. Error-based analysis of optimal tuning functions explains phenomena observed in sensory neurons. Front Comput Neurosci, 4:130, 2010.
[21] D. Ganguli and E.P. Simoncelli. Efficient sensory encoding and Bayesian inference with heterogeneous neural populations. Neural Comput, 26(10):2103–2134, 2014.
[22] I. Rhodes and D. Snyder. Estimation and control performance for space-time point-process observations. IEEE Transactions on Automatic Control, 22(3):338–346, 1977.
[23] A.K. Susemihl, R. Meir, and M. Opper. Optimal neural codes for control and estimation. Advances in Neural Information Processing Systems, pages 1–9, 2014.
[24] D. Snyder. Filtering and detection for doubly stochastic Poisson processes. IEEE Transactions on Information Theory, 18(1):91–102, January 1972.
[25] A. Brand, O. Behrend, T. Marquardt, D. McAlpine, and B. Grothe. Precise inhibition is essential for microsecond interaural time difference coding. Nature, 417(6888):543–547, 2002.