{"title": "Short-term memory in neuronal networks through dynamical compressed sensing", "book": "Advances in Neural Information Processing Systems", "page_first": 667, "page_last": 675, "abstract": "Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However, a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of \u201corthogonal\u201d recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. 
Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance.", "full_text": "Short-term memory in neuronal networks through dynamical compressed sensing\n\nSurya Ganguli\nSloan-Swartz Center for Theoretical Neurobiology, UCSF, San Francisco, CA 94143\nsurya@phy.ucsf.edu\n\nHaim Sompolinsky\nInterdisciplinary Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel\nand Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138, USA\nhaim@fiz.huji.ac.il\n\nAbstract\n\nRecent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However, a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of \u201corthogonal\u201d recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. 
Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance.\n\n1 Introduction\n\nHow neuronal networks can store a memory trace for recent sequences of stimuli is a central question in theoretical neuroscience. The influential idea of attractor dynamics [1] suggests how single stimuli can be stored as stable patterns of activity, or fixed point attractors, in the dynamics of recurrent networks. But such simple fixed points are incapable of storing sequences. More recent proposals [2, 3, 4] suggest that recurrent networks could store temporal sequences of inputs in their ongoing, transient activity, even if they do not have nontrivial fixed points. In principle, past inputs could be read out from the instantaneous activity of the network. However, the theoretical principles underlying the ability of recurrent networks to store temporal sequences in their transient dynamics are poorly understood. For example, how long can memory traces last in such networks, and how does memory capacity depend on parameters like network size, connectivity, or input statistics?\nSeveral recent theoretical studies have made progress on these issues in the case of linear neuronal networks and gaussian input statistics. Even in this simple setting, the relationship between the memory properties of a neural network and its connectivity is nonlinear, and so understanding this relationship poses an interesting challenge. Jaeger [4] proved a rigorous sum-rule (reviewed in more detail below) which showed that even in the absence of noise, no recurrent network can remember inputs for an amount of time that exceeds the number of neurons (in units of the neuronal time constant) in the network. White et al. 
[5] showed that in the presence of noise, a special class of \u201corthogonal\u201d networks, but not generic recurrent networks, could have memory that scales with network size. And finally, Ganguli et al. [6] used the theory of Fisher information to show that the memory of a recurrent network cannot exceed that of an equivalent feedforward network, at least for times up to the network size, in units of the neuronal time constant.\nA key reason theoretical progress was possible in these works was that even though the optimal estimate of past inputs was a nonlinear function of the network connectivity, it was still a linear function of the current network state, due to the gaussianity of the signal (and possible noise) and the linearity of the dynamics. It is not clear, for example, how these results would generalize to nongaussian signals, whose reconstruction from the current network state would require nonlinear operations. Here we report theoretical progress on understanding the memory capacity of linear recurrent networks for an important class of nongaussian signals, namely sparse signals. Indeed, a wide variety of temporal signals of interest are sparse in some basis, for example human speech in a wavelet basis. We use ideas from compressed sensing (CS) to define memory curves which capture the decay of memory traces in neural networks for sparse signals, and provide methods to compute these curves analytically. We find strikingly different properties of memory curves in the sparse setting compared to the gaussian setting. Although motivated by the problem of memory, we also contribute new results to the field of CS itself, by introducing and analyzing new classes of CS measurement matrices derived from dynamical systems. Our main results are summarized in the discussion section. 
In the next section, we begin by reviewing more quantitatively the problem of short-term memory in neuronal networks, compressed sensing, and the relation between the two.\n\n2 Short-term memory as dynamical compressed sensing\n\nConsider a discrete time network dynamics given by\n\n$$x(n) = W x(n-1) + v s^0(n). \\qquad (1)$$\n\nHere a scalar, time dependent signal $s^0(n)$ drives a recurrent network of $N$ neurons. $x(n) \\in \\mathbb{R}^N$ is the network state at time $n$, $W$ is an $N \\times N$ recurrent connectivity matrix, and $v$ is a vector of feedforward connections from the signal into the network. We choose $v$ to have norm 1, and we demand that the dynamics be stable so that if $\\rho$ is the squared magnitude of the largest eigenvalue of $W$, then $\\rho < 1$. If we think of the signal history $\\{s^0(n-k) \\mid k \\geq 0\\}$ as an infinite dimensional temporal vector $s^0$ whose $k$\u2019th component $s^0_k$ is $s^0(n-k)$, then the current network state $x$ is linearly related to $s^0$ through the effective $N$ by $\\infty$ measurement matrix $A$, i.e. $x = A s^0$, where the matrix elements\n\n$$A_{\\mu k} = (W^k v)_\\mu, \\quad \\mu = 1, \\dots, N, \\quad k = 0, \\dots, \\infty \\qquad (2)$$\n\nreflect the effect of an input $k$ timesteps in the past on the activity of neuron $\\mu$. The extent to which the dynamical system in (1) can remember the past can then be quantified by how well one can recover $s^0$ from $x$ [4, 5, 6].\nIn the case where the signal has zero mean gaussian statistics with covariance $\\langle s^0_k s^0_l \\rangle = \\delta_{k,l}$, the optimal, minimum mean squared error estimate $\\hat{s}$ of the signal history is given by $\\hat{s} = A^T (A A^T)^{-1} x$. The correlation between the estimate $\\hat{s}_k$ and the true signal $s^0_k$, averaged over the gaussian statistics of $s^0$, then defines a memory curve $M(k) = \\langle \\hat{s}_k s^0_k \\rangle_{s^0}$, whose decay as $k$ increases quantifies the decay of memory for past inputs in (1). Jaeger proved an important sum-rule for $M(k)$: $\\sum_{k=0}^{\\infty} M(k) = N$ for any recurrent connectivity $W$ and feedforward connectivity $v$. Given that $M(k)$ cannot exceed 1 for any $k$, an important consequence of this sum-rule is that it is not possible to recover an input signal $k$ timesteps into the past when $k$ is much larger than $N$, in the sense that $\\hat{s}_k$ will be at most weakly correlated with $s^0_k$.\nGenerically, one may not hope to remember sequences lasting longer than $N$ timesteps with only $N$ neurons, but in the case of temporally sparse inputs, the field of compressed sensing (CS) suggests this may be possible. CS [7, 8] shows how to recover a sparse $T$ dimensional signal $s^0$, in which only a fraction $f$ of the elements are nonzero, from a set of $N$ linear measurements $x = A s^0$ where $A$ is an $N$ by $T$ measurement matrix with $N < T$. One approach to recovering an estimate $\\hat{s}$ of $s^0$ from $x$ involves $L_1$ minimization,\n\n$$\\hat{s} = \\arg\\min_s \\sum_{i=1}^{T} |s_i| \\quad \\text{subject to} \\quad x = A s, \\qquad (3)$$\n\nwhich finds the sparsest signal, as measured by smallest $L_1$ norm, consistent with the measurement constraints. Much of the seminal work in CS [9, 10, 11] has focused on sufficient conditions on $A$ such that (3) is guaranteed to perfectly recover the true signal, so that $\\hat{s} = s^0$. However, many large random measurement matrices $A$ which violate sufficient conditions proven in the literature nevertheless typically yield perfect signal recovery. Alternate work [12, 13, 14, 15], which analyzes the asymptotic performance of large random measurement matrices in which each matrix element is drawn i.i.d. from a gaussian distribution, has revealed a phase transition in performance as a function of the signal sparsity $f$ and the degree of subsampling $\\alpha = N/T$. 
In the $\\alpha$-$f$ plane, there is a critical phase boundary $\\alpha_c(f)$ such that if $\\alpha > \\alpha_c(f)$ then CS will typically yield perfect signal reconstruction, whereas if $\\alpha < \\alpha_c(f)$, CS will yield errors.\nMotivated by the above work in CS, we propose here that a neural network, or more generally any dynamical system as in (1), could in principle perform compressed sensing of its past inputs, and that a long but sparse signal history $s^0$ could potentially be recovered from the instantaneous network state $x$. We quantify the memory capabilities of a neural network for sparse signals, by assessing our ability to reconstruct the past signal using $L_1$ minimization. Given a network state $x$ arising from a signal history $s^0$ through (1), we can obtain an estimate $\\hat{s}$ of the past using (3), where the measurement matrix $A$ is given by (2). We then define a memory curve\n\n$$E(k) = \\langle (\\hat{s}_k - s^0_k)^2 \\rangle_{s^0}, \\qquad (4)$$\n\nnamely the average reconstruction error of a signal $k$ timesteps in the past averaged over the statistics of $s^0$. The rise of this error as $k$ increases captures the decay of memory traces in (1). The central goal of this paper is to obtain a deeper understanding of the memory properties of neural networks for sparse signals by studying the memory curve $E(k)$ and especially its dependence on $W$. In particular, we are interested in classes of network connectivities $W$ and input statistics for which $E(k)$ can remain small even for $k \\gg N$. Such networks can essentially perform compressed sensing of their past inputs.\nFrom the perspective of CS, measurement matrices $A$ of the form in (2), henceforth referred to as dynamical CS matrices, possess several new features not considered in the existing CS literature, features which could pose severe challenges for a recurrent network $W$ to achieve good CS performance. 
First, $A$ is an $N$ by $\\infty$ matrix, and so from the perspective of the phase diagram for CS reviewed above, it is likely that $A$ is in the error phase; thus perfect reconstruction of the true signal, even for recent inputs, will not be possible. Second, because we demand stable dynamics in (1), the columns of $A$ decay as $k$ increases: $\\|W^k v\\|^2 < \\rho^k$ where again $\\rho < 1$ is the squared magnitude of the largest eigenvalue of $W$. Such decay can compound errors. Third, the different columns of $A$ can be correlated; if one thinks of $W^k v$ as the state of the network $k$ timesteps after a single unit input pulse, it is clear that temporal correlations in the evolving network response to this pulse are equivalent to correlations in the columns of $A$ in (2). Such correlations could potentially adversely affect the performance of CS based on $A$, as well as complicate the theoretical analysis of CS performance. Nevertheless, despite all these seeming difficulties, in the following we show that a special class of network connectivities can indeed achieve good CS performance in which errors are controlled and memory traces can last longer than the number of neurons.\n\n3 Memory in an Annealed Approximation to a Dynamical System\n\nIn this section, we work towards an analytic understanding of the memory curve $E(k)$ defined in (4). This curve depends on $W$, $v$ and the statistics of $s^0$. We would like to understand its properties for ensembles of large random networks $W$, just as the asymptotic performance of CS was analyzed for large random measurement matrices $A$ [12, 13, 14, 15]. However, in the dynamical setting, even if $W$ is drawn from a simple random matrix ensemble, $A$ in (2) will have correlations across its columns, making an analytical treatment of the memory curve difficult. Here we consider an ensemble of measurement matrices $A$ which approximate dynamical CS matrices and can be treated analytically. 
We consider matrices in which each element $A_{\\mu k}$ is drawn i.i.d. from a zero mean gaussian distribution with variance $\\rho^k$. Since we are interested in memory that lasts $O(N)$ timesteps, we choose $\\rho = e^{-1/(\\tau N)}$, with $\\tau$ of $O(1)$. This so-called annealed approximation (AA) to a dynamical CS matrix captures two of the salient properties of dynamical CS matrices, their infinite temporal extent and the decay of successive columns, but neglects the analytically intractable correlations across columns. Such annealed CS matrices can be thought of as arising from \u201cimaginary\u201d dynamical systems in which network activity patterns over time in response to a pulse decay, but are somehow temporally uncorrelated. $\\tau$ can be thought of as the effective integration time of this dynamical system, in units of the number of neurons. Finally, to fully specify $E(k)$, we must choose the statistics of $s^0$. We assume $s^0$ has a probability $f$ of being nonzero at any given time, and if nonzero, this nonzero value is drawn from a distribution $P(s)$ which for now we take to be arbitrary.\nTo theoretically compute the memory curve $E(k)$, we define an energy function\n\n$$E(s) = \\frac{\\lambda}{2} u^T A^T A u + \\sum_{i=1}^{T} |s_i|, \\qquad (5)$$\n\nwhere $u \\equiv s - s^0$ is the residual, and we consider the Gibbs distribution $P_G(s) = \\frac{1}{Z} e^{-\\beta E(s)}$. We will later take $\\lambda \\to \\infty$ so that the quadratic part of the energy function enforces the constraint $A s = A s^0$, and then take the low temperature $\\beta \\to \\infty$ limit so that $P_G(s)$ concentrates onto the global minimum of (3). In this limit, we can extract the memory curve $E(k)$ as the average of $(s_k - s^0_k)^2$ over $P_G$ and the statistics of $s^0$. Although $P_G$ depends on $A$, for large $N$, the properties of $P_G$, including the memory curve $E(k)$, do not depend on the detailed realization of $A$, but only on its statistics. 
Indeed we can compute all properties of $P_G$ for any typical realization of $A$ by averaging over both $A$ and $s^0$. This is done using the replica method [16] in our supplementary material. The replica method has been used recently in several works to analyze CS for the traditional case of uniform random gaussian measurement matrices [14, 17, 15]. We find that the statistics of each component $s_k$ in $P_G(s)$, conditioned on the true value $s^0_k$, are well described by a mean field effective Hamiltonian\n\n$$H^{MF}_k(s) = \\rho^k \\frac{\\beta\\lambda}{2(1 + \\beta\\lambda\\Delta Q)} \\left( s - s^0_k - z\\sqrt{Q_0/\\rho^k} \\right)^2 + \\beta |s|, \\qquad (6)$$\n\nwhere $z$ is a random variable with a standard normal distribution. Thus the mean field approximation to the marginal distribution of a reconstruction component $s_k$ is\n\n$$P^{MF}_k(s_k = s) = \\int Dz \\, \\frac{1}{Z^{MF}_k} \\exp(-H^{MF}_k(s)), \\qquad (7)$$\n\nwhere $Dz = \\frac{dz}{\\sqrt{2\\pi}} e^{-\\frac{1}{2} z^2}$ is a Gaussian measure. The order parameters $Q_0$ and $\\Delta Q \\equiv Q_1 - Q_0$ obey\n\n$$Q_0 = \\frac{1}{N} \\sum_{k=0}^{\\infty} \\rho^k \\langle\\langle \\langle u \\rangle^2_{H^{MF}_k} \\rangle\\rangle_z \\qquad (8)$$\n\n$$\\Delta Q = \\frac{1}{N} \\sum_{k=0}^{\\infty} \\rho^k \\langle\\langle \\langle \\delta u^2 \\rangle_{H^{MF}_k} \\rangle\\rangle_z. \\qquad (9)$$\n\nHere $\\langle u \\rangle_{H^{MF}_k}$ and $\\langle \\delta u^2 \\rangle_{H^{MF}_k}$ are the mean and variance of the residual $u_k = s_k - s^0_k$ with respect to a Gibbs distribution with Hamiltonian given by (6), and the double angular average $\\langle\\langle \\cdot \\rangle\\rangle_z$ refers to integrating over the Gaussian distribution of $z$. $Q_1$ and $Q_0$ have simple interpretations in terms of the original Gibbs distribution $P_G$ defined above: $Q_1 = \\frac{1}{N} \\sum_{k=1}^{\\infty} \\rho^k \\langle u_k^2 \\rangle_{P_G}$ and $Q_0 = \\frac{1}{N} \\sum_{k=1}^{\\infty} \\rho^k \\langle u_k \\rangle^2_{P_G}$, for typical realizations of $A$. Thus the order parameter equations (8)-(9) can be understood as self-consistency conditions for the definition of $Q_0$ and $\\Delta Q$ in the mean field approximation to $P_G$. In this approximation, the complicated constraints coupling $s_k$ for various $k$ are replaced with a random gaussian force $z$ in (6) which tends to prevent the marginal $s_k$ from assuming the true value $s^0_k$. This force is what remains of the measurement constraints after averaging over $A$, and its statistics are in turn a function of $Q_0$ and $Q_1$, as determined by the replica method.\nNow to compute the memory curve $E(k)$, we must take the limits $\\lambda, \\beta, N \\to \\infty$ and complete the average over $s^0_k$. The $\\lambda \\to \\infty$ limit can be taken immediately in (6) and $\\lambda$ disappears from the problem. Now as $\\beta \\to \\infty$, self consistent solutions to (8) and (9) can be found when $Q_0 \\equiv q_0$ and $\\Delta Q \\equiv \\Delta q/\\beta$, where $q_0$ and $\\Delta q$ are $O(1)$. This limit is similar to that taken in a replica analysis of CS for random gaussian matrices in the error regime [15]. Taking this limit, (6) becomes\n\n$$H^{MF}_k(s) = \\beta \\left[ \\frac{1}{2\\rho^{-k}\\Delta q} \\left( s - s^0_k - z\\sqrt{\\rho^{-k} q_0} \\right)^2 + |s| \\right]. \\qquad (10)$$\n\nSince the entire Hamiltonian is proportional to $\\beta$, in the large $\\beta$ limit, the statistics of $s_k$ are dominated by the global minimum of (10). In particular, we have\n\n$$\\langle s \\rangle_{H^{MF}_k} = \\eta\\left( s^0_k + z\\sqrt{\\rho^{-k} q_0}, \\; \\rho^{-k}\\Delta q \\right), \\qquad (11)$$\n\nwhere\n\n$$\\eta(x, \\sigma) = \\arg\\min_s \\left( \\frac{(s - x)^2}{2\\sigma} + |s| \\right) = \\mathrm{sgn}(x)(|x| - \\sigma)_+ \\qquad (12)$$\n\nis a soft thresholding function which also arises in message passing approaches [18] to solving the CS problem in (3), and $(y)_+ = y$ if $y > 0$ and is otherwise 0. The optimization in (12) can be understood intuitively as follows: suppose one measures a scalar value $x$ which is a true signal $s^0$ corrupted by additive gaussian noise with variance $\\sigma$. Under a Laplace prior $e^{-|s^0|}$ on the true signal, $\\eta(x, \\sigma)$ is simply the MAP estimate of $s^0$ given the data $x$, which basically chooses the estimate $s = 0$ unless the data exceeds the noise level $\\sigma$. Thus we see that in (10), $\\rho^{-k}\\Delta q$ plays the role of an effective noise level which increases with time $k$. Also, the variance of $s$ at large $\\beta$ is\n\n$$\\langle (\\delta s)^2 \\rangle_{H^{MF}_k} = \\frac{1}{\\beta} \\chi\\left( s^0_k + z\\sqrt{\\rho^{-k} q_0}, \\; \\rho^{-k}\\Delta q \\right), \\qquad (13)$$\n\nwhere\n\n$$\\chi(x, \\sigma) = \\sigma \\, \\Theta(|x| - \\sigma), \\qquad (14)$$\n\nand $\\Theta(x)$ is a step function at 0. Inserting (11) and (13) and the ansatz $\\Delta Q \\equiv \\Delta q/\\beta$ into (8) and (9) then removes $\\beta$ from the problem. But before making these substitutions, we first take $N \\to \\infty$ at fixed $\\tau$ and $f$ of $O(1)$ by taking a continuum approximation for time, $t = k/N$, $\\rho^k \\to e^{-t/\\tau}$, $\\frac{1}{N} \\sum_{k=0}^{\\infty} \\to \\int_0^{\\infty} dt$. Moreover, we average over the true signal history $s^0_k$, so that (8) and (9) become\n\n$$q_0 = \\int_0^{\\infty} dt \\, e^{-t/\\tau} \\left\\langle\\left\\langle \\left( \\eta(s^0 + z\\sqrt{e^{t/\\tau} q_0}, \\; e^{t/\\tau}\\Delta q) - s^0 \\right)^2 \\right\\rangle\\right\\rangle_{z,s^0}, \\qquad (15)$$\n\n$$\\Delta q = \\int_0^{\\infty} dt \\, e^{-t/\\tau} \\left\\langle\\left\\langle \\chi(s^0 + z\\sqrt{e^{t/\\tau} q_0}, \\; e^{t/\\tau}\\Delta q) \\right\\rangle\\right\\rangle_{z,s^0}, \\qquad (16)$$\n\nwhere the double angular average reflects an integral over the gaussian distribution of $z$ and the full distribution of $s^0$, i.e. $\\langle\\langle F(z, s^0) \\rangle\\rangle_{z,s^0} \\equiv (1 - f) \\int Dz \\, F(z, 0) + f \\int Dz \\, ds^0 \\, P(s^0) F(z, s^0)$.\nFinally the memory curve $E(t)$ is simply the continuum limit of the averaged squared residual $\\langle\\langle \\langle u \\rangle^2_{H^{MF}_k} \\rangle\\rangle_{z,s^0}$, and is given by\n\n$$E(t) = \\left\\langle\\left\\langle \\left( \\eta(s^0 + z\\sqrt{e^{t/\\tau} q_0}, \\; e^{t/\\tau}\\Delta q) - s^0 \\right)^2 \\right\\rangle\\right\\rangle_{z,s^0}. \\qquad (17)$$\n\nEquations (15), (16), and (17) now depend only on $\\tau$, $f$ and $P(s^0)$, and their theoretical predictions can now be compared with numerical experiments. In this work we focus on a simple class of plus-minus (PM) signals in which $P(s^0) = \\frac{1}{2}\\delta(s^0 - 1) + \\frac{1}{2}\\delta(s^0 + 1)$. Fig. 1A shows an example of a PM signal $s^0$ with $f = 0.01$, while Fig. 1B shows an example of a reconstruction $\\hat{s}$ using $L_1$ minimization in (3) where the data $x$ used in (3) was obtained from $s^0$ using a random annealed measurement matrix with $\\tau = 1$. Clearly there are errors in the reconstruction, but remarkably, despite the decay in the columns of $A$, the reconstruction is well correlated with the true signal for a time up to 4 times the number of measurements. We can derive theoretical memory curves for any given $f$ and $\\tau$ by numerically solving for $q_0$ and $\\Delta q$ in (15), (16), and inserting the results into (17). Examples of the agreement between theory and simulations are shown in Fig. 1C-E.\n\nFigure 1: Memory in the annealed approximation. (A) A PM signal $s^0$ with $f = 0.01$ that lasts $T = 10N$ timesteps where $N = 500$. (B) A reconstruction of $s^0$ from the output of an annealed measurement matrix with $N = 500$, $\\tau = 1$. (C,D,E) Example memory curves for $f = 0.01$, and $\\tau = 1$ (C), 2 (D), 3 (E). (F) $T_{1/2}$ as a function of $\\tau$. The 4 curves from top to bottom are for $f = 0.01, 0.02, 0.03, 0.04$. (G) $T_{1/2}$ optimized over $\\tau$ for each $f$. (H) The initial error as a function of $f$. The 3 curves from bottom to top are for $\\tau = 1, 2, 3$. For (C-H), red curves are theoretical predictions while blue curves and points are from numerical simulations of $L_1$ minimization with $N = 100$ averaged over 300 trials. The width of the blue curves reflects standard error.\n\nAs $t \\to \\infty$, $L_1$ minimization always yields a zero signal estimate, so the memory curve asymptotically approaches $f$ for large $t$. A convenient measure of memory capacity is the time $T_{1/2}$ at which the memory curve reaches half its asymptotic error value, i.e. $E(T_{1/2}) = f/2$. A principal feature of this family of memory curves is that for any given $f$ there is an optimal $\\tau$ which maximizes $T_{1/2}$ (Fig. 1F). The presence of this optimum arises due to a competition between decay and interference. If $\\tau$ is too small, signal measurements decay too quickly, thereby preventing large memory capacity. However, if $\\tau$ is too large, signals from the distant past do not decay away, thereby interfering with the measurements of more recent signals, and again degrading memory. As $f$ decreases, long time signal interference is reduced, thereby allowing larger values of $\\tau$ to be chosen without degrading memory for more recent signals. For any given $f$, we can compute $T_{1/2}(f)$ optimized over $\\tau$ (Fig. 1G). This memory capacity, again measured in units of the number of neurons, already exceeds 1 at modest values of $f = 0.1$, and diverges as $f \\to 0$, as does the optimal value of $\\tau$. By analyzing (15) and (16) in the limit $f \\to 0$ and $\\tau \\to \\infty$, we find that $\\Delta q$ is $O(1)$ while $q_0 \\to 0$. 
Furthermore, as $f \\to 0$, the optimal $T_{1/2}$ is $O\\left(\\frac{1}{f \\log 1/f}\\right)$.\nThe smallest error occurs at $t = 0$ and it is natural to ask how this error $E(0)$ behaves as a function of $f$ for small $f$ to see how well the most recent input can be reconstructed in the limit of sparse signals. We analyze (15) and (16) in the limit $f \\to 0$ and $\\tau$ of $O(1)$, and find that $E(0)$ is $O(f^2)$ as confirmed in Fig. 1H. Furthermore, $E(0)$ monotonically increases with $\\tau$ for fixed $f$ as more signals from the past interfere with the most recent input.\n\n4 Orthogonal Dynamical Systems\n\nWe have seen in the previous section that annealed CS matrices have remarkable memory properties, but our main interest was to exhibit a dynamical CS matrix as in (2) capable of good compressed sensing, and therefore short-term memory, performance. Here we show that a special class of network connectivity in which $W = \\sqrt{\\rho}\\, O$ where $O$ is any orthogonal matrix, and $v$ is any random unit norm vector, possesses memory properties remarkably close to that of the annealed matrix ensemble. Fig. 2A-F presents results identical to that of Fig. 1C-H except for the crucial change that all simulation results in Fig. 2 were obtained using dynamical CS matrices of the form $A_{\\mu k} = (\\rho^{k/2} O^k v)_\\mu$, rather than annealed CS matrices. All red curves in Fig. 2A-F are identical to those in Fig. 1 and reflect the theory of annealed CS matrices derived in the previous section.\nFor small $\\tau$, we see small discrepancies between memory curves for orthogonal neural networks and the annealed theory (Fig. 2A-B), but as $\\tau$ increases, this discrepancy decreases (Fig. 2C). In particular, from the perspective of the optimal $T_{1/2}$ for which larger $\\tau$ is relevant, we see a remarkable match between the optimal memory capacity of orthogonal neural networks and that predicted by the annealed theory (see Fig. 2E). 
And there is a good match in the initial error even at small $\\tau$ (Fig. 2F).\n\nFigure 2: Memory in orthogonal neuronal networks. Panels (A-F) are identical to panels (C-H) in Fig. 1 except now the blue curves and points are obtained from simulations of $L_1$ minimization using measurement matrices derived from an orthogonal neuronal network. (G) The mean and standard deviation of $\\sigma_f$ for 5 annealed (red) and 5 orthogonal matrices (blue) with $N = 200$ and $T = 3000$.\n\nThe key difference between the annealed and the dynamical CS matrices is that the former neglects correlations across columns that can arise in the latter. How strong are these correlations for the case of orthogonal matrices? Motivated by the restricted isometry property [11], we consider the following probe of the strength of correlations across columns of $A$. Consider an $N$ by $fT$ matrix $B$ obtained by randomly subsampling the columns of an $N$ by $T$ measurement matrix $A$. Let $\\sigma_f$ be the maximal eigenvalue of the matrix $B^T B$ of inner products of columns of $B$. $\\sigma_f$ is a measure of the strength of correlations across the $fT$ sampled columns of $A$. We can estimate the mean and standard deviation of $\\sigma_f$ due to the random choice of $fT$ columns of $A$ and plot the results as a function of $f$. To separate the issue of correlations from decay, we do this analysis for $\\rho = 1$ and finite $T$ (similar results are obtained for large $T$ and $\\rho < 1$). Results are shown in Fig. 2G for 5 instances of annealed (red) and dynamical (blue) CS matrices. We see strikingly different behavior in the two ensembles. 
Correlations are much stronger in the dynamical ensemble, and fluctuate from instance to instance, while they are weaker in the annealed ensemble, and do not fluctuate (the 5 red curves are on top of each other). Given the very different statistical properties of the two ensembles, the level of agreement between the simulated memory properties of orthogonal neural networks and the theory of annealed CS matrices is remarkable.\nWhy do orthogonal neural networks perform so well, and can more generic networks have similar performance? The key to understanding the memory, and CS, capabilities of orthogonal neural networks lies in the eigenvalue spectrum of an orthogonal matrix. The eigenvalues of $W = \\sqrt{\\rho}\\, O$, when $O$ is a large random orthogonal matrix, are uniformly distributed on a circle of radius $\\sqrt{\\rho}$ in the complex plane. Thus when $\\rho = e^{-1/(\\tau N)}$, the sequence of vectors $W^k v$ explores the full $N$ dimensional space of network activity patterns for $O(\\tau N)$ time steps before decaying away. In contrast, a generic random Gaussian matrix $W$ with elements drawn i.i.d. from a zero mean gaussian with variance $\\rho/N$ has eigenvalues uniformly distributed on a solid disk of radius $\\sqrt{\\rho}$ in the complex plane. Thus the sequence of vectors $W^k v$ no longer explores a high dimensional space of activity patterns; components of $v$ in the direction of eigenmodes of $W$ with small eigenvalues will rapidly decay away, and so the sequence will rapidly become confined to a low dimensional space. Good compressed sensing matrices often have columns that are random and uncorrelated. 
From the above considerations, it is clear that dynamical CS matrices derived from orthogonal neural networks can come close to this ideal, while those derived from generic gaussian networks cannot.\n\n5 Discussion\n\nIn this work we have made progress on the theory of short-term memory for nongaussian, sparse, temporal sequences stored in the transient dynamics of neuronal networks. We used the framework of compressed sensing, specifically $L_1$ minimization, to reconstruct the history of the past input signal from the current network activity state. The reconstruction error as a function of time into the past then yields a well-defined memory curve that reflects the memory capabilities of the network. We studied the properties of this memory curve and its dependence on network connectivity, and found results that were qualitatively different from prior theoretical studies devoted to short-term memory in the setting of gaussian input statistics. In particular we found that orthogonal neural networks, but importantly, not generic random gaussian networks, are capable of remembering inputs for a time that exceeds the number of neurons in the network, thereby circumventing a theorem proven in [4], which limits the memory capacity of any network to be less than the number of neurons in the gaussian signal setting. Also, recurrent connectivity plays an essential role in allowing a network to have a memory capacity that exceeds the number of neurons. Thus purely feedforward networks, which always outperform recurrent networks (for times less than the network size) in the scenario of gaussian signals and noise [6], are no longer optimal for sparse input statistics. 
Finally, we exploited\npowerful tools from statistical mechanics to analytically compute memory curves as a function of\nsignal sparsity and network integration time. Our theoretically computed curves matched simulations\nof orthogonal neural networks reasonably well. To our knowledge, these results represent the first\ntheoretical calculations of short-term memory curves for sparse signals in neuronal networks.\nWe emphasize that we are not suggesting that biological neural systems use L1 minimization to\nreconstruct past inputs. Instead, we use L1 minimization in this work simply as a theoretical tool to\nprobe the memory capabilities of neural networks. However, neural implementations of L1 minimization\nexist [19, 20], so if stimulus reconstruction were the goal of a neural system, reconstruction\nperformance similar to what is reported here could be obtained in a neurally plausible manner. Also,\nwe found that orthogonal neural networks, because of their eigenvalue spectrum, display remarkable\nmemory properties, similar to those of an annealed approximation. Such special connectivity\nis essential for memory performance, as random gaussian networks cannot have memory similar\nto the annealed approximation. Orthogonal connectivity could be implemented in a biologically\nplausible manner using antisymmetric networks with inhibition operating in continuous time. When\nexponentiated, such connectivities yield the orthogonal networks considered here in discrete time.\nOur results are relevant not only to the field of short-term memory, but also to the field of compressed\nsensing (CS). We have introduced two new ensembles of random CS measurement matrices. The\nfirst of these, dynamical CS matrices, are the effective measurements a dynamical system makes on\na continuous temporal stream of input. 
Dynamical CS matrices have three properties not considered\nin the existing CS literature: they are infinite in temporal extent, have columns that decay over time,\nand exhibit correlations between columns. We also introduce annealed CS matrices, which are also\ninfinite in extent and have decaying columns, but have no correlations across columns. We show how to\nanalytically calculate the time course of reconstruction error in the annealed ensemble and compare\nit to the dynamical ensemble for orthogonal dynamical systems. Our results show that orthogonal\ndynamical systems can perform CS even while operating with errors.\nThis work suggests several extensions. Given the importance of signal statistics in determining\nmemory capacity, it would be interesting to study memory for sparse nonnegative signals. The\ninequality constraints on the space of allowed signals arising from nonnegativity can have important\neffects in CS; they shift the phase boundary between perfect and error-prone reconstruction [12, 13,\n15], and they allow the existence of a new phase in which signal reconstruction is possible even\nwithout L1 minimization [15]. We have found, through simulations, dramatic improvements in\nmemory capacity in this case, and are extending the theory to explain these effects. Also, we have\nused a simple model for sparseness, in which a fraction of signal elements are nonzero. But our\ntheory is general for any signal distribution, and could be used to analyze other models of sparsity,\ne.g. signals drawn from Lp priors. Also, we have worked in the high SNR limit. However, our\ntheory can be extended to analyze memory in the presence of noise by working at finite \u03bb. But most\nimportantly, a deeper understanding of the relationship between dynamical CS matrices and their\nannealed counterparts would be desirable. The effects of temporal correlations in the network activity\npatterns of orthogonal dynamical systems are central to this problem. 
For example, we have seen that\nthese temporal correlations introduce strong correlations between the columns of the corresponding\ndynamical CS matrix (Fig. 2G), yet the memory properties of these matrices agree well with our\nannealed theory (Fig. 2E-F), which neglects these correlations. We leave this observation as an\nintriguing puzzle for the fields of short-term memory, dynamical systems, and compressed sensing.\n\nAcknowledgments\n\nS. G. and H. S. thank the Swartz Foundation, Burroughs Wellcome Fund, and the Israeli Science\nFoundation for support, and Daniel Lee for useful discussions.\n\nReferences\n\n[1] J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS, 79(8):2554, 1982.\n\n[2] W. Maass, T. Natschlager, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural computation, 14(11):2531\u20132560, 2002.\n\n[3] H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78, 2004.\n\n[4] H. Jaeger. Short term memory in echo state networks. GMD Report 152, German National Research Center for Information Technology, 2001.\n\n[5] O.L. White, D.D. Lee, and H. Sompolinsky. Short-term memory in orthogonal neural networks. Phys. Rev. Lett., 92(14):148102, 2004.\n\n[6] S. Ganguli, D. Huh, and H. Sompolinsky. Memory traces in dynamical systems. Proc. Natl. Acad. Sci., 105(48):18970, 2008.\n\n[7] A.M. Bruckstein, D.L. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. Siam Review, 51(1):34\u201381, 2009.\n\n[8] E. Candes and M. Wakin. An introduction to compressive sampling. IEEE Sig. Proc. Mag., 25(2):21\u201330, 2008.\n\n[9] D.L. Donoho and M. Elad. Optimally sparse representation in general (non-orthogonal) dictionaries via l1 minimization. 
PNAS, 100:2197\u20132202, 2003.\n\n[10] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory, 52(2):489\u2013509, 2006.\n\n[11] E. Candes and T. Tao. Decoding by linear programming. IEEE Trans. Inf. Theory, 51:4203\u20134215, 2005.\n\n[12] D.L. Donoho and J. Tanner. Sparse nonnegative solution of underdetermined linear equations by linear programming. PNAS, 102:9446\u201351, 2005.\n\n[13] D.L. Donoho and J. Tanner. Neighborliness of randomly projected simplices in high dimensions. PNAS, 102:9452\u20137, 2005.\n\n[14] Y. Kabashima, T. Wadayama, and T. Tanaka. A typical reconstruction limit for compressed sensing based on lp-norm minimization. J. Stat. Mech., page L09003, 2009.\n\n[15] S. Ganguli and H. Sompolinsky. Statistical mechanics of compressed sensing. Phys. Rev. Lett., 104(18):188701, 2010.\n\n[16] M. Mezard, G. Parisi, and M.A. Virasoro. Spin glass theory and beyond. World Scientific, Singapore, 1987.\n\n[17] S. Rangan, A.K. Fletcher, and V.K. Goyal. Asymptotic analysis of map estimation via the replica method and applications to compressed sensing. CoRR, abs/0906.3234, 2009.\n\n[18] D.L. Donoho, A. Maleki, and A. Montanari. Message-passing algorithms for compressed sensing. Proc. Natl. Acad. Sci., 106(45):18914, 2009.\n\n[19] Y. Xia and M.S. Kamel. A cooperative recurrent neural network for solving l1 estimation problems with general linear constraints. Neural computation, 20(3):844\u2013872, 2008.\n\n[20] C.J. Rozell, D.H. Johnson, R.G. Baraniuk, and B.A. Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural computation, 20(10):2526\u20132563, 2008.", "award": [], "sourceid": 1166, "authors": [{"given_name": "Surya", "family_name": "Ganguli", "institution": null}, {"given_name": "Haim", "family_name": "Sompolinsky", "institution": null}]}