{"title": "Fast Sampling-Based Inference in Balanced Neuronal Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2240, "page_last": 2248, "abstract": "Multiple lines of evidence support the notion that the brain performs probabilistic inference in multiple cognitive domains, including perception and decision making. There is also evidence that probabilistic inference may be implemented in the brain through the (quasi-)stochastic activity of neural circuits, producing samples from the appropriate posterior distributions, effectively implementing a Markov chain Monte Carlo algorithm. However, time becomes a fundamental bottleneck in such sampling-based probabilistic representations: the quality of inferences depends on how fast the neural circuit generates new, uncorrelated samples from its stationary distribution (the posterior). We explore this bottleneck in a simple, linear-Gaussian latent variable model, in which posterior sampling can be achieved by stochastic neural networks with linear dynamics. The well-known Langevin sampling (LS) recipe, so far the only sampling algorithm for continuous variables of which a neural implementation has been suggested, naturally fits into this dynamical framework. However, we first show analytically and through simulations that the symmetry of the synaptic weight matrix implied by LS yields critically slow mixing when the posterior is high-dimensional. Next, using methods from control theory, we construct and inspect networks that are optimally fast, and hence orders of magnitude faster than LS, while being far more biologically plausible. In these networks, strong -- but transient -- selective amplification of external noise generates the spatially correlated activity fluctuations prescribed by the posterior. Intriguingly, although a detailed balance of excitation and inhibition is dynamically maintained, detailed balance of Markov chain steps in the resulting sampler is violated, consistent with recent findings on how statistical irreversibility can overcome the speed limitation of random walks in other domains.", "full_text": "Fast Sampling-Based Inference in Balanced Neuronal\n\nNetworks\n\nGuillaume Hennequin1\ngjeh2@cam.ac.uk\n\nLaurence Aitchison2\n\nlaurence@gatsby.ucl.ac.uk\n\nM\u00b4at\u00b4e Lengyel1\n\nm.lengyel@eng.cam.ac.uk\n\n1Computational & Biological Learning Lab, Dept. of Engineering, University of Cambridge, UK\n2Gatsby Computational Neuroscience Unit, University College London, UK\n\nAbstract\n\nMultiple lines of evidence support the notion that the brain performs probabilistic\ninference in multiple cognitive domains, including perception and decision mak-\ning. There is also evidence that probabilistic inference may be implemented in the\nbrain through the (quasi-)stochastic activity of neural circuits, producing samples\nfrom the appropriate posterior distributions, effectively implementing a Markov\nchain Monte Carlo algorithm. However, time becomes a fundamental bottleneck\nin such sampling-based probabilistic representations: the quality of inferences de-\npends on how fast the neural circuit generates new, uncorrelated samples from\nits stationary distribution (the posterior). We explore this bottleneck in a sim-\nple, linear-Gaussian latent variable model, in which posterior sampling can be\nachieved by stochastic neural networks with linear dynamics. 
The well-known\nLangevin sampling (LS) recipe, so far the only sampling algorithm for continu-\nous variables of which a neural implementation has been suggested, naturally \ufb01ts\ninto this dynamical framework. However, we \ufb01rst show analytically and through\nsimulations that the symmetry of the synaptic weight matrix implied by LS yields\ncritically slow mixing when the posterior is high-dimensional. Next, using meth-\nods from control theory, we construct and inspect networks that are optimally fast,\nand hence orders of magnitude faster than LS, while being far more biologically\nplausible. In these networks, strong \u2013 but transient \u2013 selective ampli\ufb01cation of\nexternal noise generates the spatially correlated activity \ufb02uctuations prescribed by\nthe posterior. Intriguingly, although a detailed balance of excitation and inhibition\nis dynamically maintained, detailed balance of Markov chain steps in the resulting\nsampler is violated, consistent with recent \ufb01ndings on how statistical irreversibil-\nity can overcome the speed limitation of random walks in other domains.\n\n1\n\nIntroduction\n\nThe high speed of human sensory perception [1] is puzzling given its inherent computational com-\nplexity: sensory inputs are noisy and ambiguous, and therefore do not uniquely determine the state\nof the environment for the observer, which makes perception akin to a statistical inference problem.\nThus, the brain must represent and compute with complex and often high-dimensional probability\ndistributions over relevant environmental variables. Most state-of-the-art machine learning tech-\nniques for large scale inference trade inference accuracy for computing speed (e.g. [2]). The brain,\non the contrary, seems to enjoy both simultaneously [3].\nSome probabilistic computations can be made easier through an appropriate choice of representa-\ntion for the probability distributions of interest. Sampling-based representations used in Monte Carlo\n\n1\n\n\ftechniques, for example, make computing moments of the distribution or its marginals straightfor-\nward. Indeed, recent behavioural and neurophysiological evidence suggests that the brain uses such\nsampling-based representations by neural circuit dynamics implementing a Markov chain Monte\nCarlo (MCMC) algorithm such that their trajectories in state space produce sequential samples from\nthe appropriate posterior distribution [4, 5, 6].\nHowever, for sampling-based representations, speed becomes a key bottleneck: computations in-\nvolving the posterior distribution become accurate only after enough samples have been collected,\nand one has no choice but to wait for those samples to be delivered by the circuit dynamics. For\nsampling to be of any practical use, the interval that separates the generation of two independent\nsamples must be short relative to the desired behavioral timescale. Single neurons can integrate\ntheir inputs on a timescale \u03c4m \u2248 10 \u2212 50 ms, whereas we must often make decisions in less than\na second: this leaves just enough time to use (i.e. read out) a few tens of samples. What kinds of\nneural circuit dynamics are capable of producing uncorrelated samples at \u223c100 Hz remains unclear.\nHere, we introduce a simple yet non-trivial generative model and seek plausible neuronal network\ndynamics for fast sampling from the corresponding posterior distribution. 
While some standard\nmachine learning techniques such as Langevin or Gibbs sampling do suggest \u201cneural network\u201d-\ntype solutions to sampling, not only are the corresponding architectures implausible in fundamental\nways (e.g. they violate Dale\u2019s law), but we show here that they lead to unacceptably slow mixing\nin high dimensions. Although the issue of sampling speed in general is well appreciated in the\ncontext of machine learning, there have been no systematic approaches to tackle it owing to a large\npart to the fact that sampling speed can only be evaluated empirically in most cases. In contrast,\nthe simplicity of our generative model allowed us to draw an analytical picture of the problem\nwhich in turn suggested a systematic approach for solving it. Speci\ufb01cally, we used methods from\nrobust control to discover the fastest neural-like sampler for our generative model, and to study its\nstructure. We \ufb01nd that it corresponds to greatly non-symmetric synaptic interactions (leading to\nstatistical irreversibility), and mathematically nonnormal1 circuit dynamics [7, 8] in close analogy\nwith the dynamical regime in which the cortex has been suggested to operate [9].\n\n2 Linear networks perform sampling under a linear Gaussian model\nWe focus on a linear Gaussian latent variable model which generates observations h \u2208 RM as\nweighted sums of N features A \u2261 (a1; . . . ; aN ) \u2208 RM\u00d7N with jointly Gaussian coef\ufb01cients r \u2208\nRN , plus independent additive noise terms (Fig. 1, left). More formally:\n\np(r) = N (r; 0, C)\n\n(1)\nwhere I denotes the identity matrix. The posterior distribution is multivariate Gaussian, p(r|h) =\nN (r; \u00b5(h), \u03a3), with\n\nand\n\np(h|r) = N(cid:0)h; Ar, \u03c32\nhI(cid:1)\n\n\u03a3 =(cid:0)C\u22121 + A(cid:62)A/\u03c32\n\n(cid:1)\u22121\n\nh\n\nand\n\n\u00b5(h) = \u03a3A(cid:62)h/\u03c32\nh.\n\n(2)\n\nwhere we made explicit the fact that under this simple model, only the mean, \u00b5(h), but not the\ncovariance of the posterior, \u03a3, depends on the input, h.\nWe are interested in neural circuit dynamics for sampling from p(r|h), whereby the data (observa-\ntion) h is given as a constant feedforward input to a population of recurrently connected neurons,\neach of which encodes one of the latent variables and also receives inputs from an external, private\nsource of noise \u03be (Fig. 1, right). Our goal is to devise a network such that the activity \ufb02uctuations\nr(t) in the recurrent layer have a stationary distribution that matches the posterior, for any h.\nSpeci\ufb01cally, we consider linear recurrent stochastic dynamics of the form:\n\ndr =\n\ndt\n\u03c4m\n\n[\u2212r(t) + Wr(t) + Fh] + \u03c3\u03be\n\nd\u03be(t)\n\n(3)\n\nwhere \u03c4m = 20 ms is the single-unit \u201cmembrane\u201d time constant, and d\u03be is a Wiener process of unit\nvariance, which is scaled by a scalar noise intensity \u03c3\u03be. The activity ri(t) could represent either the\n1\u201cNonnormal\u201d should not be confused with \u201cnon-Gaussian\u201d: a matrix M is nonnormal iff MM(cid:62) (cid:54)=\n\nM(cid:62)M.\n\n2\n\n(cid:114) 2\n\n\u03c4m\n\n\fFigure 1: Sampling under a\nlinear Gaussian latent vari-\nable model using neuronal\nnetwork dynamics.\nLeft:\nschematics of the generative\nmodel. Right: schematics of\nthe recognition model. See text\nfor details.\n\nmembrane potential of neuron i, or the deviation of its momentary \ufb01ring rate from a baseline. 
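To make the recognition model concrete, the following minimal Python sketch (sizes, seed and integration step are illustrative choices, not those used in the paper's simulations) spells out the posterior of Eq. 2 and an Euler-Maruyama discretization of the dynamics of Eq. 3; note that Eq. 3 scales the Wiener increment by σξ√(2/τm), which is what makes the stationary covariance come out as in Eq. 4 below. The choice of W and F that makes dynamics and posterior agree is given in Eq. 5 below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generative model (Eq. 1), with small illustrative sizes
N, M, sigma_h = 5, 8, 1.0
A = rng.normal(size=(M, N))          # feature matrix
C = np.eye(N)                        # prior covariance of the latent coefficients r

# Posterior for a given observation h (Eq. 2)
h = rng.normal(size=M)
Sigma = np.linalg.inv(np.linalg.inv(C) + A.T @ A / sigma_h**2)
mu = Sigma @ A.T @ h / sigma_h**2

def simulate(W, F, h, sigma_xi=1.0, tau_m=0.02, dt=2e-4, T=20.0, rng=rng):
    """Euler-Maruyama integration of Eq. 3:
    dr = dt/tau_m * (-r + W r + F h) + sigma_xi * sqrt(2/tau_m) * dxi."""
    n_steps = int(T / dt)
    r = np.zeros(W.shape[0])
    traces = np.empty((n_steps, W.shape[0]))
    for t in range(n_steps):
        drift = (-r + W @ r + F @ h) * dt / tau_m
        noise = sigma_xi * np.sqrt(2.0 * dt / tau_m) * rng.normal(size=r.shape)
        r = r + drift + noise
        traces[t] = r
    return traces

# With F and W chosen as in Eq. 5 below, the empirical mean and covariance of
# simulate(W, F, h) converge to mu and Sigma computed above.
```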
The matrices F and W contain the feedforward and recurrent connection weights, respectively.

The stationary distribution of r is indeed Gaussian with a mean µr(h) = (I − W)⁻¹Fh and a covariance matrix Σr ≡ ⟨(r(t) − µr)(r(t) − µr)⊤⟩_t. For the following, we will use the dependence of Σr on W (and σξ) given implicitly by the following Lyapunov equation [10]:

(W − I) Σr + Σr (W − I)⊤ = −2 σξ² I        (4)

Note that in the absence of recurrent connectivity (W = 0), the variance of every ri(t) would be exactly σξ². Note also that, just as required (see above), only the mean, µr(h), but not the covariance, Σr, depends on the input, h.

In order for the dynamics of Eq. 3 to sample from the correct posteriors, we must choose F, W and σξ such that µr(h) = µ(h) for any h, and Σr = Σ. One possible solution (which, importantly, is not unique, as we show later) is

F = (σξ/σh)² A⊤    and    W = WL ≡ I − σξ² Σ⁻¹        (5)

with arbitrary σξ > 0.

In the following, we will be interested in the likelihood matrix A only insofar as it affects the posterior covariance matrix Σ, which turns out to be the main determinant of sampling speed. We will therefore directly choose some covariance matrix Σ, and set h = 0 without loss of generality.

3 Langevin sampling is very slow

Langevin sampling (LS) is a common sampling technique [2, 11, 12], and in fact the only one that has been proposed to be neurally implemented for continuous variables [6, 13]. According to LS, a stochastic dynamical system performs "noisy gradient ascent of the log posterior":

dr = (1/2) (∂/∂r) log p(r|h) dt + dξ        (6)

where dξ is a unitary Wiener process. When r|h is Gaussian, Eq. 6 reduces to Eq. 3 for σξ = 1 and the choice of F and W given in Eq. 5 – hence the notation WL above. Note that WL is symmetric.

As we show now, this choice of weight matrix leads to critically slow mixing (i.e. very long autocorrelation time scales in r(t)) when N is large. In a linear network, the average autocorrelation length is dominated by the decay time constant τmax of the slowest eigenmode, i.e. the eigenvector of (W − I) associated with the eigenvalue λmax^{W−I} which, of all the eigenvalues of (W − I), has the largest real part (which must still be negative, to ensure stability). The contribution of the slowest eigenmode to the sample autocorrelation time is τmax = −τm / Re(λmax^{W−I}), so sampling becomes very slow when Re(λmax^{W−I}) approaches 0. This is, in fact, what happens with LS as N → ∞. Indeed, we could derive the following generic lower bound (details can be found in our Supplementary Information, SI):

λmax^{WL−I} ≥ −(σξ/σ0)² / √(1 + N σr²)        (7)

which is shown as dashed lines in Fig. 2. Thus, LS becomes infinitely slow in the large-N limit when pairwise correlations do not vanish in that limit (or at least not as fast as N^{−1/2} in their std.).
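As a quick numerical illustration of this slowing (a sketch, not the code behind Fig. 2: the inverse-Wishart-style construction, sizes and seed below are our own illustrative choices), one can build WL for increasingly large, randomly correlated posteriors and read off the slowest eigenmode directly:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_covariance(N, sigma_0=1.0, sigma_r=0.2):
    """Inverse-Wishart-style covariance: E[Sigma] = sigma_0^2 * I, with off-diagonal
    correlations whose spread is roughly sigma_r (illustrative construction)."""
    nu = N - 1 + int(1.0 / sigma_r**2)                 # degrees of freedom
    X = rng.normal(size=(N, nu))
    Sigma_inv = X @ X.T / (sigma_0**2 * (nu - N - 1))  # Wishart-distributed precision
    return np.linalg.inv(Sigma_inv)

sigma_xi = 1.0
for N in (20, 100, 500):
    Sigma = random_covariance(N)
    W_L = np.eye(N) - sigma_xi**2 * np.linalg.inv(Sigma)   # Langevin weights, Eq. 5
    B = W_L - np.eye(N)

    # Eq. 4 holds by construction: (W_L - I) Sigma + Sigma (W_L - I)^T = -2 sigma_xi^2 I
    assert np.allclose(B @ Sigma + Sigma @ B.T, -2 * sigma_xi**2 * np.eye(N), atol=1e-6)

    # Slowest eigenmode of (W_L - I): tau_max / tau_m = -1 / lambda_max (Sec. 3)
    lam_max = np.max(np.linalg.eigvalsh(B))    # W_L is symmetric, so eigvalsh applies
    print(f"N = {N:4d}   slowing factor tau_max/tau_m = {-1.0 / lam_max:7.1f}")
```

The printed slowing factor grows roughly linearly with N for this construction, in line with the scaling of Eq. 8 below.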
Slowing becomes even worse when Σ is drawn from the inverse Wishart distribution with ν degrees of freedom and scale matrix ω⁻²I (Fig. 2). We choose ν = N − 1 + ⌊σr⁻²⌋ and ω⁻² = σ0² (ν − N − 1), such that the expected value of a diagonal element (variance) in Σ is σ0², and the distribution of pairwise correlations is centered on zero with variance σr².

Figure 2: Langevin sampling (LS) is slow in high dimensions. Random covariance matrices Σ of size N are drawn from an inverse Wishart distribution with parameters chosen such that the average diagonal element (variance) is σ0² = 1 and the distribution of pairwise correlations has zero mean and variance σr² (right). Sampling from N(0, Σ) using a stochastic neural network (cf. Fig. 1) with W = WL (LS, symmetric solution) becomes increasingly slow as N grows, as indicated by the relative decay time constant τmax/τm of the slowest eigenmode of (WL − I) (left), which is also the negative inverse of its largest eigenvalue (middle). Dots indicate the numerical evaluation of the corresponding quantities, and error bars (barely noticeable) denote standard deviation across several random realizations of Σ. Dashed lines correspond to the generic bound in Eq. 7. Solid lines are obtained from random matrix theory under the assumption that Σ is drawn from an inverse Wishart distribution (Eq. 8). Parameters: σξ = σ0 = 1.

The asymptotic behavior of the smallest eigenvalue of Σ⁻¹ (the square of the smallest singular value of a random ν × N rectangular matrix) is known from random matrix theory (e.g. [14]), and we have for large N:

λmax^{WL−I} ≈ −(σξ/σ0)² (√(N − 1 + ⌊σr⁻²⌋) − √N)² / (⌊σr⁻²⌋ − 2)  ∼  −O(1/N)        (8)

This scaling behavior is shown in Fig. 2 (solid lines). In fact, we can also show (cf. SI) that LS is (locally) the slowest possible choice (see Sec. 4 below for a precise definition of "slowest", and SI for details).

Note that both Eqs. 7-8 are inversely proportional to the ratio (σ0/σξ), which tells us how much the recurrent interactions must amplify the external noise in order to produce samples from the right stationary activity distribution. The more amplification is required (σ0 ≫ σξ), the slower the dynamics of LS. Conversely, one could potentially make Langevin sampling faster by increasing σξ, but σξ would need to scale as √N to annihilate the critical slowing problem. This – in itself – is unrealistic; moreover, it would also require the resulting connectivity matrix to have a large negative diagonal (O(−N)) – i.e. the intrinsic neuronal time constant τm to scale as O(1/N) – which is perhaps even more unrealistic.²

²From a pure machine learning perspective, increasing σξ is not an option either: the increasing stiffness of Eq. 6 would either require the use of a very small integration step, or would lead to arbitrarily small acceptance ratios in the context of Metropolis-Hastings proposals.

Note also that LS can be sped up by appropriate "preconditioning" (e.g. [15, 16]), for example using the inverse Hessian of the log-posterior.
In our case, a simple calculation shows that this corresponds to removing all recurrent connections, and pushing the posterior covariance matrix to the external noise sources, which is only postponing the problem to some other brain network.

Finally, LS is fundamentally implausible as a neuronal implementation: it imposes symmetric synaptic interactions, which is simply not possible in the brain due to the existence of distinct classes of excitatory and inhibitory neurons (Dale's principle). In the following section, we show that networks can be constructed that overcome all the above limitations of LS in a principled way.

Figure 3: How fast is the fastest sampler? (A) Scalar measure of the statistical dependency between any two samples collected kτm seconds apart (cf. main text), for Langevin sampling (black), Gibbs sampling (blue, assuming a full update sweep is done every τm), a series of networks (brown to red) with connectivities given by Eq. 9 where the elements of the skew-symmetric matrix S were drawn iid. from N(0, ζ²) for different values of ζ (see also panel B), the unconstrained optimized network (yellow), and the optimized E/I network (green). For reference, the dashed gray line shows the behavior of a network in which there are no recurrent interactions, and the posterior covariance is encoded in the covariance of the input noise, which in fact corresponds to Langevin sampling with inverse Hessian ("Newton"-like) preconditioning [16]. (B) Total slowing cost ψslow(S) when Si<j ∼ N(0, ζ²), for increasing values of ζ. The Langevin and the two optimized networks are shown as horizontal lines for comparison. (C) Same as in (B), showing the root mean square (RMS) value of the synaptic weights. Parameter values: N = 200, NI = 100, σξ = 1, τm = 20 ms.

4 General solution and quantification of sampling speed

While Langevin dynamics (Eq. 6) provide a general recipe for sampling from any given posterior density, they unduly constrain the recurrent interactions to be symmetric – at least in the Gaussian case. To see why this is a drastic restriction, let us observe that any connectivity matrix of the form

W(S) = I + (−σξ² I + S) Σ⁻¹        (9)

where S is an arbitrary skew-symmetric matrix (S⊤ = −S), solves Eq. 4, and therefore induces the correct stationary distribution N(·, Σ) under the linear stochastic dynamics of Eq. 3. Note that Langevin sampling corresponds to S = 0 (cf. Eq. 5). In general, though, there are O(N²) degrees of freedom in the skew-symmetric matrix S, which could perhaps be exploited to increase the mixing rate. In Sec. 5, we will show that indeed a large gain in sampling speed can be obtained through an appropriate choice of S.
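As a sanity check of Eq. 9 (an illustrative sketch; the equicorrelated target covariance, sizes and seed below are arbitrary choices, not those of the paper), any skew-symmetric S leaves the stationary covariance untouched while changing the mixing speed:

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma_xi = 50, 1.0
rho = 0.2                                        # uniform pairwise correlation (illustrative)
Sigma = (1 - rho) * np.eye(N) + rho * np.ones((N, N))

def W_of_S(S):
    """Family of solutions of Eq. 4, as in Eq. 9: W(S) = I + (-sigma_xi^2 I + S) Sigma^{-1}."""
    return np.eye(N) + (-sigma_xi**2 * np.eye(N) + S) @ np.linalg.inv(Sigma)

def stationary_cov_ok(W):
    """Does (W - I) Sigma + Sigma (W - I)^T = -2 sigma_xi^2 I hold (Eq. 4)?"""
    B = W - np.eye(N)
    return np.allclose(B @ Sigma + Sigma @ B.T, -2 * sigma_xi**2 * np.eye(N), atol=1e-8)

for zeta in (0.0, 0.4, 1.6):                     # zeta = 0 recovers the Langevin solution WL
    G = rng.normal(scale=zeta, size=(N, N))
    S = np.triu(G, 1) - np.triu(G, 1).T          # skew-symmetric: S.T == -S
    W = W_of_S(S)
    lam_max = np.linalg.eigvals(W - np.eye(N)).real.max()
    print(f"zeta = {zeta:3.1f}   same stationary covariance: {stationary_cov_ok(W)}   "
          f"tau_max/tau_m = {-1.0 / lam_max:5.2f}")
```

For this toy Σ the covariance check passes for every ζ, while the slowest relaxation time shrinks as the skew-symmetric part grows, qualitatively mirroring the random-S networks (brown to red) in Fig. 3.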
For now, let us quantify slowness. Let Λ ≡ diag(Σ) be the diagonal matrix that contains all the posterior variances, and K(S, τ) ≡ ⟨(r(t + τ) − µ)(r(t) − µ)⊤⟩_t be the matrix of lagged covariances among neurons under the stationary distribution of the dynamics (so that Λ^{−1/2} K(S, τ) Λ^{−1/2} is the autocorrelation matrix of the network). Note that K(S, 0) = Σ is the posterior covariance matrix, and that for fixed Σ, σξ² and τm, K(S, τ) depends only on the lag τ and on the matrix of recurrent weights W, which itself depends only on the skew-symmetric matrix S of free parameters. We then define a "total slowing cost"

ψslow(S) = 1/(2 τm N²) ∫₀^∞ ‖ Λ^{−1/2} K(S, τ) Λ^{−1/2} ‖²_F dτ        (10)

which penalizes the magnitude of the temporal (normalized) autocorrelations and pairwise cross-correlations in the sequence of samples generated by the circuit dynamics. Here ‖M‖²_F ≡ trace(MM⊤) = ∑_ij M_ij² is the squared Frobenius norm of M.

Using the above measure of slowness, we revisit the mixing behavior of LS on a toy covariance matrix Σ drawn from the same inverse Wishart distribution mentioned above with parameters N = 200, σ0² = 2 and σr = 0.2. We further regularize Σ by adding the identity matrix to it, which does not change anything in terms of the scaling law of Eq. 8 but ensures that the diagonal of WL remains bounded as N grows large. We will use the same Σ in the rest of the paper. Figure 3A shows ‖Λ^{−1/2} K(S, τ) Λ^{−1/2}‖_F as a function of the time lag τ: as predicted in Sec. 3, mixing is indeed an order of magnitude slower for LS (S = 0, solid black line) than the single-neuron time constant τm (grey dashed line). Note that ψslow (Eq. 10, Fig. 3B) is proportional to the area under the squared curve shown in Fig. 3A. Sample activity traces for this network, implementing LS, can be found in Fig. 4B (top).

Using the same measure of slowness, we also inspected the speed of Gibbs sampling, another widely used sampling technique (e.g. [17]) inspiring neural network dynamics for sampling from distributions over binary variables [18, 19, 20]. Gibbs sampling defines a Markov chain that operates in discrete time, and also uses a symmetric weight matrix. In order to compare its mixing speed with that of our continuous stochastic dynamics, we assume that a full update step (in which all neurons have been updated once) takes time τm. We estimated the integrand of the slowing cost (Eq. 10) numerically using 30,000 samples generated by the Gibbs chain (Fig. 3A, blue).
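For the linear networks themselves (as opposed to the Gibbs chain), no samples are needed to evaluate Eq. 10: for a stationary linear (Ornstein-Uhlenbeck) process the lagged covariance has the closed form K(S, τ) = exp((W − I)τ/τm) Σ for τ ≥ 0 [10]. The sketch below (illustrative function names; the integration horizon and step are arbitrary) evaluates the integrand of Eq. 10 and a crude estimate of ψslow from W and Σ alone:

```python
import numpy as np
from scipy.linalg import expm

def slowness_integrand(W, Sigma, lags, tau_m=1.0):
    """||Lambda^{-1/2} K(tau) Lambda^{-1/2}||_F at each lag (the quantity plotted in Fig. 3A).
    Uses the closed form K(tau) = expm((W - I) * tau / tau_m) @ Sigma of a stationary
    linear (OU) network; lags are in the same units as tau_m."""
    N = len(Sigma)
    inv_sqrt_var = 1.0 / np.sqrt(np.diag(Sigma))
    A = (W - np.eye(N)) / tau_m
    norms = []
    for tau in lags:
        K = expm(A * tau) @ Sigma
        norms.append(np.linalg.norm(inv_sqrt_var[:, None] * K * inv_sqrt_var[None, :]))
    return np.array(norms)

def psi_slow(W, Sigma, tau_m=1.0, horizon=100.0, dt=0.05):
    """Crude trapezoidal estimate of the total slowing cost of Eq. 10."""
    lags = np.arange(0.0, horizon, dt)
    curve = slowness_integrand(W, Sigma, lags, tau_m)
    return np.trapz(curve**2, lags) / (2.0 * tau_m * len(Sigma)**2)
```

Applied to WL and to W(S) with a nonzero skew-symmetric S (e.g. from the sketch after Eq. 9), psi_slow reproduces the qualitative ordering of Fig. 3B, although the exact values depend on the arbitrary integration horizon.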
Gibbs sampling is\ncomparable to LS here: samples are still correlated on a timescale of order \u223c 50 \u03c4m.\nFinally, one may wonder how a random choice of S would perform in terms of decorrelation speed.\nWe drew random skew-symmetric S matrices from the Gaussian ensemble, Si<j \u223c N (0, \u03b6 2), and\ncomputed the slowing cost (Fig. 3, red). As the magnitude \u03b6 of S increases, sampling becomes\nfaster and faster until the dynamics is about as fast as the single-neuron time constant \u03c4m. However,\nthe synaptic weights also grow with \u03b6 (Fig. 3C), and we show in Sec. 5 that an even faster sampler\nexists that has comparatively weaker synapses. It is also interesting to note that the slope of \u03c8slow at\n\u03b6 = 0 is zero, suggesting that LS is in fact maximally slow (we prove this formally in the SI).\n\n5 What is the fastest sampler?\n\nWe now show that the skew-symmetric matrix S can be optimized for sampling speed, by directly\nminimizing the slowing cost \u03c8slow(S) (Eq. 10), subject to an L2-norm penalty. We thus seek to\nminimize:\n\nL(S) \u2261 \u03c8slow(S) +\n\n\u03bbL2\n\n2N 2 (cid:107)W(S)(cid:107)2\nF .\n\n(11)\n\n(12)\n\nThe key to performing this minimization is to use classical Ornstein-Uhlenbeck theory (e.g. [10]) to\nbring our slowness cost under a form mathematically analogous to a different optimization problem\nthat has arisen recently in the \ufb01eld of robust control [21]. We can then use analytical results obtained\nthere concerning the gradient of \u03c8slow, and obtain the overall gradient:\n\n(cid:2)(\u03a3\u22121PQ)(cid:62) \u2212 (\u03a3\u22121PQ)(cid:3) +\n\n(cid:2)S\u03a3\u22122 + \u03a3\u22122S(cid:3)\n\n\u2202L(S)\n\u2202S\n\n=\n\n1\nN 2\n\n\u03bbL2\nN 2\n\nwhere matrices P and Q are obtained by solving two dual Lyapunov equations. All details can be\nfound in our SI.\nWe initialized S with random, weak and uncorrelated elements (cf. the end of Sec. 4, with \u03b6 = 0.01),\nand ran the L-BFGS optimization algorithm using the gradient of Eq. 12 to minimize L(S) (with\n\u03bbL2 = 0.1). The resulting, optimal sampler is an order of magnitude faster than either Langevin or\nGibbs sampling: samples are decorrelated on a timescale that is even faster than the single-neuron\ntime constant \u03c4m (Fig. 3A, orange). We also found that fast solutions (with correlation length \u223c \u03c4m)\ncan be found irrespective of the size N of the state space (not shown), meaning that the relative\nspeed-up between the optimal solution and LS grows with N (cf. Fig. 2).\nThe optimal Sopt induces a weight matrix Wopt given by Eq. 9 and shown in Fig. 4A (middle).\nNotably, Wopt is no longer symmetric, and its elements are much larger than in the Langevin\nsymmetric solution WL with the same stationary covariance, albeit orders of magnitude smaller\nthan in random networks of comparable decorrelation speed (Fig. 3C).\nIt is illuminating to visualize activity trajectories in the plane de\ufb01ned by the topmost and bottommost\neigenvectors of \u03a3, i.e. the \ufb01rst and last principal components (PCs) of the network activity (Fig. 4C).\nThe distribution of interest is broad along some dimensions, and narrow along others. In order to\nsample ef\ufb01ciently, large steps ought to be taken along directions in which the distribution is broad,\nand small steps along directions in which the distribution is narrow. This is exactly what our optimal\nsampler does, whereas LS takes small steps along both broad and narrow directions (Fig. 
4C).\n\n6 Balanced E/I networks for fast sampling\n\nWe can further constrain our network to obey Dale\u2019s law, i.e. the separation of neurons into separate\nexcitatory (E) and inhibitory (I) groups. The main dif\ufb01culty in building such networks is that picking\nan arbitrary skew-symmetric matrix S in Eq. 9 will not yield the column sign structure of an E/I\nnetwork in general. Therefore, we no longer have a parametric form for the solution matrix manifold\non which to \ufb01nd the fastest network. However, by extending the methods of Sec. 5, described in\n\n6\n\n\fFigure 4: Fast sampling with optimized networks. (A) Synaptic weight matrices for the Langevin\nnetwork (top), the fastest sampler (middle) and the fastest sampler that obeys Dale\u2019s law (bottom).\nNote that the synaptic weights in both optimized networks are an order of magnitude larger than in\nthe symmetric Langevin solution. The \ufb01rst two networks are of size N = 200, while the optimized\nE/I network has size N + NI = 300. (B) 500 ms of spontaneous network activity (h = 0) in each of\nthe three networks, for all of which the stationary distribution of r (restricted here to the \ufb01rst 40 neu-\nrons) is the same multivariate Gaussian. (C) Left: activity trajectories (the same 500 ms as shown\nin (B)) in the plane de\ufb01ned by the topmost and bottommost eigenvectors of the posterior covari-\nance matrix \u03a3 (corresponding to the \ufb01rst and last principal components of the activity \ufb02uctuations\nr(t)). For the E/I network, the projection is restricted to the excitatory neurons. Right: distribu-\ntion of increments along both axes, measured in 1 ms time steps. Langevin sampling takes steps of\ncomparable size along all directions, while the optimized networks take much larger steps along the\ndirections of large variance prescribed by the posterior. (D) Distributions of correlations between\nthe time courses of total excitatory and inhibitory input in individual neurons.\n\ndetail in our SI, we can still formulate the problem as one of unconstrained optimization, and obtain\nthe fastest, balanced E/I sampler.\nWe consider the posterior to be encoded in the activity of the N = 200 excitatory neurons, and add\nNI = 100 inhibitory neurons which we regard as auxiliary variables, in the spirit of Hamiltonian\nMonte Carlo methods [11]. Consequently, the E-I and I-I covariances are free parameters, while\nthe E-E covariance is given by the target posterior. For additional biological realism, we also forbid\nself-connections as they can be interpreted as a modi\ufb01cation of the intrinsic membrane time constant\nof the single neurons, which in principle cannot be arbitrarily learned.\nThe speed optimization yields the connectivity matrix shown in Fig. 4A (bottom). Results for this\nnetwork are presented in a similar format as before, in the same \ufb01gures. Sampling is almost as fast\nas in the best (regularized) unconstrained network (compare yellow and green in Fig. 3), indicating\nthat Dale\u2019s law \u2013 unlike the symmetry constraint implicitly present in Langevin sampling \u2013 is not\nfundamentally detrimental to mixing speed. Moreover, the network operates in a regime of excita-\ntion/inhibition balance, whereby the total E and I input time courses are correlated in single cells\n(Fig. 4D, bottom). This is true also in the unconstrained optimal sampler. 
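One simple way to quantify this balance from simulated activity is sketched below (illustrative only: splitting each neuron's recurrent input into the contributions of positive and negative weights is our assumption about how an E/I decomposition like Fig. 4D could be computed, not a definition taken from the paper):

```python
import numpy as np

def ei_input_correlation(W, r_traces):
    """Per-neuron correlation between the summed 'excitatory' (positive-weight) and
    'inhibitory' (negative-weight) recurrent input time courses.
    r_traces has shape (timesteps, neurons), e.g. from an Euler-Maruyama simulation
    of Eq. 3; W is the recurrent weight matrix."""
    e_in = r_traces @ np.clip(W, 0.0, None).T       # input through positive weights
    i_in = -(r_traces @ np.clip(W, None, 0.0).T)    # magnitude of input through negative
                                                    # weights, so that E and I inputs that
                                                    # track each other correlate positively
    n = W.shape[0]
    return np.array([np.corrcoef(e_in[:, j], i_in[:, j])[0, 1] for j in range(n)])
```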
In contrast, E and I inputs\nare strongly anti-correlated in LS.\n\n7\n\nLangevinweightmatricessampleactivitytracestrajectoriesinstatespace(1mssteps)dist.ofincrements(1mssteps)ABCDoptimizednet.optimizedE/Inet.12040100msri(t)ri(t)ri(t)12040100ms12040100ms-404-200200500ms-404-20020-404-20020-303-303-303-101-101-101postsynaptic-0.100.1postsynaptic-101postsynapticpresynaptic-1-0.500.5neuron#-8-4048neuron#-8-4048neuron#-8-4048lastPClastPClastPC\ufb01rstPCstepalong{\ufb01rst|last}PCE/Icorr.\f7 Discussion\n\nWe have studied sampling for Bayesian inference in neural circuits, and observed that a linear\nstochastic network is able to sample from the posterior under a linear Gaussian latent variable model.\nHidden variables are directly encoded in the activity of single neurons, and their joint activity un-\ndergoes moment-to-moment \ufb02uctuations that visit each portion of the state space at a frequency\ngiven by the target posterior density. To achieve this, external noise sources fed into the network are\nampli\ufb01ed by the recurrent circuitry, but preferentially ampli\ufb01ed along the state-space directions of\nlarge posterior variance. Although, for the very simple linear Gaussian model we considered here,\na purely feed-forward architecture would also trivially be able to provide independent samples (ie.\nprovide samples that are decorrelated at the time scale of \u03c4m), the network required to achieve this\nis deeply biologically implausible (see SI).\nWe have shown that the choice of a symmetric weight matrix \u2013 equivalent to LS, a popular ma-\nchine learning technique [2, 11, 12] that has been suggested to underlie neuronal network dynamics\nsampling continuous variables [6, 13] \u2013 is most unfortunate. We presented an analytical argument\npredicting dramatic slowing in high-dimensional latent spaces, supported by numerical simulations.\nEven in moderately large networks, samples were correlated on timescales much longer than the\nsingle-neuron decay time constant.\nWe have also shown that when the above symmetry constraint is relaxed, a family of other solutions\nopens up that can potentially lead to much faster sampling. We chose to explore this possibility\nfrom a normative viewpoint, optimizing the network connectivity directly for sampling speed. The\nfastest sampler turned out to be highly asymmetric and typically an order of magnitude faster than\nLangevin sampling. Notably, we also found that constraining each neuron to be either excitatory\nor inhibitory does not impair performance while giving a far more biologically plausible sampler.\nDale\u2019s law could even provide a natural safeguard against reaching slow symmetric solutions such\nas Langevin sampling, which we saw was the worst-case scenario (cf. also SI).\nIt is worth noting that Wopt is strongly nonnormal.3 Deviation from normality has important con-\nsequences for the dynamics of our networks: it makes the network sensitive to perturbations along\nsome directions in state space. Such perturbations are rapidly ampli\ufb01ed into large, transient ex-\ncursions along other, relevant directions. This phenomenon has been shown to explain some key\nfeatures of spontaneous activity in primary visual cortex [9] and primary motor cortex [22].\nSeveral aspects would need to be addressed before our proposal can crystalize into a more thorough\nunderstanding of the neural implementation of the sampling hypothesis. 
First, can local synaptic\nplasticity rules perform the optimization that we have approached from an algorithmic viewpoint?\nSecond, what is the origin of the noise that we have hypothesized to come from external sources?\nThird, what kind of nonlinearity must be added in order to allow sampling from non-Gaussian distri-\nbutions, whose shapes may have non-trivial dependencies on the observations? Also, does the main\ninsight reached here \u2013 namely that fast samplers are to be found among nonsymmetric, nonnormal\nnetworks \u2013 carry over to the nonlinear case? As a proof of principle, in preliminary simulations, we\nhave shown that speed optimization in a linearized version of a nonlinear network (with a tanh gain\nfunction) does yield fast sampling in the nonlinear regime, even when \ufb02uctuations are strong enough\nto trigger the nonlinearity and make the resulting sampled distribution non-Gaussian (details in SI).\nFinally, we have also shown (see SI) that the Langevin solution is the only network that satis\ufb01es the\ndetailed balance condition [23] in our model class; reversibility is violated in all other stochastic net-\nworks we have presented here (random, optimal, optimal E/I). The fact that these networks are faster\nsamplers is in line with recent machine learning studies on how non-reversible Markov chains can\nmix faster than their reversible counterparts [24]. The construction of such Monte-Carlo algorithms\nhas proven challenging [25, 26, 27], suggesting that the brain \u2013 if it does indeed use sampling-based\nrepresentations \u2013 might have something yet to teach us about machine learning.\n\nAcknowledgements This work was supported by the Wellcome Trust (GH, ML), the Swiss Na-\ntional Science Foundation (GH) and the Gatsby Charitable Foundation (LA). Our code will be made\nfreely available from GH\u2019s personal webpage.\n\nnormal matrix W (such as the Langevin solution, WL),(cid:80)\n\n3Indeed, the sum of the squared moduli of its eigenvalues accounts for only 25% of (cid:107)Wopt(cid:107)2\nF, i.e. this ratio is 100%.\n\ni |\u03bbi|2 = (cid:107)W(cid:107)2\n\n8\n\nF [7]. For a\n\n\fReferences\n[1] S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system. Nature, 381:520\u2014\n\n522, 1996.\n\n[2] D. MacKay. Information theory, inference, and learning algorithms. Cambridge University Press, 2003.\n[3] D. Knill and A. Pouget. The Bayesian brain: the role of uncertainty in neural coding and computation.\n\nTrends in Neurosciences, 27:712\u2013719, 2004.\n\n[4] J. Fiser, P. Berkes, G. Orb\u00b4an, and M. Lengyel. Statistically optimal perception and learning: from behavior\n\nto neural representations. Trends in Cognitive Sciences, 14:119\u2013130, 2010.\n\n[5] P. Berkes, G. Orb\u00b4an, M. Lengyel, and J. Fiser. Spontaneous cortical activity reveals hallmarks of an\n\noptimal internal model of the environment. Science, 331:83\u201387, 2011.\n\n[6] R. Moreno-Bote, D. C. Knill, and A. Pouget. Bayesian sampling in visual perception. Proceedings of the\n\nNational Academy of Sciences, 108:12491\u201312496, 2011.\n\n[7] L. N. Trefethen and M. Embree. Spectra and pseudospectra: the behavior of nonnormal matrices and\n\noperators. Princeton University Press, 2005.\n\n[8] G. Hennequin, T. P. Vogels, and W. Gerstner. Non-normal ampli\ufb01cation in random balanced neuronal\n\nnetworks. Physical Review E, 86:011909, 2012.\n\n[9] B. K. Murphy and K. D. Miller. 
Balanced ampli\ufb01cation: A new mechanism of selective ampli\ufb01cation of\n\nneural activity patterns. Neuron, 61:635\u2013648, 2009.\n\n[10] C. W. Gardiner. Handbook of stochastic methods: for physics, chemistry, and the natural sciences. Berlin:\n\nSpringer, 1985.\n\n[11] R. Neal. MCMC using Hamiltonian dynamics. Handbook of MCMC, pages 113\u2013162, 2011.\n[12] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings\n\nof the International Conference on Machine Learning, 2011.\n\n[13] A. Grabska-Barwinska, J. Beck, A. Pouget, and P. Latham. Demixing odors - fast inference in olfaction.\nIn C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in\nNeural Information Processing Systems 26, pages 1968\u20131976. Curran Associates, Inc., 2013.\n\n[14] Mark Rudelson and Roman Vershynin. Smallest singular value of a random rectangular matrix. Commu-\n\nnications on Pure and Applied Mathematics, 62:1707\u20131739, 2009.\n\n[15] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods.\n\nJournal of the Royal Statistical Society: Series B (Statistical Methodology), 73:123\u2013214, 2011.\n\n[16] J. Martin, L. C. Wilcox, C. Burstedde, and O. Ghattas. A stochastic Newton MCMC method for large-\nscale statistical inverse problems with application to seismic inversion. SIAM Journal on Scienti\ufb01c Com-\nputing, 34:A1460\u2013A1487, 2012.\n\n[17] M. Mezard and A. Montanari. Information, physics, and computation. Oxford University Press, 2009.\n[18] G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines. In D E Rumelhart, J L\nMcClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the\nMicrostructure of Cognition, volume 1: Foundations, chapter 7, pages 282\u2013317. MIT Press, Cambridge,\nMA, 1986.\n\n[19] G E Hinton, P Dayan, B J Frey, and R M Neal. The \u201dwake-sleep\u201d algorithm for unsupervised neural\n\nnetworks. Science, 268(5214):1158\u20131161, 1995.\n\n[20] L. Buesing, J. Bill, B. Nessler, and W. Maass. Neural dynamics as sampling: a model for stochastic\n\ncomputation in recurrent networks of spiking neurons. PLoS Computational Biology, 7:1\u201322, 2011.\n\n[21] J. Vanbiervliet, B. Vandereycken, W. Michiels, S. Vandewalle, and M. Diehl. The smoothed spectral\n\nabscissa for robust stability optimization. SIAM Journal on Optimization, 20:156\u2013171, 2009.\n\n[22] G. Hennequin, T. P. Vogels, and W. Gerstner. Optimal control of transient dynamics in balanced networks\n\nsupports generation of complex movements. Neuron, 82, 2014.\n\n[23] W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications.\n\nBiometrika, 57:97\u2013109, 1970.\n\n[24] Akihisa Ichiki and Masayuki Ohzeki. Violation of detailed balance accelerates relaxation. Physical\n\nReview E, 88:020101, 2013.\n\n[25] Y. Sun, J. Schmidhuber, and F. J. Gomez.\n\nImproving the asymptotic performance of Markov Chain\nMonte-Carlo by inserting vortices. In J.D. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R.S. Zemel, and\nA. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2235\u20132243. 2010.\n\n[26] K. S. Turitsyn, M. Chertkov, and M. Vucelja. Irreversible Monte Carlo algorithms for ef\ufb01cient sampling.\n\nPhysica D: Nonlinear Phenomena, 240:410\u2013414, 2011.\n\n[27] Joris Bierkens. Non-reversible Metropolis-Hastings. 
arXiv:1401.8087 [math], 2014.\n\n9\n\n\f", "award": [], "sourceid": 1176, "authors": [{"given_name": "Guillaume", "family_name": "Hennequin", "institution": "University of Cambridge"}, {"given_name": "Laurence", "family_name": "Aitchison", "institution": "University College London"}, {"given_name": "Mate", "family_name": "Lengyel", "institution": "University of Cambridge"}]}