{"title": "Fast Sampling-Based Inference in Balanced Neuronal Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2240, "page_last": 2248, "abstract": "Multiple lines of evidence support the notion that the brain performs probabilistic inference in multiple cognitive domains, including perception and decision making. There is also evidence that probabilistic inference may be implemented in the brain through the (quasi-)stochastic activity of neural circuits, producing samples from the appropriate posterior distributions, effectively implementing a Markov chain Monte Carlo algorithm. However, time becomes a fundamental bottleneck in such sampling-based probabilistic representations: the quality of inferences depends on how fast the neural circuit generates new, uncorrelated samples from its stationary distribution (the posterior). We explore this bottleneck in a simple, linear-Gaussian latent variable model, in which posterior sampling can be achieved by stochastic neural networks with linear dynamics. The well-known Langevin sampling (LS) recipe, so far the only sampling algorithm for continuous variables of which a neural implementation has been suggested, naturally fits into this dynamical framework. However, we first show analytically and through simulations that the symmetry of the synaptic weight matrix implied by LS yields critically slow mixing when the posterior is high-dimensional. Next, using methods from control theory, we construct and inspect networks that are optimally fast, and hence orders of magnitude faster than LS, while being far more biologically plausible. In these networks, strong -- but transient -- selective amplification of external noise generates the spatially correlated activity fluctuations prescribed by the posterior. 
Intriguingly, although a detailed balance of excitation and inhibition is dynamically maintained, detailed balance of Markov chain steps in the resulting sampler is violated, consistent with recent findings on how statistical irreversibility can overcome the speed limitation of random walks in other domains.", "full_text": "Fast Sampling-Based Inference in Balanced Neuronal Networks\n\nGuillaume Hennequin1\ngjeh2@cam.ac.uk\n\nLaurence Aitchison2\nlaurence@gatsby.ucl.ac.uk\n\nMáté Lengyel1\nm.lengyel@eng.cam.ac.uk\n\n1Computational & Biological Learning Lab, Dept. of Engineering, University of Cambridge, UK\n2Gatsby Computational Neuroscience Unit, University College London, UK\n\nAbstract\n\nMultiple lines of evidence support the notion that the brain performs probabilistic inference in multiple cognitive domains, including perception and decision making. There is also evidence that probabilistic inference may be implemented in the brain through the (quasi-)stochastic activity of neural circuits, producing samples from the appropriate posterior distributions, effectively implementing a Markov chain Monte Carlo algorithm. However, time becomes a fundamental bottleneck in such sampling-based probabilistic representations: the quality of inferences depends on how fast the neural circuit generates new, uncorrelated samples from its stationary distribution (the posterior). We explore this bottleneck in a simple, linear-Gaussian latent variable model, in which posterior sampling can be achieved by stochastic neural networks with linear dynamics. The well-known Langevin sampling (LS) recipe, so far the only sampling algorithm for continuous variables of which a neural implementation has been suggested, naturally fits into this dynamical framework. 
However, we first show analytically and through simulations that the symmetry of the synaptic weight matrix implied by LS yields critically slow mixing when the posterior is high-dimensional. Next, using methods from control theory, we construct and inspect networks that are optimally fast, and hence orders of magnitude faster than LS, while being far more biologically plausible. In these networks, strong – but transient – selective amplification of external noise generates the spatially correlated activity fluctuations prescribed by the posterior. Intriguingly, although a detailed balance of excitation and inhibition is dynamically maintained, detailed balance of Markov chain steps in the resulting sampler is violated, consistent with recent findings on how statistical irreversibility can overcome the speed limitation of random walks in other domains.\n\n1 Introduction\n\nThe high speed of human sensory perception [1] is puzzling given its inherent computational complexity: sensory inputs are noisy and ambiguous, and therefore do not uniquely determine the state of the environment for the observer, which makes perception akin to a statistical inference problem. Thus, the brain must represent and compute with complex and often high-dimensional probability distributions over relevant environmental variables. Most state-of-the-art machine learning techniques for large-scale inference trade inference accuracy for computing speed (e.g. [2]). The brain, on the contrary, seems to enjoy both simultaneously [3].\n\nSome probabilistic computations can be made easier through an appropriate choice of representation for the probability distributions of interest. Sampling-based representations used in Monte Carlo techniques, for example, make computing moments of the distribution or its marginals straightforward. 
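As a concrete illustration of this point, a minimal numpy sketch (the 2-D covariance and the sample count are arbitrary illustrative choices, not taken from the paper): given samples from a distribution, moments and marginal probabilities reduce to simple averages.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary 2-D Gaussian standing in for a posterior; under the
# sampling hypothesis, these samples would be produced by the circuit.
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=50_000)

# Moments and marginal probabilities are simple averages over samples.
post_mean = samples.mean(axis=0)                 # close to [0, 0]
post_cov = np.cov(samples, rowvar=False)         # close to Sigma
p_first_positive = (samples[:, 0] > 0).mean()    # a marginal probability
```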
Indeed, recent behavioural and neurophysiological evidence suggests that the brain uses such sampling-based representations by neural circuit dynamics implementing a Markov chain Monte Carlo (MCMC) algorithm such that their trajectories in state space produce sequential samples from the appropriate posterior distribution [4, 5, 6].\n\nHowever, for sampling-based representations, speed becomes a key bottleneck: computations involving the posterior distribution become accurate only after enough samples have been collected, and one has no choice but to wait for those samples to be delivered by the circuit dynamics. For sampling to be of any practical use, the interval that separates the generation of two independent samples must be short relative to the desired behavioral timescale. Single neurons can integrate their inputs on a timescale τm ≈ 10–50 ms, whereas we must often make decisions in less than a second: this leaves just enough time to use (i.e. read out) a few tens of samples. What kinds of neural circuit dynamics are capable of producing uncorrelated samples at ∼100 Hz remains unclear.\n\nHere, we introduce a simple yet non-trivial generative model and seek plausible neuronal network dynamics for fast sampling from the corresponding posterior distribution. While some standard machine learning techniques such as Langevin or Gibbs sampling do suggest “neural network”-type solutions to sampling, not only are the corresponding architectures implausible in fundamental ways (e.g. they violate Dale’s law), but we show here that they lead to unacceptably slow mixing in high dimensions. Although the issue of sampling speed in general is well appreciated in the context of machine learning, there have been no systematic approaches to tackle it, owing in large part to the fact that sampling speed can only be evaluated empirically in most cases. 
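The sample-budget arithmetic above can be sketched explicitly. A hypothetical helper (not from the paper), assuming samples decorrelate like an Ornstein–Uhlenbeck process with correlation time τ, whose integrated autocorrelation time is 2τ:

```python
def n_independent_samples(T_seconds, tau_seconds):
    """Rough count of effectively independent samples in a window of
    length T, assuming samples decorrelate with time constant tau
    (for an OU process, the integrated autocorrelation time is 2*tau)."""
    return T_seconds / (2.0 * tau_seconds)

# With tau_m between 10 and 50 ms and a one-second decision window,
# the circuit delivers only a few tens of usable samples:
fast_neuron = n_independent_samples(1.0, 0.010)  # about 50 samples
slow_neuron = n_independent_samples(1.0, 0.050)  # about 10 samples
```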
In contrast, the simplicity of our generative model allowed us to draw an analytical picture of the problem which in turn suggested a systematic approach for solving it. Specifically, we used methods from robust control to discover the fastest neural-like sampler for our generative model, and to study its structure. We find that it corresponds to greatly non-symmetric synaptic interactions (leading to statistical irreversibility), and mathematically nonnormal1 circuit dynamics [7, 8] in close analogy with the dynamical regime in which the cortex has been suggested to operate [9].\n\n2 Linear networks perform sampling under a linear Gaussian model\n\nWe focus on a linear Gaussian latent variable model which generates observations h ∈ R^M as weighted sums of N features A ≡ (a1; . . . ; aN) ∈ R^{M×N} with jointly Gaussian coefficients r ∈ R^N, plus independent additive noise terms (Fig. 1, left). More formally:\n\np(r) = N(r; 0, C) and p(h|r) = N(h; Ar, σh²I) (1)\n\nwhere I denotes the identity matrix. The posterior distribution is multivariate Gaussian, p(r|h) = N(r; µ(h), Σ), with\n\nΣ = (C⁻¹ + AᵀA/σh²)⁻¹ and µ(h) = ΣAᵀh/σh² (2)\n\nwhere we made explicit the fact that under this simple model, only the mean, µ(h), but not the covariance of the posterior, Σ, depends on the input, h.\n\nWe are interested in neural circuit dynamics for sampling from p(r|h), whereby the data (observation) h is given as a constant feedforward input to a population of recurrently connected neurons, each of which encodes one of the latent variables and also receives inputs from an external, private source of noise ξ (Fig. 1, right). 
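As a sanity check of Eq. 2, a short numpy sketch (the dimensions, the identity prior, and σh below are arbitrary illustrative assumptions): it builds Σ and µ(h), and makes explicit that only the mean depends on the observation.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, sigma_h = 8, 5, 0.5

A = rng.standard_normal((M, N))   # likelihood ("features") matrix
C = np.eye(N)                     # prior covariance (identity for simplicity)

# Posterior covariance (Eq. 2): independent of the observation h.
Sigma = np.linalg.inv(np.linalg.inv(C) + A.T @ A / sigma_h**2)

def mu(h):
    """Posterior mean for observation h (Eq. 2)."""
    return Sigma @ A.T @ h / sigma_h**2

h = rng.standard_normal(M)        # some observation
posterior_mean = mu(h)            # changes with h; Sigma does not
```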
Our goal is to devise a network such that the activity fluctuations r(t) in the recurrent layer have a stationary distribution that matches the posterior, for any h. Specifically, we consider linear recurrent stochastic dynamics of the form:\n\ndr = dt/τm [−r(t) + Wr(t) + Fh] + σξ √(2/τm) dξ(t) (3)\n\nwhere τm = 20 ms is the single-unit “membrane” time constant, and dξ is a Wiener process of unit variance, which is scaled by a scalar noise intensity σξ. The activity ri(t) could represent either the membrane potential of neuron i, or the deviation of its momentary firing rate from a baseline. The matrices F and W contain the feedforward and recurrent connection weights, respectively.\n\n1 “Nonnormal” should not be confused with “non-Gaussian”: a matrix M is nonnormal iff MMᵀ ≠ MᵀM.\n\nFigure 1: Sampling under a linear Gaussian latent variable model using neuronal network dynamics. Left: schematics of the generative model. Right: schematics of the recognition model. See text for details.\n\nThe stationary distribution of r is indeed Gaussian with a mean µr(h) = (I − W)⁻¹Fh and a covariance matrix Σr ≡ ⟨(r(t) − µr)(r(t) − µr)ᵀ⟩_t. For the following, we will use the dependence of Σr on W (and σξ) given implicitly by the following Lyapunov equation [10]:\n\n(W − I)Σr + Σr(W − I)ᵀ = −2σξ²I (4)\n\nNote that in the absence of recurrent connectivity (W = 0), the variance of every ri(t) would be exactly σξ². Note also that, just as required (see above), only the mean, µr(h), but not the covariance, Σr, depends on the input, h.\n\nIn order for the dynamics of Eq. 
3 to sample from the correct posteriors, we must choose F, W and σξ such that µr(h) = µ(h) for any h, and Σr = Σ. One possible solution (which, importantly, is not unique, as we show later) is\n\nF = (σξ/σh)² Aᵀ and W = WL ≡ I − σξ²Σ⁻¹ (5)\n\nwith arbitrary σξ > 0.\n\nIn the following, we will be interested in the likelihood matrix A only insofar as it affects the posterior covariance matrix Σ, which turns out to be the main determinant of sampling speed. We will therefore directly choose some covariance matrix Σ, and set h = 0 without loss of generality.\n\n3 Langevin sampling is very slow\n\nLangevin sampling (LS) is a common sampling technique [2, 11, 12], and in fact the only one that has been proposed to be neurally implemented for continuous variables [6, 13]. According to LS, a stochastic dynamical system performs “noisy gradient ascent of the log posterior”:\n\ndr = (1/2) ∂/∂r log p(r|h) dt + dξ (6)\n\nwhere dξ is a unitary Wiener process. When r|h is Gaussian, Eq. 6 reduces to Eq. 3 for σξ = 1 and the choice of F and W given in Eq. 5 – hence the notation WL above. Note that WL is symmetric.\n\nAs we show now, this choice of weight matrix leads to critically slow mixing (i.e. very long autocorrelation time scales in r(t)) when N is large. In a linear network, the average autocorrelation length is dominated by the decay time constant τmax of the slowest eigenmode, i.e. the eigenvector of (W − I) associated with the eigenvalue λmax^(W−I) which, of all the eigenvalues of (W − I), has the largest real part (which must still be negative, to ensure stability). 
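A quick numerical check of Eqs. 4–5 (a sketch; the size N and the random construction of Σ are assumptions for illustration): with W = WL = I − σξ²Σ⁻¹, the target covariance Σ solves the Lyapunov equation exactly, and the slowing factor τmax/τm is the negative inverse of the largest eigenvalue of (WL − I).

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma_xi = 40, 1.0

# Random well-conditioned covariance matrix standing in for the posterior.
G = rng.standard_normal((N, 2 * N))
Sigma = G @ G.T / (2 * N)

# Langevin solution (Eq. 5).
W_L = np.eye(N) - sigma_xi**2 * np.linalg.inv(Sigma)
B = W_L - np.eye(N)

# Lyapunov residual (Eq. 4) vanishes when Sigma_r = Sigma.
residual = B @ Sigma + Sigma @ B.T + 2 * sigma_xi**2 * np.eye(N)

# Slowest eigenmode: tau_max / tau_m = -1 / lambda_max(W_L - I).
lambda_max = np.max(np.linalg.eigvalsh(B))   # B is symmetric for Langevin
slowing_factor = -1.0 / lambda_max
```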
The contribution of the slowest eigenmode to the sample autocorrelation time is τmax = −τm/Re(λmax^(W−I)), so sampling becomes very slow when Re(λmax^(W−I)) approaches 0. This is, in fact, what happens with LS as N → ∞. Indeed, we could derive the following generic lower bound (details can be found in our Supplementary Information, SI):\n\nλmax^(WL−I) ≥ −(σξ/σ0)² / √(1 + Nσr²) (7)\n\nwhich is shown as dashed lines in Fig. 2. Thus, LS becomes infinitely slow in the large N limit when pairwise correlations do not vanish in that limit (or at least not as fast as N^(−1/2) in their std.). Slowing becomes even worse when Σ is drawn from the inverse Wishart distribution with ν degrees of freedom and scale matrix ω⁻²I (Fig. 2). We choose ν = N − 1 + ⌊σr⁻²⌋ and ω⁻² = σ0²(ν − N − 1)\n\nFigure 2: Langevin sampling (LS) is slow in high dimensions. Random covariance matrices Σ of size N are drawn from an inverse Wishart distribution with parameters chosen such that the average diagonal element (variance) is σ0² = 1 and the distribution of pairwise correlations has zero mean and variance σr² (right). Sampling from N(0, Σ) using a stochastic neural network (cf. Fig. 1) with W = WL (LS, symmetric solution) becomes increasingly slow as N grows, as indicated by the relative decay time constant τmax/τm of the slowest eigenmode of (WL − I) (left), which is also the negative inverse of its largest eigenvalue (middle). 
Dots indicate the numerical evaluation of the corresponding quantities, and errorbars (barely noticeable) denote standard deviation across several random realizations of Σ. Dashed lines correspond to the generic bound in Eq. 7. Solid lines are obtained from random matrix theory under the assumption that Σ is drawn from an inverse Wishart distribution (Eq. 8). Parameters: σξ = σ0 = 1.\n\nsuch that the expected value of a diagonal element (variance) in Σ is σ0², and the distribution of pairwise correlations is centered on zero with variance σr². The asymptotic behavior of the largest eigenvalue of Σ⁻¹ (the square of the smallest singular value of a random ν × N rectangular matrix) is known from random matrix theory (e.g. [14]), and we have for large N:\n\nλmax^(WL−I) ≈ −(σξ/σ0)² (√(N − 1 + ⌊σr⁻²⌋) − √N)² / (⌊σr⁻²⌋ − 2) ∼ −O(1/N) (8)\n\nThis scaling behavior is shown in Fig. 2 (solid lines). In fact, we can also show (cf. SI) that LS is (locally) the slowest possible choice (see Sec. 4 below for a precise definition of “slowest”, and SI for details).\n\nNote that both Eqs. 7–8 are inversely proportional to the ratio (σ0/σξ), which tells us how much the recurrent interactions must amplify the external noise in order to produce samples from the right stationary activity distribution. The more amplification is required (σ0 ≫ σξ), the slower the dynamics of LS. Conversely, one could potentially make Langevin sampling faster by increasing σξ, but σξ would need to scale as √N to annihilate the critical slowing problem. This – in itself – is
This \u2013 in itself \u2013 is\nunrealistic; moreover, it would also require the resulting connectivity matrix to have a large negative\ndiagonal (O(\u2212N )) \u2013 ie. the intrinsic neuronal time constant \u03c4m to scale as O(1/N ) \u2013, which is\nperhaps even more unrealistic.2\nNote also that LS can be sped up by appropriate \u201cpreconditioning\u201d (e.g. [15, 16]), for example using\nthe inverse Hessian of the log-posterior. In our case, a simple calculation shows that this corresponds\nto removing all recurrent connections, and pushing the posterior covariance matrix to the external\nnoise sources, which is only postponing the problem to some other brain network.\nFinally, LS is fundamentally implausible as a neuronal implementation: it imposes symmetric synap-\ntic interactions, which is simply not possible in the brain due to the existence of distinct classes of\nexcitatory and inhibitory neurons (Dale\u2019s principle). In the following section, we show that networks\ncan be constructed that overcome all the above limitations of LS in a principled way.\n\n4 General solution and quanti\ufb01cation of sampling speed\n\nWhile Langevin dynamics (Eq. 6) provide a general recipe for sampling from any given posterior\ndensity, they unduly constrain the recurrent interactions to be symmetric \u2013 at least in the Gaussian\n\n2From a pure machine learning perspective, increasing \u03c3\u03be is not an option either: the increasing stiffness of\nEq. 6 would either require the use of a very small integration step, or would lead to arbitrarily small acceptance\nratios in the context of Metropolis-Hastings proposals.\n\n4\n\n11010010001101001000slowingfactor\u03c4max/\u03c4m\u03c3r=0.10\u03c3r=0.20-1-0.8-0.6-0.4-0.201101001000\u03bbWL\u2212Imax-1-0.500.51(\u2248N(0,\u03c3r))networksizeNsimulation(inverseWishart)theory(inverseWishart)lowerbound(general)networksizeNpairwisecorr.\fFigure 3: How fast is the fastest sampler? 
(A) Scalar measure of the statistical dependency between any two samples collected kτm seconds apart (cf. main text), for Langevin sampling (black), Gibbs sampling (blue, assuming a full update sweep is done every τm), a series of networks (brown to red) with connectivities given by Eq. 9 where the elements of the skew-symmetric matrix S were drawn i.i.d. from N(0, ζ²) for different values of ζ (see also panel B), the unconstrained optimized network (yellow), and the optimized E/I network (green). For reference, the dashed gray line shows the behavior of a network in which there are no recurrent interactions, and the posterior covariance is encoded in the covariance of the input noise, which in fact corresponds to Langevin sampling with inverse Hessian (“Newton”-like) preconditioning [16]. (B) Total slowing cost ψslow(S) when Si