{"title": "Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 854, "page_last": 862, "abstract": "To learn reliable rules that can generalize to novel situations, the brain must be capable of imposing some form of regularization. Here we suggest, through theoretical and computational arguments, that the combination of noise with synchronization provides a plausible mechanism for regularization in the nervous system. The functional role of regularization is considered in a general context in which coupled computational systems receive inputs corrupted by correlated noise. Noise on the inputs is shown to impose regularization, and when synchronization upstream induces time-varying correlations across noise variables, the degree of regularization can be calibrated over time. The resulting qualitative behavior matches experimental data from visual cortex.", "full_text": "Synchronization can Control Regularization in\nNeural Systems via Correlated Noise Processes\n\nJake Bouvrie\n\nDepartment of Mathematics\n\nDuke University\n\nDurham, NC 27708\n\njvb@math.duke.edu\n\nJean-Jacques Slotine\n\nNonlinear Systems Laboratory\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02138\n\njjs@mit.edu\n\nAbstract\n\nTo learn reliable rules that can generalize to novel situations, the brain must be ca-\npable of imposing some form of regularization. Here we suggest, through theoreti-\ncal and computational arguments, that the combination of noise with synchroniza-\ntion provides a plausible mechanism for regularization in the nervous system. The\nfunctional role of regularization is considered in a general context in which cou-\npled computational systems receive inputs corrupted by correlated noise. 
Noise on the inputs is shown to impose regularization, and when synchronization upstream induces time-varying correlations across noise variables, the degree of regularization can be calibrated over time. The resulting qualitative behavior matches experimental data from visual cortex.

1 Introduction

The problem of learning from examples is in most circumstances ill-posed. This is particularly true for biological organisms, where the “examples” are often complex and few in number, and the ability to adapt is a matter of survival. Theoretical work in inverse problems has long established that regularization restores well-posedness [5, 20] and, furthermore, implies stability and generalization of a learned rule [2]. How the nervous system imposes regularization is not entirely clear, however. Bayesian theories of learning and decision making [14, 12, 29] hold that the brain is able to represent prior distributions and assign (time-varying) uncertainty to sensory measurements. By way of a Bayesian integration, the brain may effectively work with hypothesis spaces of limited complexity when appropriate, trading off prior knowledge against new evidence [9]. But while these mechanisms can effect regularization, it is still not clear how to calibrate it: when to cease adaptation, or how to fix a hypothesis space suited to a given task. A second possible explanation is that regularization – and a representation of uncertainty – may emerge naturally due to noise. Intuitively, if noise is allowed to “smear” observations presented to a learning apparatus, overfitting may be mitigated – a well known phenomenon in artificial neural networks [1].

In this paper we argue that noise provides an appealing, plausible mechanism for regularization in the nervous system.
We consider a general context in which coupled computational circuits subject to independent noise receive common inputs corrupted by spatially correlated noise. Information processing pathways in the mammalian visual cortex, for instance, fall under such an organizational pattern [10, 24, 7]. The computational systems in this setting represent high-level processing stages, downstream from localized populations of neurons which encode sensory input. Noise correlations in the latter arise from, for instance, within-population recurrent connections, shared feed-forward inputs, and common stimulus preferences [24]. Independent noise impacting higher-level computational elements may arise from more intrinsic, ambient neuronal noise sources, and may be roughly independent due to broader spatial distribution [6].

To help understand the functional role of noise in inducing regularization, we propose a high-level model that can explain quantitatively how noise translates into regularization, and how regularization may be calibrated over time. The ability to adjust regularization is key: as an organism accumulates experience, its models of the world should be able to adjust to the complexity of the relationships and phenomena it encounters, as well as reconcile new information with prior probabilities. Our point of view is complementary to Bayesian theories of learning; the representation and integration of sensory uncertainty is closely related to a regularization interpretation of learning in ill-posed settings. We postulate that regularization may be plausibly controlled by one of the most ubiquitous mechanisms in the brain: synchronization.
A simple, one-dimensional regression (association) problem in the presence of both independent ambient noise and correlated measurement noise suffices to illustrate the core ideas.

When a learner is presented with a collection of noisy observations, we show that synchronization may be used to adjust the dependence between observational noise variables, and that this in turn leads to a quantifiable change in the degree of regularization imposed upon the learning task. Regularization is further shown to both improve the convergence rate towards the solution to the regression problem, and reduce the negative impact of ambient noise. The model's qualitative behavior coincides with experimental data from visual tracking tasks [10] (area MT) and from anesthetized animals [24] (area V1), in which correlated noise impacts sensory measurements and correlations increase over short time scales. Other experiments involving perceptual learning tasks have shown that noise correlations decrease with long-term training [8]. The mechanism we propose suggests that changes in noise correlations arising from feedback synchronization can calibrate regularization, possibly leading to improved convergence properties or better solutions. Collectively, the experimental evidence lends credence to the hypothesis that, at a high level, the brain may be optimizing its learning processes by adapting dependence among noise variables, with regularization an underlying computational theme.

Lastly, we consider how continuous dynamics solving a given learning problem might be efficiently computed in cortex. In addition to supporting regularization, noise can be harnessed to facilitate distributed computation of the gradients needed to implement a dynamic optimization process. Following from this observation, we analyze a stochastic finite difference scheme approximating derivatives of quadratic objectives.
Difference signals and approximately independent perturbations are the only required computational components. This distributed approach to the implementation of dynamic learning processes further highlights a connection between parallel stochastic gradient descent algorithms [25, 15, 28] and neural computation.

2 Learning as noisy gradient descent on a network

The learning process we will consider is that of a one-dimensional linear fitting problem described by a dynamic gradient-based minimization of a square loss objective, in the spirit of Rao & Ballard [21]. This is perhaps the simplest and most fundamental abstract learning problem that an organism might be confronted with – that of using experiential evidence to infer correlations and ultimately discover causal relationships which govern the environment and which can be used to make predictions about the future. The model realizing this learning process is also simple, in that we capture neural communication as an abstract process “in which a neural element (a single neuron or a population of neurons) conveys certain aspects of its functional state to another neural element” [22]. In doing so, we focus on the underlying computations taking place in the nervous system rather than particular neural representations.
The analysis that follows, however, may be extended more generally to multi-layer feedback hierarchies.

To make the setting more concrete, assume that we have observed a set of input-output examples {x_i ∈ R, y_i ∈ R}_{i=1}^m, with each x_i representing a generic unit of sensory experience, and want to estimate the linear regression function f_w(x) = wx (we assume the intercept is 0 for simplicity). Adopting the square loss, the total prediction error incurred on the observations by the rule f_w is given by

E(w) = (1/2) Σ_{i=1}^m (y_i − f_w(x_i))² = (1/2) Σ_{i=1}^m (y_i − w x_i)².   (1)

Note that there is no explicit regularization penalty here. We will model adaptation (training) by a noisy gradient descent process on this squared prediction error loss function. The gradient of E with respect to the slope parameter is given by ∇_w E = −Σ_{i=1}^m (y_i − w x_i) x_i, and generates the continuous-time, noise-free gradient dynamics

ẇ = −∇_w E(w).   (2)

The learning dynamics we will consider, however, are assumed to be corrupted by two distinct kinds of noise:

(N1) Sensory observations (x_i)_i are corrupted by time-varying, correlated noise processes.
(N2) The dynamics are themselves corrupted by additive “ambient” noise.

To accommodate (N1) we will borrow an averaging, or homogenization, technique for multi-scale systems of stochastic differential equations (SDEs) that will drastically simplify the analysis. We have discussed the origins of (N1) above.
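As a concrete numerical illustration (our sketch, not part of the paper's development; the data, step size, and seed are arbitrary choices), the noise-free flow (2) can be integrated by Euler steps and converges to the least-squares slope minimizing (1):

```python
# Euler integration of the noise-free gradient flow (2) for the square loss (1).
# Illustrative sketch: data x, y, the step size, and the seed are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
m = 20
x = rng.normal(0.0, 1.0, m)
y = 2.5 * x + rng.normal(0.0, 0.1, m)   # near-linear data, true slope 2.5

w, dt = 0.0, 1e-3
for _ in range(20000):
    grad = -np.dot(y - w * x, x)        # grad_w E = -sum_i (y_i - w x_i) x_i
    w -= dt * grad                      # one Euler step of (2)

w_star = np.dot(x, y) / np.dot(x, x)    # closed-form minimizer of (1)
print(w, w_star)
```

The flow is linear in w, so it contracts exponentially to w_star at rate ‖x‖², a fact used repeatedly in the convergence analysis below.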
The noise (N2) may be significant (we do not take small noise limits) and can be attributed to some or all of: error in computing and sensing a gradient, intrinsic neuronal noise [6] (aggregated or localized), or interference between large assemblies of neurons or circuits.

Synchronization among circuits and/or populations will be modeled by considering multiple coupled dynamical systems, each receiving the same noisy observations. Such networks of systems capture common pooling or averaging computations, and provide a means for studying variance reduction. The collective enhancement of precision hypothesis suggests that the nervous system copes with noise by averaging over collections of signals in order to reduce variation in behavior and improve computational accuracy [23, 13, 26, 3]. Coupling synchronizes the collection of dynamical systems so that each tends to a common “consensus” trajectory having reduced variance. If the coupling is strong enough, then the variance of the consensus trajectory decreases as O(1/n) after transients, if there are n signals or circuits [23, 17, 19, 3]. We will consider regularization in the context of networks of coupled SDEs, and investigate the impact of coupling, redundancy (n) and regularization upon the convergence behavior of the system. Considering networks will allow a more general analysis of the interplay between different mechanisms for coping with noise; however, n can be small or 1 in some situations.

Formally, the noise-free flow (2) can be modified to include noise sources (N1) and (N2) as follows. Noise (N1) may be modeled as a white-noise limit of Ornstein-Uhlenbeck (OU) processes (Z_t)_i, and (N2) as an additive diffusive noise term. In differential form, we have

dw_t = −( w_t ‖x + Z_t‖² − ⟨x + Z_t, y⟩ ) dt + σ dB_t,   (3a)
dZ^i_t = −(Z^i_t/ε) dt + (√2 γ/√ε) dB^i_t,   i = 1, …, m.   (3b)

Here, B_t denotes the standard 1-dimensional Brownian motion and captures noise source (N2). The observations (x)_i = x_i are corrupted by the noise processes (Z_t)_i = Z^i_t, following (N1). For the moment, the Z^i_t are independent, but we will relax this assumption later. The parameter 0 < ε ≪ 1 controls the correlation time of a given noise process. In the limit as ε → 0, Z^i_t may be viewed as a family of independent zero-mean Gaussian random variables indexed by t. Characterizing the noise Z_t as (3b) with ε → 0 serves as both a modeling approximation/idealization and an analytical tool.

2.1 Homogenization

The system (3a)-(3b) above is a classic “fast-slow” system: the gradient descent trajectory w_t evolves on a timescale much longer than the O(ε) stochastic perturbations Z_t. Homogenization considers the dynamics of w_t after averaging out the effect of the fast variable Z_t. In the limit as ε → 0 in (3b), the solution to the original SDE (3a) converges (in a sense to be discussed below) to the solution of the averaged SDE.

The following Theorem is an instance of [18, Thm. 3], adapted to the present setting.

Theorem 2.1.
Let 0 < ε ≪ 1, σ, γ > 0 and let X, Y denote finite-dimensional Euclidean spaces. Consider the system

dx = f(x, y)dt + γ dW_t,   x(0) = x_0   (4a)
dy = ε⁻¹ g(y)dt + ε^{−1/2} σ dB_t,   y(0) = y_0   (4b)

where x ∈ X, y ∈ Y, and W_t ∈ X, B_t ∈ Y are independent multivariate Brownian motions. Assume that for all x ∈ X, y ∈ Y the following conditions on (4) hold:

⟨g(y), y/‖y‖⟩ ≤ −r‖y‖^α,   ‖f(x, y)‖ ≤ K(1 + ‖x‖)(1 + ‖y‖^q),   ‖f(x, y) − f(x′, y)‖ ≤ C(y)‖x − x′‖,

with r > 0, α ≥ 0, q < ∞, and where C(y) is a constant depending on y. If the SDE (4b) is ergodic, then there exists a unique invariant measure μ∞ characterizing the probability distribution of y_t in the steady state, and we may define the vector field F(x) ≜ E_{μ∞}[f(x, y)] = ∫_Y f(x, y) μ∞(dy). Furthermore, x(t) solving (4a) is closely approximated by X(t) solving

dX = F(X)dt + γ dW_t,   X(0) = x_0

in the sense that, for any t ∈ [0, T], x(t) ⇒ X(t) in C([0, T], X) as ε → 0.

It may be readily shown that the system (3) satisfies the conditions of Theorem 2.1. Moreover, the OU process (3b) on R^m is known to be ergodic with stationary distribution Z_∞ ∼ N(0, γ²I) (see e.g. [11]), where N(μ, Σ) denotes the multivariate Gaussian distribution with mean μ and covariance Σ.
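The averaging step can be checked numerically (our sketch, not part of the paper): sampling Z from the stationary distribution N(0, γ²I) and averaging the quadratic term in the drift of (3a) reproduces ‖x‖² + mγ², which is exactly the coefficient that appears in the homogenized dynamics:

```python
# Monte-Carlo check (illustrative) that E ||x + Z||^2 = ||x||^2 + m*gamma^2
# when Z ~ N(0, gamma^2 I), the stationary distribution of the OU process (3b).
import numpy as np

rng = np.random.default_rng(1)
m, gamma = 20, 0.7
x = rng.normal(size=m)

Z = rng.normal(0.0, gamma, size=(200000, m))   # stationary OU samples
lhs = np.mean(np.sum((x + Z) ** 2, axis=1))    # Monte-Carlo estimate
rhs = np.dot(x, x) + m * gamma ** 2            # closed-form average
print(lhs, rhs)
```

The cross term ⟨x, Z⟩ averages to zero, so only the variance of Z survives; this mγ² excess is the origin of the regularization parameter derived below.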
Averaging over the fast variable Z_t appearing in (3a) with respect to this distribution gives

dw_t = −[ w_t(‖x‖² + mγ²) − ⟨x, y⟩ ] dt + σ dB_t,   (5)

and by Theorem 2.1, we can conclude that Equation (5) well-approximates (3a) when ε → 0 in (3b), in the sense of weak convergence of probability measures.

2.2 Network structure

Now consider n ≥ 1 diffusively coupled neural systems implementing the dynamics (5), with associated parameters w(t) = (w_1(t), …, w_n(t)). If W_ij ≥ 0 is the coupling strength between systems i and j, L = diag(W1) − W is the network Laplacian [16]. We assume here that L is symmetric and defines a connected network graph. Letting α := ‖x‖² + mγ², β := ⟨x, y⟩ and μ := (β/α)1, the coupled system can be written concisely as

dw_t = −(L + αI)w_t dt + β1 dt + σ dB_t = (L + αI)(μ − w_t) dt + σ dB_t,   (6)

with B_t an n-dimensional Brownian motion. The diffusive couplings here should be interpreted as modeling abstract intercommunication between and among different neural circuits, populations, or pathways. In such a general setting, diffusive coupling is a natural and mathematically tractable choice that can capture the key, aggregate aspects of communication among neural systems. Note that one can equivalently consider n systems (3a) and then homogenize assuming n copies of the same noise process Z_t, or n independent noise processes {Z^(i)_t}_i; either choice also leads to (6).

3 Learning with noisy data imposes regularization

Equation (6) is seen by inspection to be an OU process, and has solution (see e.g. [11])

w(t) = e^{−(L+αI)t} w(0) + (I − e^{−(L+αI)t}) μ + σ ∫_0^t e^{−(L+αI)(t−s)} dB_s.   (7)

Integrals of Brownian motion are normally distributed, so w(t) is a Gaussian process and can be characterized entirely by its time-dependent mean and covariance, w(t) ∼ N(μ_w(t), Σ_w(t)). A straightforward manipulation (details omitted due to lack of space) gives

μ_w(t) := E[w(t)] = e^{−(L+αI)t} E[w(0)] + (I − e^{−(L+αI)t}) μ
Σ_w(t) := E[(w(t) − E w(t))(w(t) − E w(t))^⊤] = e^{−(L+αI)t} E[w(0)w(0)^⊤] e^{−(L+αI)t} + (σ²/2)(L + αI)^{−1}(I − e^{−2(L+αI)t}).   (8)

The solution to the noise-free regression problem (minimizing (1)) is given by w* = ⟨x, y⟩/‖x‖²; however, (7) together with (8) reveals that, for any i ∈ {1, …, n},

E[w_i(t)] → (μ)_i = ⟨x, y⟩/(‖x‖² + mγ²)   as t → ∞,   (9)

which is exactly the solution to the regularized regression problem

min_{w∈R} ‖y − wx‖² + λw²

with regularization parameter λ := mγ². To summarize, we have considered a network of coupled, noisy gradient flows implementing unregularized linear regression. When the observations x are noisy, all elements of the network converge in expectation to a common equilibrium point representing a regularized solution to the original regression problem.

3.1 Convergence behavior

In the previous section we showed that the network converges to the solution of a regularized regression problem, but left open a few important questions: What determines the convergence rate? How does the noise (N1),(N2) impact convergence? How do coupling and redundancy (number of circuits n) impact convergence? How do these quantities affect the variance of the error?
We can address these questions by decomposing w(t) into orthogonal components, w(t) = w̄(t)1 + w̃(t), representing the mean-field trajectory w̄ = (1/n)1^⊤w, and fluctuations about the mean, w̃ = w − w̄1. We may then study the error

E[(1/n)‖w(t) − μ‖²] = E[(1/n)‖w̃(t)‖²] + E[(1/n)‖w̄(t)1 − μ‖²]   (10)

by studying each term separately. Decomposing the error into fluctuations about the average and the distance between the average and the noise-free equilibrium allows one to see that there are actually two different convergence rates governing the system: one determines convergence towards the synchronization subspace (where w̃ = 0), and the other determines convergence to the equilibrium point μ. The following result provides quantitative answers to the questions posed above:

Theorem 3.1. Let C̃, C be constants which do not depend on time, and let λ denote the smallest non-zero eigenvalue of L. Set α := ‖x‖² + mγ² and μ := (⟨x, y⟩/α)1, as before. Then for all t > 0,

E[(1/n)‖w(t) − μ‖²] ≤ C̃ e^{−2(λ+α)t} + C e^{−2αt} + (σ²/2)( 1/(λ+α) + 1/(αn) ).   (11)

A proof is given in the supplementary material. The first term of (11) estimates the transient part of the fluctuations term in (10), and we find that the rate of convergence to the synchronization subspace is 2(λ + α). The second term estimates the transient part of the centroid's trajectory, and we see that the rate of convergence of the mean trajectory to equilibrium is 2α. In the presence of noise, however, the system will neither synchronize nor reach equilibrium exactly.
After transients, we see that the residual error is given by the last term in (11). This term quantifies the steady-state interaction between: gradient noise (σ); regularization (α, via the observation noise γ); network topology (via λ); coupling strength (via λ); and redundancy (n, and possibly λ).

3.2 Discussion

From the results above we can draw a few conclusions about networks of noisy learning systems:

1. Regularization improves both the synchronization rate and the rate of convergence to equilibrium.
2. Regularization contributes towards reducing the effect of the gradient noise σ: (N1) counteracts (N2).
3. Regularization changes the solution, so we cannot view regularization as a “free parameter” that can be used solely to improve convergence or reduce noise. Faster convergence rates and noise reduction should be viewed as beneficial side effects, while the appropriate degree of regularization primarily depends on the learning problem at hand.
4. The number of circuits n and the coupling strength contribute towards reducing the effect of the gradient noise (N2) (that is, the variance of the error) and improve the synchronization rate, but do not affect the rate of convergence toward equilibrium.
5. Coupling strength and redundancy cannot be used to control the degree of regularization, since the equilibrium solution μ does not depend on n or the spectrum of L.
This is true no matter how the coupling weights W_ij are chosen, since constants will always be in the null space of L and μ is a constant vector.

In the next section we will show that if the noise processes {Z^i_t}_i are themselves trajectories of a coupled network, then synchronization can be a mechanism for controlling the regularization imposed on a learning process.

4 Calibrating regularization with synchronization

If instead of assuming independent noise processes corrupting the data as in (3b), we consider correlated noise variables (Z^i_t)_{i=1}^m, it is possible for synchronization to control the regularization which the noise imposes on a learning system of the form (3a). A collection of dependent observational noise processes is perhaps most conveniently modeled by coupling the OU dynamics (3b) introduced before through another (symmetric) network Laplacian L_z:

dZ_t = −(1/ε)(L_z + ηI) Z_t dt + (√2 γ/√ε) dB_t,   (12)

for some η > 0. We now have two networks: the first network of gradient systems is the same as before, but the observational noise process Z_t is now generated by another network. For purposes of analysis, this model suffices to capture generalized correlated noise sources. In the actual biology, however, correlations may arise in a number of possible ways, which may or may not include diffusively coupled dynamic noise processes.

To analyze what happens when a network of learning systems (3a) is driven by observation noise of the form (12), we take an approach similar to that of the previous Section. The first step is again homogenization.
The system (12) may be viewed as a zero-mean variation of (6), and its solution Z_t ∼ N(μ_z(t), Σ_z(t)) is a Gaussian process characterized by

μ_z(t) = e^{−(L_z+ηI)t/ε} E[Z(0)]   (13a)
Σ_z(t) = e^{−(L_z+ηI)t/ε} E[Z(0)Z(0)^⊤] e^{−(L_z+ηI)t/ε} + γ²(L_z + ηI)^{−1}(I − e^{−2(L_z+ηI)t/ε}).   (13b)

Taking t → ∞ in (13) yields the stationary distribution μ∞ = N(0, γ²(L_z + ηI)^{−1}). We can now consider (3a) defined with Z_t governed by (12), and average with respect to μ∞:

dw_t = −E_{μ∞}[ w_t ‖x + Z_t‖² − ⟨x + Z_t, y⟩ ] dt + σ dB_t
     = −[ w_t(‖x‖² + γ² tr(L_z + ηI)^{−1}) − ⟨x, y⟩ ] dt + σ dB_t,

where we have used that E[‖Z_t‖²] = γ² tr(L_z + ηI)^{−1}. As before, the averaged approximation is good when ε → 0. An expression identical to (6),

dw_t = (L + αI)(μ − w_t) dt + σ dB_t,

is obtained by redefining α := ‖x‖² + γ² tr(L_z + ηI)^{−1} and μ := (⟨x, y⟩/α)1. In this case,

λ = α − ‖x‖² = γ² tr(L_z + ηI)^{−1}.   (14)

Theorem 3.1 may be immediately applied to understand (14). As before, the covariance of Z_t figures into the regularization parameter. However, now the covariance of Z_t is a function of the network Laplacian L_z = L_z(t), which is defined by the topology and potentially time-varying coupling strengths of the noise network. By adjusting the coupling in (12), we adjust the regularization λ imposed upon (14). When coupling increases, the dependence among the Z^i_t increases and tr(L_z + ηI)^{−1} (and therefore α) decreases.
Thus, increased correlation among observational noise variables implies decreased regularization.

In the case of all-to-all coupling with uniform strength κ ≥ 0, for example, L_z has eigenvalues 0 = λ_0 < λ_1 = ⋯ = λ_{m−1} = mκ. The regularization may in this case range over the interval

inf_κ tr(L_z + ηI)^{−1} = 1/η < λ/γ² ≤ m/η = sup_κ tr(L_z + ηI)^{−1}

by adjusting the coupling strength κ ∈ [0, ∞). Note that all-to-all coupling may be plausibly implemented with O(n) connections using mechanisms such as quorum sensing (see [3, §2.3], [27]).

5 Distributed computation with noise

We have argued that noise can serve as a mechanism for regularization. Noise may also be harnessed, in a different sense, to compute dynamics of the type discussed above. The distributed nature of the mechanism we will explore adheres to the general theme of parallel computation in the brain, and provides one possible explanation for how the gradients introduced previously might be estimated. The development is closely related to stochastic gradient descent (SGD) ideas appearing in stochastic approximation [25, 15] and adaptive optics [28].

5.1 Parallel stochastic gradient descent

Let J(u): R^d → R be a locally Lipschitz Lyapunov cost functional we wish to minimize with respect to some set of control signals u(t) ∈ R^d. Gradient descent on J can be described by the collection of flows

du_i(t)/dt = −γ ∂J/∂u_i (u_1, …, u_d),   i = 1, …, d.

We consider the case where the gradients above are estimated via finite difference approximations of the form

∂J(u)/∂u_i ≈ [ J(u_1, …, u_i + δu_i, …, u_d) − J(u_1, …, u_i, …, u_d) ] / δu_i,

where δu_i is a small perturbation applied to the i-th input.
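The coordinate-wise estimate above can be sketched for a toy quadratic cost (our example; the matrix A, the test point, and the step size are illustrative choices, not from the paper):

```python
# Coordinate-wise forward-difference gradient estimate for J(u) = u^T A u,
# compared against the exact gradient 2 A u. Illustrative sketch only.
import numpy as np

A = np.array([[2.0, 0.3], [0.3, 1.0]])       # symmetric positive definite
J = lambda u: u @ A @ u

u = np.array([1.0, -0.5])
du = 1e-6                                    # small perturbation size
est = np.array([(J(u + du * e) - J(u)) / du  # perturb one input at a time
                for e in np.eye(2)])
exact = 2 * A @ u                            # gradient of u^T A u
print(est, exact)
```

Note that this scheme requires d separate evaluations of J, one per input; the parallel stochastic scheme discussed next replaces them with a single difference signal.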
Parallel stochastic gradient descent (PSGD, see e.g. [28]) involves applying i.i.d. stochastic perturbations δu_i simultaneously to all inputs in parallel, so that the gradients ∂_i J(u) are estimated as

∂J(u)/∂u_i ≈ δJ δu_i,   i = 1, …, d   (15)

where δJ = J(u_1 + δu_1, …, u_i + δu_i, …, u_d + δu_d) − J(u_1, …, u_i, …, u_d). If the δu_i are symmetric random variables with mean zero and variance σ², then σ^{−2} E[δJ δu_i] is accurate to O(σ²) [28].

5.2 Stochastic gradient model

The parallel finite difference approximation (15) suggests a more biologically plausible mechanism for implementing gradient dynamics. If the perturbations δu_i are taken to be Gaussian i.i.d. random variables, we can model parallel stochastic gradient descent as an Ito process:

du_t = −γ[ J(u_t + Z_t) − J(u_t) ] Z_t dt,   u(0) = u_0   (16a)
dZ_t = −(1/ε) Z_t dt + (σ/√ε) dB_t,   Z(0) = z_0   (16b)

where B_t is a standard d-dimensional Brownian motion. Additive noise affecting the gradient has been omitted from (16a) for simplicity, and does not change the fundamental results discussed in this section. The perturbation noise Z_t has again been modeled as a white-noise limit of Ornstein-Uhlenbeck processes (16b). When ε → 0, Equation (16a) implements PSGD using the approximation given by Equation (15) with δu_i zero-mean i.i.d. Gaussian random variables.

We will proceed with an analysis of (16) in the particular case where J is chosen from the quadratic family of cost functionals of the form J(u) = u^⊤Au, where A is a symmetric, bounded and strictly positive definite matrix.¹ In this setting the analysis is simpler and suffices to illustrate the main points. This cost function satisfies min_{u∈R^d} J(u) = 0 with minimizer u* = 0, and J is a Lyapunov function.
Equation (16a) now takes the form

du_t = −γ( 2u_t^⊤AZ_t + Z_t^⊤AZ_t ) Z_t dt,   u(0) = u_0.   (17)

5.3 Convergence of continuous-time PSGD with quadratic cost

We turn to studying the convergence behavior of (17) and the precise role of the stochastic perturbations Z_t used to estimate the gradients. These perturbations must be small in order to obtain accurate approximations of the gradients. However, one may also expect that the noise will play an important role in determining convergence properties, since it is the noise that ultimately kicks the system “downhill” towards equilibrium. Homogenizing (17) with respect to Z_t leads to the following Theorem, the proof of which is given in the supplementary material.

Theorem 5.1. For any 0 ≤ t ≤ T < ∞, the solution u(t) to (17) satisfies

lim_{ε→0} E[u(t)] = e^{−γσ²At} u(0).   (18)

It is clear from this result that the PSGD system (16), for ε → 0, converges in expectation globally and exponentially to the minimum of J when J is a positive definite quadratic form. Our earlier intuition that the perturbation noise σ should play a role in the rate of convergence is also confirmed: greater noise amplitudes lead to faster convergence. However, this comes at a price. The covariance of u(t) after transients is exactly the covariance of Z_t. Thus an inherent tradeoff between speed and accuracy must be resolved by any organism implementing PSGD-like mechanisms.

¹Without loss of generality we may assume A is symmetric, since the antisymmetric part does not contribute to the quadratic form. In addition, objectives of the form u^⊤Au + b^⊤u + c may be expressed in the homogeneous form u^⊤Au by a suitable change of variables.

Figure 1: (Left stack) Increased observation noise imposes greater regularization, and leads to a reduction in ambient noise.
(Right stack) Stronger coupling/correlation between observation noise processes decreases regularization. See text for details.

6 Simulations

We first simulated a network of gradient dynamics with uncoupled observation noise processes obeying (3). To illustrate the effect of increasing observation noise variance, the parameter γ in (3b) was increased from 0.5 to 7 along a monotonic, sigmoidal path over the duration of the simulation. We used n = 5 systems (3a) with σ = 4, coupled all-to-all with uniform strength κ = 2. Observations were sampled according to (x)_i ∼ N(0, 0.04), (y)_i ∼ Uniform[0, 20] with m = 20 entries, once and for all, at the beginning of the experiment. Initial conditions were drawn according to w(0) ∼ Uniform[−3, 3], and Z(0) was set to 0. Figure 1 (left three plots) verifies some of the main conclusions of Section 3.2. The top plot shows the sample paths w(t) and the time course of the observational noise deviation γ(t) (grey labeled trace). When the noise increases near t = 2.5s, a dramatic drop in the variance of w(t) is visible. The middle plot shows the center-of-mass (mean-field) trajectory w̄(t) superimposed upon the time-varying noise-free solution µ(t) (gray labeled trace). Because the observation noise is increasing, the regularization λ = mγ² increases and the solution µ(t) to the regularized problem decreases in magnitude following (9). The bottom plot shows the mean-squared distance to the time-dependent noise-free solution µ(t), and the mean-squared size of the fluctuations about the centroid w̄.² It is clear that the error rapidly drops off when γ(t) increases, confirming the apparent reduction in the variance of w(t) in the top plot.

A second experiment, described by the right-hand stack of plots in Figure 1, shows how synchronization can function to adjust regularization over time.
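The relationship λ = mγ² invoked in the first experiment can be illustrated in a static least-squares analogue (our own simplified sketch; the coupled dynamical systems (3) are not reproduced here). For a scalar weight w fit to data x, y ∈ ℝᵐ, averaging the squared loss over zero-mean input noise of standard deviation γ adds exactly the ridge penalty mγ²w², so solving least squares on noise-corrupted inputs should recover the ridge solution:

```python
import numpy as np

rng = np.random.default_rng(0)

m, gamma_obs = 20, 2.0                        # gamma_obs: observation-noise std (our choice)
x = 0.2 * rng.standard_normal(m)              # (x)_i ~ N(0, 0.04), as in the experiment
y = rng.uniform(0, 20, size=m)                # (y)_i ~ Uniform[0, 20]

# Ridge solution of min_w ||y - w x||^2 + lam * w^2 with lam = m * gamma_obs^2
lam = m * gamma_obs**2
w_ridge = (x @ y) / (x @ x + lam)

# Monte Carlo: least squares on noise-corrupted inputs, averaged over many draws
draws = 20000
Xn = x + gamma_obs * rng.standard_normal((draws, m))   # noisy observations of x
w_noisy = (Xn @ y).sum() / (Xn * Xn).sum()             # argmin of the averaged loss

print(w_ridge, w_noisy)   # the two solutions agree closely
```

Larger γ increases λ and shrinks the solution toward zero, which is the mechanism behind the shrinking µ(t) observed in the first experiment.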
This simulation is inspired by the experimental study of noise correlations in cortical area MT due to [10], where it was suggested that time-varying correlations between pairs of neurons play a significant role in explaining behavioral variation in smooth-pursuit eye movements. In particular, the findings in [10] and [4] suggest that short-term increases in noise correlations are likely to occur after feedback arrives and neurons within and upstream from MT synchronize. We simulated a collection of correlated observation noise processes obeying (12) (ε = 10⁻³, η = 3) with all-to-all topology and uniform coupling strength κ_z(t) increasing from 0 to 2 along the profile shown in Figure 1 (top-right plot, labeled gray trace). This noise process Z_t was then fed to a population of n = 5 units obeying (3a), with ambient noise σ = 1 and all-to-all coupling at fixed strength W_ij = κ = 2. New data x, y and initial conditions were chosen as in the previous experiment. The middle plot on the right-hand side shows the effect of increasing synchronization among the observation noise processes. As the coupling increases, the noise becomes more correlated and regularization decreases. This in turn causes the desired solution µ(t) to the regression problem to increase in magnitude (labeled gray trace). With decreased regularization, the ambient noise is more pronounced. The bottom-right plot shows the mean fluctuation size and distance to the noise-free solution (total error). An increase in the noise variance is apparent following the increase in observational noise correlation.

²These quantities are similar to those defined in (10), but represent only this single simulation – not in expectation.
Here, ergodic theory allows one to (very roughly) infer ensemble averages by visually estimating time averages.

Acknowledgments

The authors are grateful to Rodolfo Llinas for pointing out the plausible analogy between gradient search in adaptive optics and learning mechanisms in the brain. JB was supported under DARPA FA8650-11-1-7150 SUB#7-3130298, NSF IIS-08-03293 and WA State U. SUB#113054 G002745.

References

[1] C. M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
[2] O. Bousquet and A. Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2(3):499–526, 2002.
[3] J. Bouvrie and J.-J. Slotine. Synchronization and redundancy: Implications for robustness of neural learning and decision making. Neural Computation, 23(11):2915–2941, 2011.
[4] S. C. de Oliveira, A. Thiele, and K. P. Hoffmann. Synchronization of neuronal activity during stimulus expectation in a direction discrimination task. J. Neurosci., 17(23):9248–60, 1997.
[5] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer, 1996.
[6] A. Faisal, L. Selen, and D. Wolpert. Noise in the nervous system. Nat. Rev. Neurosci., 9:292–303, April 2008.
[7] T. J. Gawne and B. J. Richmond. How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci., 13(7):2758–71, 1993.
[8] Y. Gu, S. Liu, C. R. Fetsch, Y. Yang, S. Fok, A. Sunkara, G. C. DeAngelis, and D. E. Angelaki.
Perceptual learning reduces interneuronal correlations in macaque visual cortex. Neuron, 71(4):750–761, 2011.
[9] T. D. Hanks, M. E. Mazurek, R. Kiani, E. Hopp, and M. N. Shadlen. Elapsed decision time affects the weighting of prior probability in a perceptual decision task. J. Neurosci., 31(17):6339–52, 2011.
[10] X. Huang and S. G. Lisberger. Noise correlations in cortical area MT and their potential impact on trial-by-trial variation in the direction and speed of smooth-pursuit eye movements. J. Neurophysiol., 101:3012–3030, 2009.
[11] O. Kallenberg. Foundations of Modern Probability. Springer, 2002.
[12] R. Kiani and M. N. Shadlen. Representation of confidence associated with a decision by neurons in the parietal cortex. Science, 324(5928):759–764, 2009.
[13] T. Kinard, G. De Vries, A. Sherman, and L. Satin. Modulation of the bursting properties of single mouse pancreatic β-cells by artificial conductances. Biophysical Journal, 76(3):1423–1435, 1999.
[14] K. P. Körding and D. M. Wolpert. Bayesian decision theory in sensorimotor control. Trends in Cognitive Sciences, 10(7):319–326, 2006.
[15] H. J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2003.
[16] M. Mesbahi and M. Egerstedt. Graph Theoretic Methods in Multiagent Networks. Princeton U. Press, 2010.
[17] D. J. Needleman, P. H. Tiesinga, and T. J. Sejnowski. Collective enhancement of precision in networks of coupled oscillators. Physica D: Nonlinear Phenomena, 155(3-4):324–336, 2001.
[18] E. Pardoux and A. Yu. Veretennikov. On the Poisson equation and diffusion approximation. I. Annals of Probability, 29(3):1061–1085, 2001.
[19] Q.-C. Pham, N. Tabareau, and J.-J. Slotine. A contraction theory approach to stochastic incremental stability. IEEE Transactions on Automatic Control, 54(4):816–820, April 2009.
[20] T.
Poggio and S. Smale. The mathematics of learning: dealing with data. Notices Amer. Math. Soc., 50(5):537–544, 2003.
[21] R. P. Rao and D. H. Ballard. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci., 2:79–87, 1999.
[22] A. Schnitzler and J. Gross. Normal and pathological oscillatory communication in the brain. Nature Reviews Neuroscience, 6:285–296, 2005.
[23] A. Sherman and J. Rinzel. Model for synchronization of pancreatic beta-cells by gap junction coupling. Biophysical Journal, 59(3):547–559, 1991.
[24] M. A. Smith and A. Kohn. Spatial and temporal scales of neuronal correlation in primary visual cortex. J. Neurosci., 28(48):12591–12603, 2008.
[25] J. C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37:332–341, 1992.
[26] N. Tabareau, J.-J. Slotine, and Q.-C. Pham. How synchronization protects from noise. PLoS Comput. Biol., 6(1):e1000637, Jan 2010.
[27] A. Taylor, M. Tinsley, F. Wang, Z. Huang, and K. Showalter. Dynamical quorum sensing and synchronization in large populations of chemical oscillators. Science, 323(5914):614–617, 2009.
[28] M. A. Vorontsov, G. W. Carhart, and J. C. Ricklin. Adaptive phase-distortion correction based on parallel gradient-descent optimization. Opt. Lett., 22(12):907–909, Jun 1997.
[29] T. Yang and M. N. Shadlen. Probabilistic reasoning by neurons. Nature, 447(7148):1075–1080, 2007.