{"title": "Neurons Equipped with Intrinsic Plasticity Learn Stimulus Intensity Statistics", "book": "Advances in Neural Information Processing Systems", "page_first": 4278, "page_last": 4286, "abstract": "Experience constantly shapes neural circuits through a variety of plasticity mechanisms. While the functional roles of some plasticity mechanisms are well-understood, it remains unclear how changes in neural excitability contribute to learning. Here, we develop a normative interpretation of intrinsic plasticity (IP) as a key component of unsupervised learning. We introduce a novel generative mixture model that accounts for the class-specific statistics of stimulus intensities, and we derive a neural circuit that learns the input classes and their intensities. We show analytically that inference and learning for our generative model can be achieved by a neural circuit with intensity-sensitive neurons equipped with a specific form of IP. Numerical experiments verify our analytical derivations and show robust behavior for artificial and natural stimuli. Our results link IP to non-trivial input statistics, in particular the statistics of stimulus intensities for classes to which a neuron is sensitive. More generally, our work paves the way toward new classification algorithms that are robust to intensity variations.", "full_text": "Neurons Equipped with Intrinsic Plasticity Learn Stimulus Intensity Statistics

Travis Monk
Cluster of Excellence Hearing4all
University of Oldenburg
26129 Oldenburg, Germany
travis.monk@uol.de

Cristina Savin
IST Austria
3400 Klosterneuburg, Austria
csavin@ist.ac.at

Jörg Lücke
Cluster of Excellence Hearing4all
University of Oldenburg
26129 Oldenburg, Germany
joerg.luecke@uol.de

Abstract

Experience constantly shapes neural circuits through a variety of plasticity mechanisms. 
While the functional roles of some plasticity mechanisms are well-understood, it remains unclear how changes in neural excitability contribute to learning. Here, we develop a normative interpretation of intrinsic plasticity (IP) as a key component of unsupervised learning. We introduce a novel generative mixture model that accounts for the class-specific statistics of stimulus intensities, and we derive a neural circuit that learns the input classes and their intensities. We show analytically that inference and learning for our generative model can be achieved by a neural circuit with intensity-sensitive neurons equipped with a specific form of IP. Numerical experiments verify our analytical derivations and show robust behavior for artificial and natural stimuli. Our results link IP to non-trivial input statistics, in particular the statistics of stimulus intensities for classes to which a neuron is sensitive. More generally, our work paves the way toward new classification algorithms that are robust to intensity variations.

1 Introduction

Confronted with the continuous flow of experience, the brain takes amorphous sensory inputs and translates them into coherent objects and scenes. This process requires neural circuits to extract key regularities from their inputs and to use those regularities to interpret novel experiences. Such learning is enabled by a variety of plasticity mechanisms which allow neural networks to represent the statistics of the world. The most well-studied plasticity mechanism is synaptic plasticity, where the strength of connections between neurons changes as a function of their activity [1]. Other plasticity mechanisms exist and operate in tandem. One example is intrinsic plasticity (IP), where a neuron's response to inputs changes as a function of its own past activity. 
It is a challenge for computational neuroscience to understand how different plasticity rules jointly contribute to circuit computation. While much is known about the contribution of Hebbian plasticity to different variants of unsupervised learning, including linear and non-linear sparse coding [2-5], ICA [6], PCA [7] or clustering [8-12], other aspects of unsupervised learning remain unclear. First, on the computational side, there are many situations in which the meaning of an input should be invariant to its overall gain. For example, a visual scene's content does not depend on light intensity, and a word utterance should be recognized irrespective of its volume.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Current models do not explicitly take into account such gain variations, and often eliminate them using an ad hoc preprocessing step that normalizes inputs [8, 9, 13]. Second, on the biological side, the roles of other plasticity mechanisms such as IP, and their potential contributions to unsupervised learning, remain poorly understood.
IP changes the input-output function of a neuron depending on its past activity. Typically, IP is a homeostatic negative feedback loop that preserves a neuron's activation levels despite its changing input [14, 15]. There is no consensus on which quantities IP regulates, e.g. a neuron's firing rate, its internal Ca concentration, its spiking threshold, etc. 
In modeling work, IP is usually implemented as a simple threshold change that controls the mean firing rate, although some models propose more sophisticated rules that also constrain higher order statistics of the neuron's output [6, 16]. Functionally, while there have been suggestions that IP can play an important role in circuit function [6, 10, 11, 17], its role in unsupervised learning is still not fully understood.
Here we show that a neural network that combines specific forms of Hebbian plasticity and IP can learn the statistics of inputs with variable gain. We propose a novel generative model named Product-Poisson-Gamma (PPG) that explicitly accounts for class-specific variation in input gain. We then derive, from first principles, a neural circuit that implements inference and learning for this model. Our derivation yields a novel IP rule as a required component of unsupervised learning given gain variations. Our model is unique in that it directly links IP to the gain variations of the pattern to which a neuron is sensitive, which may be tested experimentally. Beyond neurobiology, the models provide a new class of efficient clustering algorithms that do not require data preprocessing. The learned representations also permit efficient classification from very little labeled data.

2 The Product-Poisson-Gamma model

Intensity can vary drastically across images although the features present in them are the same.1 This variability constitutes a challenge for learning and is typically eliminated through a preprocessing stage in which the inputs are normalized [9]. While such preprocessing can make learning easier, ad hoc normalizations may be suboptimal, or may require additional parameters to be set by hand. More importantly, input normalization has the side-effect of losing information about intensity, which might have helped identify the features themselves. 
For instance, in computer vision objects of the same class are likely to have similar surface properties, resulting in a characteristic distribution of light intensities. Light intensities can therefore aid classification. In the neural context, the overall drive to neurons may vary, e.g. due to attentional gain modulation, despite the underlying encoded features being the same.
A principled way to address intensity variations is to explicitly model them in a generative model describing the data. Then we can use that generative model to derive optimal inference and learning for such data and map them to a corresponding neural circuit implementation. Let us assume the stimuli are drawn from one of C classes, and let us denote a stimulus by y. Given a stimulus / data point y, we wish to infer the class c that generated it (see Figure 1). Let y depend not only on the class c, but also on a continuous random variable z, representing the intensity of the stimulus, that itself depends on c as well as some parameters θ. Given these dependencies Pr(y | c, z, θ) and Pr(z | c, θ), Bayes' rule specifies how to infer the class c and hidden variable z given an observation of y:

Pr(c, z | y, θ) = Pr(y | c, z, θ) Pr(z | c, θ) Pr(c | θ) / Σ_{c'} ∫ Pr(y | c', z', θ) Pr(z' | c', θ) Pr(c' | θ) dz'.   (1)

We can obtain neurally-implementable expressions for the posterior if our data generative model is a mixture model with non-negative noise, e.g. a Poisson mixture model [9]. We extend the Poisson mixture model by including an additional statistical description of stimulus intensity. The Gamma distribution is a natural choice due to its conjugacy with the Poisson distribution. Let each of the D elements in the vector y | z, c, θ (e.g. 
pixels in an image) be independent and Poisson-distributed, let z | c, θ be Gamma-distributed, and let the prior of each class be uniform:

Pr(y | c, z, θ) = Π_{d=1}^{D} Pois(y_d; z W_cd);   Pr(z | c, θ) = Gam(z; α_c, β_c);   Pr(c | θ) = 1/C,

where all W, α, and β represent the parameters of the model. To avoid ambiguity in scales, we constrain the weights of the model to sum to one, Σ_d W_cd = 1. We call this generative model a Product-Poisson-Gamma (PPG).

1 We use images as inputs and intensity as a measure of input gain as a running example. Our arguments apply regardless of the type of sensory input, e.g. the volume of sound or the concentration of odor.

While the multiplicative interaction between features and the intensity or gain variable is reminiscent of the Gaussian Scale Mixture (GSM) generative model, note that PPG has separate intensity distributions for each of the classes; each is a Gamma distribution with a (possibly unique) shape parameter α_c and rate parameter β_c. Furthermore, the non-Gaussian observation noise is critical for deriving the circuit dynamics.
The model is general and flexible, yet it is sufficiently constrained to allow for closed-form joint posteriors. 
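The generative process above is straightforward to simulate. The following is a minimal sketch (our own code, not the authors'; NumPy assumed, function name `sample_ppg` hypothetical): draw a class c uniformly, an intensity z ~ Gam(α_c, β_c), then each element y_d ~ Pois(z W_cd).

```python
import numpy as np

def sample_ppg(n, W, alpha, beta, rng=None):
    """Draw n observations from the Product-Poisson-Gamma model.

    W:     (C, D) feature weights, each row summing to one.
    alpha: (C,) Gamma shape parameters, one per class.
    beta:  (C,) Gamma rate parameters, one per class.
    """
    rng = np.random.default_rng(rng)
    C, D = W.shape
    c = rng.integers(0, C, size=n)                       # uniform class prior Pr(c) = 1/C
    z = rng.gamma(shape=alpha[c], scale=1.0 / beta[c])   # intensity z | c ~ Gam(alpha_c, beta_c)
    y = rng.poisson(z[:, None] * W[c])                   # y_d | z, c ~ Pois(z * W_cd), independent over d
    return y, c, z
```

Because each row of W sums to one, the expected summed count of a sample from class c is E[ŷ | c] = α_c/β_c, which is the quantity the IP rule derived later will track.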
As shown in Appendix A, the joint posterior of the class and intensity is:

Pr(c, z | y, θ) = [ NB(ŷ; α_c, 1/(β_c + 1)) exp(Σ_d y_d ln W_cd) / Σ_{c'} NB(ŷ; α_{c'}, 1/(β_{c'} + 1)) exp(Σ_d y_d ln W_{c'd}) ] Gam(z; α_c + ŷ, β_c + 1),

where ŷ = Σ_d y_d, and NB represents the negative binomial distribution.

We also obtain a closed-form expression of the posterior marginalized over z, which takes the form of a softmax function weighted by negative binomials:

Pr(c | y, θ) = NB(ŷ; α_c, 1/(β_c + 1)) exp(Σ_d y_d ln W_cd) / Σ_{c'} NB(ŷ; α_{c'}, 1/(β_{c'} + 1)) exp(Σ_{d'} y_{d'} ln W_{c'd'}).   (2)

This is a straightforward generalization of the standard softmax, used for optimal learning in winner-take-all (WTA) networks [2, 8, 9, 11] and WTA-based microcircuits [18]. Note that Eqn. 2 represents the optimal way to integrate evidence for class membership originating from stimulus intensity (parameterized by α and β) and pattern 'shape' (parameterized by W). If one of the two is not instructive, then the corresponding terms cancel out: if the patterns have identical shape (W with identical rows), then the softmax drops out and only negative binomial terms remain, and if all pattern classes have the same intensity distribution, then the posterior reduces to the standard softmax function as in previous work [2, 8-11].
To facilitate the link to neural dynamics, Eqn. 2 can be simplified by approximating the negative binomial distribution as Poisson. 
In the limit that α_c → ∞ with the mean λ_c ≡ α_c/β_c held constant, the negative binomial distribution becomes:

lim_{α_c → ∞, α_c/β_c = const.} NB(ŷ; α_c, 1/(β_c + 1)) = Pois(ŷ; α_c/β_c) ≡ Pois(ŷ; λ_c).

In this limit, Eqn. 2 becomes:

Pr(c | y, θ) ≈ exp(Σ_{d'} y_{d'} ln(W_{cd'} λ_c) − λ_c) / Σ_{c'} exp(Σ_{d'} y_{d'} ln(W_{c'd'} λ_{c'}) − λ_{c'}),   (3)

which can be evaluated by a neural network using soft-WTA dynamics [9].

3 Expectation-Maximization of PPG-generated data

As a starting point for deriving a biologically-plausible neural network for learning PPG-generated data, let us first consider optimal learning derived from the Expectation-Maximization (EM) algorithm [19]. Given a set of N data points y^(n), we seek the parameters θ = {W, λ} that maximize the data likelihood given the PPG model defined above. We use the EM formulation introduced in [20] and optimize the free energy given by:

F(θ_t, θ_{t-1}) = Σ_n Σ_{c'} Pr(c' | y^(n), θ_{t-1}) (ln Pr(y^(n) | c', θ_t) + ln Pr(c' | θ_t)) + H(θ_{t-1}).

Here, H(θ_{t-1}) is the Shannon entropy of the posterior as a function of the previous parameter values. We can find the M-step update rules for the parameters of the model λ_c and W_cd by taking the partial derivative of F(θ_t, θ_{t-1}) w.r.t. the desired parameter and setting it to zero. 
As shown in Appendix B, the resultant update rule for λ_{c,t} is:

∂F(θ_t, θ_{t-1}) / ∂λ_{c,t} = 0  ⇒  λ_{c,t} = Σ_n Pr(c | y^(n), θ_{t-1}) ŷ^(n) / Σ_n Pr(c | y^(n), θ_{t-1}).   (4)

The M-step update rules for the weights W_cd are found by setting the corresponding partial derivative of F(θ_t, θ_{t-1}) to zero, under the constraint that Σ_d W_cd = 1. Using Lagrange multipliers Λ_c yields the following update rule (see Appendix B):

∂/∂W_{cd,t} [ F(θ_t, θ_{t-1}) + Σ_{c'} Λ_{c'} (Σ_{d'} W_{c'd',t} − 1) ] = 0  ⇒  W_{cd,t} = Σ_n y_d^(n) Pr(c | y^(n), θ_{t-1}) / Σ_{d'} Σ_n y_{d'}^(n) Pr(c | y^(n), θ_{t-1}).   (5)

As numerical verification, Figure 1 illustrates the evolution of parameters λ_c and W_cd yielded by the EM algorithm on artificial data. Our artificial data set consists of four classes of rectangles on a grid of 10x10 pixels. Rectangles from different classes have different sizes and positions and are represented by a generative vector W_c^gen. We generate a data set by drawing a large number N of observations of W_c^gen, with each class equiprobable. We then draw a random variable z from a Gamma distribution with parameters α_c and β_c that depend on the class of each observation. Then, given W_c^gen and z, we create a data vector y^(n) by adding Poisson noise to each pixel. With a set of N data vectors y^(n), we then perform EM to find the parameters W_cd and λ_c that maximize the likelihood of the data set (at least locally). The E-step evaluates Equation 2 for each data vector, and the M-step evaluates Equations 4 and 5. 
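The EM procedure above can be sketched compactly in NumPy. This is our own minimal implementation, not the authors' code; the E-step uses the Poisson approximation of Eqn. 3 rather than the exact negative-binomial form of Eqn. 2, and the initialization choices (Dirichlet-distributed rows for W, data-scaled λ) are assumptions.

```python
import numpy as np

def em_ppg(Y, C, n_iter=50, rng=None, eps=1e-9):
    """EM for the PPG model, E-step via the Poisson approximation (Eqn. 3).

    Y: (N, D) array of count data. Returns W (C, D), lam (C,),
    and the final responsibilities s (N, C).
    """
    rng = np.random.default_rng(rng)
    N, D = Y.shape
    W = rng.dirichlet(np.ones(D), size=C)                # random rows, each summing to 1
    lam = Y.sum(1).mean() * (0.9 + 0.2 * rng.random(C))  # lambdas near the data's mean intensity
    for _ in range(n_iter):
        # E-step: log Pr(c | y) ∝ Σ_d y_d ln W_cd + ŷ ln λ_c − λ_c   (Eqn. 3)
        logp = Y @ np.log(W + eps).T + Y.sum(1, keepdims=True) * np.log(lam) - lam
        logp -= logp.max(1, keepdims=True)               # stabilize the softmax
        s = np.exp(logp)
        s /= s.sum(1, keepdims=True)                     # responsibilities Pr(c | y^(n))
        # M-step: Eqn. 4 for lambda, Eqn. 5 for W
        lam = (s * Y.sum(1, keepdims=True)).sum(0) / (s.sum(0) + eps)
        W = s.T @ Y
        W /= W.sum(1, keepdims=True) + eps               # enforce Σ_d W_cd = 1
    return W, lam, s
```

Note that the Eqn. 5 update is just a responsibility-weighted average of the data followed by row normalization, so the constraint Σ_d W_cd = 1 holds after every iteration.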
Figure 1 shows that, after about five iterations, the EM algorithm returns the values of W_cd and λ_c that were used to generate the data set, i.e. the parameter values that maximize the data likelihood.

Figure 1: The evolution of model parameters yielded by the EM algorithm on artificial data. A: Four classes of rectangles represented by the vector W_c^gen, with the values of λ_c for each class displayed to the left. B: Evolution of the parameters W_cd for successive iterations of the EM algorithm. C: Evolution of the parameters λ_c, with dashed lines indicating the values from the generative model. The EM algorithm returns the values of W_cd and λ_c that were used to generate the data set, i.e. the parameter values that maximize the data likelihood. For these plots, we generated a data set of 2000 inputs. W_c^gen = 100 for white pixels and 1 for black pixels. The shape and rate parameters of the Gamma distributions, from the top class to the bottom, are α = [98, 112, 128, 144] and β = [7, 7.5, 8, 8.5], giving λ_c = α_c/β_c = [14, 15, 16, 17].

4 Optimal neural learning for varying stimulus intensities

For PPG-generated data, the posterior distribution of the class given an observation is approximately the softmax function (or soft-WTA, Eqn. 3). Neural networks that implement the softmax function, usually via some form of lateral inhibition, have been extensively investigated [2, 8-11, 21]. Thus, inference in our model reduces to well-understood neural circuit dynamics.
The key remaining challenge is to analytically relate optimal learning as derived by EM to circuit plasticity. To map abstract random variables to neural counterparts, we consider a complete bipartite neural network, with the input layer corresponding to the observables y and the hidden layer representing the latent causes of the observables, i.e. 
classes.2 The network is feedforward; each neuron in the input layer connects to each neuron in the hidden layer via synaptic weights W_cd, where c ∈ [1, C] indexes the C hidden neurons and d ∈ [1, D] indexes the D input neurons.
Let each of the hidden neurons have a standard activity variable, s_c, and additionally an intrinsic parameter λ_c that represents its excitability. Let the activity of each hidden neuron be given by Eqn. 2. The activity of each hidden neuron is then the posterior distribution for one particular class, given the inputs it receives from the input layer, its synaptic weights, and its excitability:

s_c = exp(I_c) / Σ_{c'} exp(I_{c'});   I_c = Σ_{d'} y_{d'} ln(W_{cd'} λ_c) − λ_c.

The weights of the neural network W_cd are plastic and change according to a Hebbian learning rule with synaptic scaling [22]:

ΔW_cd = ε_W (s_c y_d − s_c λ_c W̄_c W_cd),   (6)

where ε_W is a small and positive learning rate, and W̄_c = Σ_d W_cd.

The intrinsic parameters λ_c are also plastic and change according to a similar learning rule:

Δλ_c = ε_λ s_c (Σ_d y_d − λ_c),   (7)

where ε_λ is another small positive learning rate. This type of regulation of excitability is homeostatic in form, but differs from standard implementations in that the excitability changes not only depending on the neuron output, s, but also on the net input to the neuron (see also [17] for a formal link between Σ_d y_d and average incoming inputs).
Appendix C shows that these online update rules enforce the desired weight normalization, with W̄_c converging to one. 

2 The number of hidden neurons does not necessarily need to equal the number of classes; see Figure 3.
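One online step of these circuit dynamics can be sketched as follows (a minimal illustration of our own, not the authors' code; NumPy assumed, function name `wta_step` hypothetical): the soft-WTA activity implements Eqn. 3, the weight update Eqn. 6, and the IP update Eqn. 7.

```python
import numpy as np

def wta_step(y, W, lam, eps_w=0.005, eps_l=0.005):
    """One online update of the soft-WTA circuit with Hebbian plasticity and IP.

    y: (D,) input counts; W: (C, D) synaptic weights; lam: (C,) excitabilities.
    """
    I = y @ np.log(W * lam[:, None]).T - lam          # input currents I_c (Eqn. 3)
    s = np.exp(I - I.max())
    s /= s.sum()                                      # soft-WTA activity s_c = softmax(I)_c
    Wbar = W.sum(1)                                   # row sums, driven toward 1
    W += eps_w * (np.outer(s, y) - (s * lam * Wbar)[:, None] * W)   # Hebbian rule, Eqn. 6
    lam += eps_l * s * (y.sum() - lam)                # IP rule, Eqn. 7
    return W, lam, s
```

Note how Eqn. 7 appears here as a gated running average: whenever a hidden unit is active (s_c large), its excitability λ_c is nudged toward the summed input ŷ = Σ_d y_d, so λ_c tracks the mean intensity of the class the unit responds to.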
Assuming weight convergence, and assuming a small learning rate and a large set of data points, the weights and intrinsic parameters converge to (see [9] and Appendix C):

W_cd^conv ≈ Σ_n y_d^(n) s_c^(n) / Σ_{d'} Σ_n y_{d'}^(n) s_c^(n);   λ_c^conv = Σ_n s_c^(n) ŷ^(n) / Σ_n s_c^(n).

Comparing these convergence expressions with the EM updates (Eqns. 5 and 4) and inserting the definition s_c = Pr(c | y, θ), we see that the neural dynamics given in Eqns. 6 and 7 have the same fixed points as optimal EM learning. The network can therefore find the parameter values that optimize the data likelihood using compact and neurally-plausible learning rules. Eqn. 6 is a standard form of Hebbian plasticity with synaptic scaling, while Eqn. 7 states how the excitability of hidden neurons should be governed by the gain of the inputs and the current to the neuron.

5 Numerical Experiments

To verify our analytical results, we first investigated learning in the derived neural network using data generated according to the PPG model. Figure 2 illustrates the evolution of parameters λ_c and W_cd yielded by the neural network on artificial data (the same as used for Figure 1). The neural network learns the synaptic weights and intrinsic parameters that were used to generate the data set, i.e. the parameter values that maximize the data likelihood.
Since our artificial data was PPG-generated, one can expect the neural network to learn the classes and intensities quickly and accurately. To test the neural network on more realistic data, we followed a number of related studies [8-12] and used MNIST as a standard dataset containing different stimulus classes. The input to the network was 28x28 pixel images (converted to vectors) from the MNIST dataset. 
We present our results for the digits 0-3 for visual ease and simulation speed; our results on the full dataset are qualitatively similar. We added an offset of 1 to all pixels and rescaled them so that no pixel was greater than 1. The λ_c were initialized to be the mean intensity of all digit classes as calculated from our modified MNIST training set. Each W_cd was initialized as W_cd ~ Pois(W_cd; μ_d) + 1, where μ_d is the mean of each pixel over all classes and is calculated from our modified MNIST training set.
Figure 3 shows an example run using C = 16 hidden neurons. It shows the change in both neural weights and intrinsic excitabilities λ_c during learning. We observe that the weights change to represent the digit classes and converge relatively quickly (panels A, B). We verified that they sum to 1

Figure 2: The evolution of model parameters yielded by the neural network on artificial data generated from the same model as that used in Figure 1. A: Four classes of rectangles with the values of λ_c for each class displayed to the left. B: Evolution of the synaptic weights W_cd that feed each hidden unit after 0, 20, 40, ..., 120 time steps, respectively. C: Evolution of the intrinsic parameters λ_c over 4000 time steps, with dashed lines indicating the values from the generative model. The neural network returns the values of W_cd and λ_c that were used to generate the data set, i.e. the parameter values that maximize the data likelihood. For these plots, ε_W = ε_λ = .005, D = 100 (for a 10x10 pixel grid), C = 4, initialized weights were uniformly-distributed between .01 and .06, and initialized intrinsic parameters were uniformly-distributed between 10 and 20.

Figure 3: The neural network's performance on a reduced MNIST dataset (the digits 0 to 3). A: Representatives of the input digits. B: The network's synaptic weights during training. 
Each square represents the weights feeding one hidden neuron. Each box of 16 squares represents the weights feeding each of the C = 16 hidden neurons after initialization, and after subsequent iterations over the training set. The network learns different writing styles for different digits. C: The network learns the average intensities, i.e. the sum of the pixels in an image, of each class of digit in MNIST. Algorithms that impose ad hoc intensity normalization in their preprocessing cannot learn these intensities. The horizontal dashed lines are the average intensities of each digit, with 1 having the lowest overall luminance and 0 the largest. The average λ_c for all hidden units representing a given digit converge to those ground truth values. D: The network's learned intensity differences improve classification performance. The percentage of correct digit classifications by a network with IP (solid lines) is higher than that by a network without IP (dashed lines). This result is robust to the number of iterations over the dataset and the number of labels used to calculate the Bayesian classifier used in [9].

for each class at convergence (not shown). We also observe that the network's IP dynamics allow it to learn the average intensities of each class of digit (panel C). The thin horizontal dashed lines are the true values for λ_c as calculated from the MNIST test set using its ground-truth label information. IP modifies the network's excitability parameters λ to converge to their true values. Our network is not only robust to variations in intensity, but learns their class-specific values.
A network that learns the excitability parameters λ exhibits a higher classification rate than a network without IP (panel D). We computed the performance of the network derived in Sec. 4 on unnormalized data in comparison with a network without IP (all else being equal). 
As a performance measure we used the classification error (computed using the same Bayesian classifier as used in [9]). Classification success rates were calculated with very few labels, using 0.5% (thin lines) and 5% (thick lines) of labels in the training set (both settings for both networks). The network with IP outperforms the network without it. This result suggests that the differences in intensities in MNIST, albeit visually small, are sufficient to aid classification.
Finally, Figure 4 shows that the neural network can learn classes that differ only in their intensities. The dataset used for Figure 4 comprises 40000 images of two types of sphere: dull and shiny. The spheres were identical in shape and position, and we generated data points (i.e. images) under a variety of lighting conditions. On average, the shiny spheres were brighter (λ_shiny ≈ 720) than the dull spheres (λ_dull ≈ 620). The network represents the two classes in its learned weights and intensities. Algorithms that utilize ad hoc normalization preprocessing schemes would have serious difficulties learning input statistics for datasets of this kind.

Figure 4: The neural network can learn classes that differ only in their intensities. The dataset consisted of either dull or shiny spheres. The network had C = 2 hidden neurons. A: Three pairs of squares represent the weights feeding each hidden neuron after initialization (leftmost pair), 10 iterations (center pair), and 200 iterations (rightmost pair) over the training set. Note the rightmost pair, particularly how the right sphere appears brighter than the left sphere. The right sphere corresponds to the shiny class and the left sphere to the dull class. B: Learned mean intensities as a function of iterations over the training set. The dull spheres have an average intensity of 620, and the shiny spheres 720. 
The network learns the classes and their average intensities, even when data points from different classes have the same sizes and positions.

6 Discussion

Neural circuit models are powerful tools for understanding neural learning and information processing. They have attracted attention as inherently parallel information processing devices for analog VLSI, a fast and power-efficient alternative to standard processor architectures [12, 23]. Much work has investigated learning with winner-take-all (WTA) type networks [2, 8-12, 18, 21, 24]. A subset of these studies [2, 8-11, 21] link synaptic plasticity in WTA networks to optimal learning, mostly using mixture distributions to model input stimuli [8-11, 21]. Our contribution expands on these results both computationally, by allowing for a robust treatment of variability in input gain, and biologically, by providing a normative justification for intrinsic plasticity during learning. Our analytical results show that the PPG generative model is tractable and neurally-implementable, while our numerical results show that it is flexible and robust.
Our model provides a principled treatment of intensity variations, something ubiquitous in realistic datasets. As a result, it allows for robust learning without requiring normalized input data. This addresses the criticisms (see [10]) of earlier WTA-like circuits [8, 9] that required normalized data. We found that explicitly accounting for intensity improves classification performance even for datasets that have been size-normalized (e.g. MNIST), presumably by providing an additional dimension for discriminating across latent features. Furthermore, we found that the learned representation of the MNIST data allows for good classification in a semi-supervised setting, when only a small fraction of the data is labeled. 
Thus, our model provides a starting point for constructing novel clustering and classification algorithms following the general approach in [9].
The treatment of intensity as an explicit variable is not new. The well-investigated class of Gaussian Scale Mixtures (GSM) is built on that idea. Nonetheless, while GSM and PPG share some conceptual similarities, they are mathematically distinct. While GSMs assume 1) Gaussian distributed random variables and 2) a common scale variable [25], PPG assumes 1') Poisson observation noise and 2') class-specific scale variables. Consequently, none of the GSM results carry over to our work, and our PPG assumptions are critical for our derived intrinsic plasticity and Hebbian plasticity rules. It would be interesting to investigate a circuit analog of intensity parameter learning in a GSM. Since this class of models is known to capture many features of afferent sensory neurons, we might make more specific predictions concerning IP in V1. It would also be interesting to compare the classification performance of a GSM with that of PPG on the same dataset. The nature of the GSM generative model (linear combination of features with multiplicative gain modulation) makes it an unusual choice for a classification task. However, in principle, one could use a GSM to learn a representation of a dataset and train a classifier on it.
The optimal circuit implementation of learning in our generative model requires a particular form of IP. The formulation of IP is a phenomenological one, reflecting the biological observation that the excitability of a neuron changes in a negative feedback loop as a function of past activity [14, 15]. Mathematically, our model shares similarities with past IP models [6, 10, 17] with the important difference that the controlled variable is the input current, rather than the output firing rate. 
Since the two quantities are closely related, we expect it will be difficult to directly disambiguate between IP models experimentally. Nonetheless, our model makes potentially testable predictions in terms of the functional role of IP, by directly linking the excitability of individual neurons to nontrivial statistics of their inputs, namely their average intensity under a Gamma distribution. Since past IP work invariably assumes the target excitability is a fixed parameter, usually shared across neurons, the link between neural excitability and real-world statistics is very specific to our model and potentially testable experimentally. Furthermore, our work provides a computational rationale for the dramatic variations in excitability across neurons, even within a local cortical circuit, which could not be explained by traditional models.
The functional role for IP identified here complements previous proposals linking the regulation of neuronal excitability to learning priors [11] or to enforcing posterior constraints [10, 26]. Ultimately, it is likely that the role of IP is manifold. Recent theoretical work suggests that the net effect of inputs on neural excitability may arise as a complex interaction between several forms of IP, some homeostatic and others not [17]. Furthermore, different experimental paradigms may preferentially expose one IP process over the others, which would explain the confusion within the literature on the exact nature of biological IP. Taken together, these models point to a fundamental role of IP for circuit computation in a variety of setups. Given its many possible roles, any approach based on first principles is valuable, as it tightly connects IP to concrete stimulus properties in a way that can translate into better-constrained experiments.
Acknowledgements. 
We acknowledge funding by the DFG within the Cluster of Excellence EXC 1077/1 (Hearing4all) and by grant LU 1196/5-1 (JL and TM), and by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 291734 (CS).

References

[1] L F Abbott and S B Nelson. Synaptic plasticity: taming the beast. Nat Neurosci, 3:1178–1183, 2000.

[2] J Lücke and M Sahani. Maximal causes for non-linear component extraction. J Mach Learn Res, 9:1227–67, 2008.

[3] C J Rozell, D H Johnson, R G Baraniuk, and B A Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural Comput, 20(10):2526–63, October 2008.

[4] J Lücke. Receptive field self-organization in a model of the fine-structure in V1 cortical columns. Neural Comput, 21(10):2805–45, 2009.

[5] J Zylberberg, J T Murphy, and M R Deweese. A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of V1 simple cell receptive fields. PLoS Comp Biol, 7(10):e1002250, 2011.

[6] C Savin, P Joshi, and J Triesch. Independent component analysis in spiking neurons. PLoS Comp Biol, 6(4):e1000757, April 2010.

[7] E Oja. A simplified neuron model as a principal component analyzer. J Math Biol, 15:267–273, 1982.

[8] B Nessler, M Pfeiffer, and W Maass. STDP enables spiking neurons to detect hidden causes of their inputs. In Adv Neural Inf Process Syst, pages 1357–1365, 2009.

[9] C Keck, C Savin, and J Lücke. Feedforward inhibition and synaptic scaling – two sides of the same coin? PLoS Comp Biol, 8(3):e1002432, 2012.

[10] S Habenschuss, J Bill, and B Nessler. Homeostatic plasticity in Bayesian spiking networks as expectation maximization with posterior constraints.
In Adv Neural Inf Process Syst, pages 773–781, 2012.

[11] B Nessler, M Pfeiffer, L Buesing, and W Maass. Bayesian computation emerges in generic cortical microcircuits through spike-timing-dependent plasticity. PLoS Comp Biol, 9(4):e1003037, 2013.

[12] M Schmuker, T Pfeil, and M P Nawrot. A neuromorphic network for generic multivariate data classification. Proc Natl Acad Sci, 111(6):2081–2086, 2014.

[13] O Schwartz and E P Simoncelli. Natural sound statistics and divisive normalization in the auditory system. Adv Neural Inf Process Syst, pages 166–172, 2000.

[14] G Daoudal and D Debanne. Long-term plasticity of intrinsic excitability: learning rules and mechanisms. Learn Memory, 10(6):456–465, 2003.

[15] R H Cudmore and G G Turrigiano. Long-term potentiation of intrinsic excitability in LV visual cortical neurons. J Neurophysiol, 92(1):341–348, 2004.

[16] M Stemmler and C Koch. How voltage-dependent conductances can adapt to maximize the information encoded by neuronal firing rate. Nat Neurosci, 2(6):521–527, 1999.

[17] C Savin, P Dayan, and M Lengyel. Optimal recall from bounded metaplastic synapses: predicting functional adaptations in hippocampal area CA3. PLoS Comp Biol, 10(2):e1003489, February 2014.

[18] R J Douglas and K A C Martin. Neuronal circuits of the neocortex. Annu Rev Neurosci, 27:419–451, 2004.

[19] A P Dempster, N M Laird, and D B Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Series B, 39:1–38, 1977.

[20] R Neal and G Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.

[21] D J Rezende, D Wierstra, and W Gerstner. Variational learning for recurrent spiking networks. Adv Neural Inf Process Syst, pages 136–144, 2011.

[22] L F Abbott and S B Nelson.
Synaptic plasticity: taming the beast. Nat Neurosci, 3(Supp):1178–1183, November 2000.

[23] E Neftci, J Binas, U Rutishauser, E Chicca, G Indiveri, and R J Douglas. Synthesizing cognition in neuromorphic electronic systems. Proc Natl Acad Sci, 110(37):E3468–E3476, 2013.

[24] J Lücke and C von der Malsburg. Rapid processing and unsupervised learning in a model of the cortical macrocolumn. Neural Comput, 16:501–33, 2004.

[25] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in analyzing and modeling natural images. Appl Comput Harmon Anal, 11(1):89–123, 2001.

[26] S Deneve. Bayesian spiking neurons I: inference. Neural Comput, 20(1):91–117, 2008.