{"title": "Hebbian Learning of Bayes Optimal Decisions", "book": "Advances in Neural Information Processing Systems", "page_first": 1169, "page_last": 1176, "abstract": "Uncertainty is omnipresent when we perceive or interact with our environment, and the Bayesian framework provides computational methods for dealing with it. Mathematical models for Bayesian decision making typically require datastructures that are hard to implement in neural networks. This article shows that even the simplest and experimentally best supported type of synaptic plasticity, Hebbian learning, in combination with a sparse, redundant neural code, can in principle learn to infer optimal Bayesian decisions. We present a concrete Hebbian learning rule operating on log-probability ratios. Modulated by reward-signals, this Hebbian plasticity rule also provides a new perspective for understanding how Bayesian inference could support fast reinforcement learning in the brain. In particular we show that recent experimental results by Yang and Shadlen [1] on reinforcement learning of probabilistic inference in primates can be modeled in this way.", "full_text": "Hebbian Learning of Bayes Optimal Decisions\n\nBernhard Nessler\u2217, Michael Pfeiffer\u2217, and Wolfgang Maass\n\nInstitute for Theoretical Computer Science\n\nGraz University of Technology\n\nA-8010 Graz, Austria\n\n{nessler,pfeiffer,maass}@igi.tugraz.at\n\nAbstract\n\nUncertainty is omnipresent when we perceive or interact with our environment,\nand the Bayesian framework provides computational methods for dealing with\nit. Mathematical models for Bayesian decision making typically require data-\nstructures that are hard to implement in neural networks. This article shows that\neven the simplest and experimentally best supported type of synaptic plasticity,\nHebbian learning, in combination with a sparse, redundant neural code, can in\nprinciple learn to infer optimal Bayesian decisions. 
We present a concrete Hebbian learning rule operating on log-probability ratios. Modulated by reward-signals, this Hebbian plasticity rule also provides a new perspective for understanding how Bayesian inference could support fast reinforcement learning in the brain. In particular we show that recent experimental results by Yang and Shadlen [1] on reinforcement learning of probabilistic inference in primates can be modeled in this way.\n\n1 Introduction\n\nEvolution is likely to favor those biological organisms which are able to maximize the chance of achieving correct decisions in response to multiple unreliable sources of evidence. Hence one may argue that probabilistic inference, rather than logical inference, is the \u201cmathematics of the mind\u201d, and that this perspective may help us to understand the principles of computation and learning in the brain [2]. Bayesian inference, or equivalently inference in Bayesian networks [3], is the most commonly considered framework for probabilistic inference, and a mathematical theory for learning in Bayesian networks has been developed.\n\nVarious attempts have been made to relate these theoretically optimal models to experimentally supported models for computation and plasticity in networks of neurons in the brain. [2] models Bayesian inference through an approximate implementation of the Belief Propagation algorithm (see [3]) in a network of spiking neurons. For reduced classes of probability distributions, [4] proposed a method for spiking network models to learn Bayesian inference with an online approximation to an EM algorithm. The approach of [5] interprets the weight wji of a synaptic connection between neurons representing the random variables xi and xj as log [p(xi, xj) / (p(xi) · p(xj))], and presents algorithms for learning these weights.\n\nNeural correlates of variables that are important for decision making under uncertainty have been presented, e.g., 
in the recent experimental study by Yang and Shadlen [1]. In their study they found\nthat \ufb01ring rates of neurons in area LIP of macaque monkeys re\ufb02ect the log-likelihood ratio (or log-\nodd) of the outcome of a binary decision, given visual evidence. The learning of such log-odds\nfor Bayesian decision making can be reduced to learning weights for a linear classi\ufb01er, given an\nappropriate but \ufb01xed transformation from the input to possibly nonlinear features [6]. We show\n\n\u2217Both authors contributed equally to this work.\n\n1\n\n\fthat the optimal weights for the linear decision function are actually log-odds themselves, and the\nde\ufb01nition of the features determines the assumptions of the learner about statistical dependencies\namong inputs.\n\nIn this work we show that simple Hebbian learning [7] is suf\ufb01cient to implement learning of Bayes\noptimal decisions for arbitrarily complex probability distributions. We present and analyze a con-\ncrete learning rule, which we call the Bayesian Hebb rule, and show that it provably converges\ntowards correct log-odds. In combination with appropriate preprocessing networks this implements\nlearning of different probabilistic decision making processes like e.g. Naive Bayesian classi\ufb01cation.\nFinally we show that a reward-modulated version of this Hebbian learning rule can solve simple\nreinforcement learning tasks, and also provides a model for the experimental results of [1].\n\n2 A Hebbian rule for learning log-odds\n\nWe consider the model of a linear threshold neuron with output y0, where y0 = 1 means that the\nneuron is \ufb01ring and y0 = 0 means non-\ufb01ring. 
The neuron's current decision ŷ0, whether to fire or not, is given by a linear decision function ŷ0 = sign(w0 · constant + ∑_{i=1}^n wi yi), where the yi are the current firing states of all presynaptic neurons and wi are the weights of the corresponding synapses. We propose the following learning rule, which we call the Bayesian Hebb rule:\n\nΔwi = η (1 + e^{−wi}),  if y0 = 1 and yi = 1\nΔwi = −η (1 + e^{wi}),  if y0 = 0 and yi = 1\nΔwi = 0,  if yi = 0.  (1)\n\nThis learning rule is purely local, i.e. it depends only on the binary firing states of the pre- and postsynaptic neurons yi and y0, the current weight wi and a learning rate η. Under the assumption of a stationary joint probability distribution of the pre- and postsynaptic firing states y0, y1, . . . , yn the Bayesian Hebb rule learns log-probability ratios of the postsynaptic firing state y0, conditioned on a corresponding presynaptic firing state yi. We consider in this article the use of the rule in a supervised, teacher-forced mode (see Section 3), and also in a reinforcement learning mode (see Section 4). We will prove that the rule converges globally to the target weight value w∗_i, given by\n\nw∗_i = log [ p(y0=1 | yi=1) / p(y0=0 | yi=1) ].\n\nWe first show that the expected update E[Δwi] under (1) vanishes at the target value w∗_i:\n\nE[Δwi]|_{w∗_i} = 0 ⇔ p(y0=1, yi=1) η (1 + e^{−w∗_i}) − p(y0=0, yi=1) η (1 + e^{w∗_i}) = 0  (2)\n⇔ (1 + e^{w∗_i}) / (1 + e^{−w∗_i}) = p(y0=1, yi=1) / p(y0=0, yi=1)\n⇔ w∗_i = log [ p(y0=1 | yi=1) / p(y0=0 | yi=1) ].  (3)\n\nSince the above is a chain of equivalence transformations, this proves that w∗_i is the only equilibrium value of the rule. 
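As a sanity check on the fixed-point argument above, the following sketch applies rule (1) to synthetic binary data and compares the learned weight with the target log-odd; the distribution parameters, learning rate, and variable names are illustrative choices, not taken from the paper:

```python
import math
import random

def bayesian_hebb_update(w, y0, yi, eta):
    """One application of the Bayesian Hebb rule (1)."""
    if y0 == 1 and yi == 1:
        return w + eta * (1.0 + math.exp(-w))
    if y0 == 0 and yi == 1:
        return w - eta * (1.0 + math.exp(w))
    return w  # yi = 0: no update

random.seed(0)
p1 = 0.8                 # assumed p(y0 = 1 | yi = 1) of the synthetic distribution
w, eta = 0.0, 0.005      # small constant learning rate
for _ in range(50000):
    yi = 1 if random.random() < 0.5 else 0
    p = p1 if yi == 1 else 0.3
    y0 = 1 if random.random() < p else 0
    w = bayesian_hebb_update(w, y0, yi, eta)

w_star = math.log(p1 / (1.0 - p1))   # target log-odd, about 1.386
```

With a small constant η the weight fluctuates narrowly around w∗_i, in line with the attractor argument; a decaying learning rate removes the residual fluctuations.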
The weight vector w∗ is thus a global point-attractor with regard to expected weight changes of the Bayesian Hebb rule (1) in the n-dimensional weight-space R^n.\n\nFurthermore we show, using the result from (3), that the expected weight change at any current value of wi points in the direction of w∗_i. Consider some arbitrary intermediate weight value wi = w∗_i + 2ε:\n\nE[Δwi]|_{w∗_i+2ε} = E[Δwi]|_{w∗_i+2ε} − E[Δwi]|_{w∗_i}\n∝ p(y0=1, yi=1) e^{−w∗_i} (e^{−2ε} − 1) − p(y0=0, yi=1) e^{w∗_i} (e^{2ε} − 1)\n= (p(y0=0, yi=1) e^{−ε} + p(y0=1, yi=1) e^{ε}) (e^{−ε} − e^{ε}).  (4)\n\nThe first factor in (4) is always non-negative, hence ε < 0 implies E[Δwi] > 0, and ε > 0 implies E[Δwi] < 0. The Bayesian Hebb rule is therefore always expected to perform updates in the right direction, and initial weight values or perturbations of the weights decay exponentially fast.\n\nAlready after having seen a finite set of examples ⟨y0, . . . , yn⟩ ∈ {0, 1}^{n+1}, the Bayesian Hebb rule closely approximates the optimal weight vector ŵ that can be inferred from the data. A traditional frequentist's approach would use counters ai = #[y0=1 ∧ yi=1] and bi = #[y0=0 ∧ yi=1] to estimate every w∗_i by\n\nŵi = log (ai / bi).  (5)\n\nA Bayesian approach would model p(y0|yi) with an (initially flat) Beta-distribution, and use the counters ai and bi to update this belief [3], leading to the same MAP estimate ŵi. Consequently, in both approaches a new example with y0 = 1 and yi = 1 leads to the update\n\nŵi^new = log ((ai + 1) / bi) = log ((ai / bi) (1 + 1/ai)) = ŵi + log (1 + (1/Ni)(1 + e^{−ŵi})),  (6)\n\nwhere Ni := ai + bi is the number of previously processed examples with yi = 1, and thus 1/ai = (1/Ni)(1 + bi/ai) = (1/Ni)(1 + e^{−ŵi}). Analogously, a new example with y0 = 0 and yi = 1 gives rise to the update\n\nŵi^new = log (ai / (bi + 1)) = log ((ai / bi) · 1/(1 + 1/bi)) = ŵi − log (1 + (1/Ni)(1 + e^{ŵi})).  (7)\n\nFurthermore, ŵi^new = ŵi for a new example with yi = 0. Using the approximation log(1 + α) ≈ α the update rules (6) and (7) yield the Bayesian Hebb rule (1) with an adaptive learning rate ηi = 1/Ni for each synapse.\n\nIn fact, a result of Robbins-Monro (see [8] for a review) implies that the updating of weight estimates ŵi according to (6) and (7) converges to the target values w∗_i not only for the particular choice η_i^{(Ni)} = 1/Ni, but for any sequence η_i^{(Ni)} that satisfies ∑_{Ni=1}^∞ η_i^{(Ni)} = ∞ and ∑_{Ni=1}^∞ (η_i^{(Ni)})^2 < ∞. Moreover, the Supermartingale Convergence Theorem (see [8]) guarantees convergence in distribution even for a sufficiently small constant learning rate.\n\nLearning rate adaptation\n\nOne can see from the above considerations that the Bayesian Hebb rule with a constant learning rate η converges globally to the desired log-odds. A too small constant learning rate, however, tends to slow down the initial convergence of the weight vector, and a too large constant learning rate produces larger fluctuations once the steady state is reached.\n\n(6) and (7) suggest a decaying learning rate η_i^{(Ni)} = 1/Ni, where Ni is the number of preceding examples with yi = 1. 
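The correspondence between the counter-based estimate (5) and rule (1) with the adaptive learning rate ηi = 1/Ni can be checked numerically; in the sketch below the counter initialization (a flat prior) and the distribution parameters are our own illustrative choices:

```python
import math
import random

random.seed(1)
a, b = 1, 1                  # counters, initialized with a flat Beta(1,1) prior
w, N = 0.0, 0                # weight learned by rule (1) with eta_i = 1 / N_i
for _ in range(20000):
    yi = 1 if random.random() < 0.6 else 0
    if yi == 0:
        continue             # neither the counters nor the weight change
    y0 = 1 if random.random() < 0.7 else 0
    a, b = a + y0, b + (1 - y0)
    N += 1
    eta = 1.0 / N
    if y0 == 1:
        w += eta * (1.0 + math.exp(-w))
    else:
        w -= eta * (1.0 + math.exp(w))

# both w and log(a/b) approach the log-odd log(0.7/0.3), about 0.847
```

The two estimates stay close because the linearized update differs from the exact counter update only by the O(1/Ni^2) error of log(1 + α) ≈ α.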
We will present a learning rate adaptation mechanism that avoids biologically implausible counters, and is robust enough to deal even with non-stationary distributions.\n\nSince the Bayesian Hebb rule and the Bayesian approach of updating Beta-distributions for conditional probabilities are closely related, it is reasonable to expect that the distribution of weights wi over longer time periods with a non-vanishing learning rate will resemble a Beta(ai, bi)-distribution transformed to the log-odd domain. The parameters ai and bi in this case are not exact counters anymore but correspond to virtual sample sizes, depending on the current learning rate. We formalize this statistical model of wi by\n\nσ(wi) = 1/(1 + e^{−wi}) ∼ Beta(ai, bi)  ⟺  wi ∼ [Γ(ai + bi)/(Γ(ai) Γ(bi))] σ(wi)^{ai} σ(−wi)^{bi}.\n\nIn practice this model turned out to capture quite well the actually observed quasi-stationary distribution of wi. In [9] we show analytically that E[wi] ≈ log(ai/bi) and Var[wi] ≈ 1/ai + 1/bi. A learning rate adaptation mechanism at the synapse that keeps track of the observed mean and variance of the synaptic weight can therefore recover estimates of the virtual sample sizes ai and bi. The following mechanism, which we call variance tracking, implements this by computing running averages of the weights and of the squared weights in w̄i and q̄i:\n\nηi^new ← (q̄i − w̄i^2)/(1 + cosh(w̄i))\nw̄i^new ← (1 − ηi) w̄i + ηi wi\nq̄i^new ← (1 − ηi) q̄i + ηi wi^2.  (8)\n\nIn practice this mechanism decays like 1/Ni under stationary conditions, but is also able to handle changing input distributions. It was used in all presented experiments for the Bayesian Hebb rule.\n\n3 Hebbian learning of Bayesian decisions\n\nWe now show how the Bayesian Hebb rule can be used to learn Bayes optimal decisions. The first application is the Naive Bayesian classifier, where a binary target variable x0 should be inferred from a vector of multinomial variables x = ⟨x1, . . . , xm⟩, under the assumption that the xi's are conditionally independent given x0, thus p(x0, x) = p(x0) ∏_{k=1}^m p(xk|x0). Using basic rules of probability theory the posterior probability ratio for x0 = 1 and x0 = 0 can be derived:\n\np(x0=1|x)/p(x0=0|x) = [p(x0=1)/p(x0=0)] ∏_{k=1}^m [p(xk|x0=1)/p(xk|x0=0)]\n= [p(x0=1)/p(x0=0)]^{1−m} ∏_{k=1}^m [p(x0=1|xk)/p(x0=0|xk)]\n= [p(x0=1)/p(x0=0)]^{1−m} ∏_{k=1}^m ∏_{j=1}^{mk} [p(x0=1|xk=j)/p(x0=0|xk=j)]^{I(xk=j)},  (9)\n\nwhere mk is the number of different possible values of the input variable xk, and the indicator function I is defined as I(true) = 1 and I(false) = 0.\n\nLet the m input variables x1, . . . , xm be represented through the binary firing states y1, . . . , yn ∈ {0, 1} of the n presynaptic neurons in a population coding manner. More precisely, let each input variable xk ∈ {1, . . . , mk} be represented by mk neurons, where each neuron fires only for one of the mk possible values of xk. Formally we define the simple preprocessing (SP)\n\ny^T = [φ(x1)^T, . . . , φ(xm)^T]  with  φ(xk)^T = [I(xk = 1), . . . , I(xk = mk)].  (10)\n\nThe binary target variable x0 is represented directly by the binary state y0 of the postsynaptic neuron. Substituting the state variables y0, y1, . . . , yn in (9) and taking the logarithm leads to\n\nlog [p(y0=1|y)/p(y0=0|y)] = (1 − m) log [p(y0=1)/p(y0=0)] + ∑_{i=1}^n yi log [p(y0=1|yi=1)/p(y0=0|yi=1)].\n\nHence the optimal decision under the Naive Bayes assumption is\n\nŷ0 = sign((1 − m) w∗_0 + ∑_{i=1}^n w∗_i yi).\n\nThe optimal weights w∗_0 and w∗_i,\n\nw∗_0 = log [p(y0=1)/p(y0=0)]  and  w∗_i = log [p(y0=1|yi=1)/p(y0=0|yi=1)]  for i = 1, . . . , n,\n\nare obviously log-odds which can be learned by the Bayesian Hebb rule (the bias weight w0 is simply learned as an unconditional log-odd).\n\n3.1 Learning Bayesian decisions for arbitrary distributions\n\nWe now address the more general case, where conditional independence of the input variables x1, . . . , xm cannot be assumed. In this case the dependency structure of the underlying distribution is given in terms of an arbitrary Bayesian network BN for discrete variables (see e.g. Figure 1A). Without loss of generality we choose a numbering scheme of the nodes of the BN such that the node to be learned is x0 and its direct children are x1, . . . , xm′. This implies that the BN can be described by m + 1 (possibly empty) parent sets defined by\n\nPk = {i | a directed edge xi → xk exists in BN and i ≥ 1}.\n\nThe joint probability distribution on the variables x0, . . . , xm in BN can then be factored and evaluated for x0 = 1 and x0 = 0 in order to obtain the probability ratio\n\np(x0=1, x)/p(x0=0, x) = p(x0=1|x)/p(x0=0|x) = [p(x0=1|x_{P0})/p(x0=0|x_{P0})] ∏_{k=1}^{m′} [p(xk|x_{Pk}, x0=1)/p(xk|x_{Pk}, x0=0)] ∏_{k=m′+1}^{m} [p(xk|x_{Pk})/p(xk|x_{Pk})].\n\nFigure 1: A) An example Bayesian network with general connectivity. B) Population coding applied to the Bayesian network shown in panel A. 
For each combination of values of the variables {xk, x_{Pk}} of a factor there is exactly one neuron (indicated by a black circle) associated with the factor that outputs the value 1. In addition OR's of these values are computed (black squares). We refer to the resulting preprocessing circuit as generalized preprocessing (GP).\n\nObviously, the last term cancels out, and by applying Bayes' rule and taking the logarithm the target log-odd can be expressed as a sum of conditional log-odds only:\n\nlog [p(x0=1|x)/p(x0=0|x)] = log [p(x0=1|x_{P0})/p(x0=0|x_{P0})] + ∑_{k=1}^{m′} ( log [p(x0=1|xk, x_{Pk})/p(x0=0|xk, x_{Pk})] − log [p(x0=1|x_{Pk})/p(x0=0|x_{Pk})] ).  (11)\n\nWe now develop a suitable sparse encoding of x1, . . . , xm into binary variables y1, . . . , yn (with n ≫ m) such that the decision function (11) can be written as a weighted sum, and the weights correspond to conditional log-odds of the yi's. Figure 1B illustrates such a sparse code: one binary variable is created for every possible value assignment to a variable and all its parents, and one additional binary variable is created for every possible value assignment to the parent nodes only. Formally, the previously introduced population coding operator φ is generalized such that φ(x_{i1}, x_{i2}, . . . , x_{il}) creates a vector of length ∏_{j=1}^{l} m_{ij} that equals zero in all entries except for one 1-entry, which identifies by its position in the vector the present assignment of the input variables x_{i1}, . . . , x_{il}. The concatenation of all these population coded groups is collected in the vector y of length n:\n\ny^T = [φ(x_{P0})^T, φ(x1, x_{P1})^T, −φ(x_{P1})^T, . . . , φ(x_{m′}, x_{P_{m′}})^T, −φ(x_{P_{m′}})^T].  (12)\n\nThe negated vector parts in (12) correspond to the negative coefficients in the sum in (11). Inserting the sparse coding (12) into (11) allows writing the Bayes optimal decision function (11) as a pure sum of log-odds of the target variable:\n\nx̂0 = ŷ0 = sign(∑_{i=1}^n w∗_i yi),  with  w∗_i = log [p(y0=1|yi≠0)/p(y0=0|yi≠0)].\n\nEvery synaptic weight wi can be learned efficiently by the Bayesian Hebb rule (1), with the formal modification that the update is not only triggered by yi=1 but in general whenever yi≠0 (which obviously does not change the behavior of the learning process). A neuron that learns with the Bayesian Hebb rule on inputs that are generated by the generalized preprocessing (GP) defined in (12) therefore approximates the Bayes optimal decision function (11), and converges quite fast to the best performance that any probabilistic inference could possibly achieve (see Figure 2B).\n\n4 The Bayesian Hebb rule in reinforcement learning\n\nWe show in this section that a reward-modulated version of the Bayesian Hebb rule enables a learning agent to solve simple reinforcement learning tasks. We consider the standard operant conditioning scenario, where the learner receives at each trial an input x = ⟨x1, . . . , xm⟩, chooses an action α out of a set of possible actions A, and receives a binary reward signal r ∈ {0, 1} with probability p(r|x, a). The learner's goal is to learn (as fast as possible) a policy π(x, a) so that action selection according to this policy maximizes the average reward. In contrast to the previous learning tasks, the learner has to explore different actions for the same input to learn the reward-probabilities for all possible actions. The agent might for example choose actions stochastically with π(x, a=α) = p(r=1|x, a=α), which corresponds to the matching behavior phenomenon often observed in biology [10]. This policy was used during training in our computer experiments.\n\nThe goal is to infer the probability of binary reward, so it suffices to learn the log-odds log [p(r=1|x, a)/p(r=0|x, a)] for every action, and choose the action that is most likely to yield reward (e.g. by a Winner-Take-All structure). If the reward probability for an action a = α is defined by some Bayesian network BN, one can rewrite this log-odd as\n\nlog [p(r=1|x, a=α)/p(r=0|x, a=α)] = log [p(r=1|a=α)/p(r=0|a=α)] + ∑_{k=1}^m log [p(xk|x_{Pk}, r=1, a=α)/p(xk|x_{Pk}, r=0, a=α)].  (13)\n\nIn order to use the Bayesian Hebb rule, the input vector x is preprocessed to obtain a binary vector y. Both a simple population code such as (10), or generalized preprocessing as in (12) and Figure 1B can be used, depending on the assumed dependency structure. The reward log-odd (13) for the preprocessed input vector y can then be written as a linear sum\n\nlog [p(r=1|y, a=α)/p(r=0|y, a=α)] = w∗_{α,0} + ∑_{i=1}^n w∗_{α,i} yi,\n\nwhere the optimal weights are w∗_{α,0} = log [p(r=1|a=α)/p(r=0|a=α)] and w∗_{α,i} = log [p(r=1|yi≠0, a=α)/p(r=0|yi≠0, a=α)]. These log-odds can be learned for each possible action α with a reward-modulated version of the Bayesian Hebb rule (1):\n\nΔw_{α,i} = η (1 + e^{−w_{α,i}}),  if r = 1, yi ≠ 0, a = α\nΔw_{α,i} = −η (1 + e^{w_{α,i}}),  if r = 0, yi ≠ 0, a = α\nΔw_{α,i} = 0,  otherwise.  (14)\n\nThe attractive theoretical properties of the Bayesian Hebb rule for the prediction case apply also to the case of reinforcement learning. 
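A minimal sketch of rule (14) under a hypothetical reward model (the feature and action counts, reward probabilities, and names below are illustrative, not from the paper): each trial activates one input feature, an action is chosen at random for exploration, and only the active synapse of the chosen action is updated; the weights then approximate the reward log-odds, so greedy selection by the larger weight recovers the better action per feature.

```python
import math
import random

random.seed(2)
n_actions, n_features = 2, 3
# hypothetical p(r = 1 | y_i = 1, a) used to generate rewards
p_reward = [[0.8, 0.7, 0.2],
            [0.3, 0.3, 0.9]]

w = [[0.0] * n_features for _ in range(n_actions)]
eta = 0.01

for _ in range(30000):
    i = random.randrange(n_features)   # the single active input feature this trial
    a = random.randrange(n_actions)    # uniform exploration of actions
    r = 1 if random.random() < p_reward[a][i] else 0
    if r == 1:                         # rule (14): potentiate on reward
        w[a][i] += eta * (1.0 + math.exp(-w[a][i]))
    else:                              # depress on omitted reward
        w[a][i] -= eta * (1.0 + math.exp(w[a][i]))

# greedy selection per feature now agrees with the true best action
best = [max(range(n_actions), key=lambda a: w[a][i]) for i in range(n_features)]
```

Note that synapses of non-chosen actions stay untouched, matching the `a = α` condition in (14).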
The weights corresponding to the optimal policy are the only\nequilibria under the reward-modulated Bayesian Hebb rule, and are also global attractors in weight\nspace, independently of the exploration policy (see [9]).\n\n5 Experimental Results\n\n5.1 Results for prediction tasks\n\nWe have tested the Bayesian Hebb rule on 400 different prediction tasks, each of them de\ufb01ned by a\ngeneral (non-Naive) Bayesian network of 7 binary variables. The networks were randomly generated\nby the algorithm of [11]. From each network we sampled 2000 training and 5000 test examples, and\nmeasured the percentage of correct predictions after every update step.\n\nThe performance of the predictor was compared to the Bayes optimal predictor, and to online logistic\nregression, which \ufb01ts a linear model by gradient descent on the cross-entropy error function. This\nnon-Hebbian learning approach is in general the best performing online learning approach for linear\ndiscriminators [3]. Figure 2A shows that the Bayesian Hebb rule with the simple preprocessing (10)\ngeneralizes better from a few training examples, but is outperformed by logistic regression in the\nlong run, since the Naive Bayes assumption is not met. With the generalized preprocessing (12), the\nBayesian Hebb rule learns fast and converges to the Bayes optimum (see Figure 2B). In Figure 2C\nwe show that the Bayesian Hebb rule is robust to noisy updates - a condition very likely to occur in\nbiological systems. We modi\ufb01ed the weight update \u2206wi such that it was uniformly distributed in\nthe interval \u2206wi \u00b1 \u03b3%. Even such imprecise implementations of the Bayesian Hebb rule perform\nvery well. 
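The noisy-update condition can be reproduced in a few lines; the multiplicative noise model below (here with γ = 100%) follows the description above, while the distribution parameters are our own illustrative choices:

```python
import math
import random

random.seed(3)
gamma = 1.0                  # 100% multiplicative noise on every weight change
p = 0.75                     # assumed p(y0 = 1 | yi = 1)
w, eta = 0.0, 0.01
for _ in range(40000):
    yi = 1 if random.random() < 0.5 else 0
    if yi == 0:
        continue
    y0 = 1 if random.random() < p else 0
    dw = eta * (1.0 + math.exp(-w)) if y0 == 1 else -eta * (1.0 + math.exp(w))
    w += dw * (1.0 + random.uniform(-gamma, gamma))   # imprecise update

w_star = math.log(p / (1.0 - p))     # target log-odd, about 1.099
```

Since the noise factor has mean one, the expected update, and hence the equilibrium, is unchanged; only the fluctuations around w∗ grow.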
Similar results can be obtained if the exp-function in (1) is replaced by a low-order Taylor approximation.\n\nFigure 2: Performance comparison for prediction tasks. A) The Bayesian Hebb rule with simple preprocessing (SP) learns as fast as Naive Bayes, and faster than logistic regression (with optimized constant learning rate). B) The Bayesian Hebb rule with generalized preprocessing (GP) learns fast and converges to the Bayes optimal prediction performance. C) Even a very imprecise implementation of the Bayesian Hebb rule (noisy updates, uniformly distributed in Δwi ± γ%) yields almost the same learning performance.\n\n5.2 Results for action selection tasks\n\nThe reward-modulated version (14) of the Bayesian Hebb rule was tested on 250 random action selection tasks with m = 6 binary input attributes, and 4 possible actions. For every action a random Bayesian network [11] was drawn to model the input and reward distributions (see [9] for details). 
The agent received stochastic binary rewards for every chosen action and updated the weights w_{α,i} according to (14); the average reward was measured on 500 independent test trials.\n\nIn Figure 3A we compare the reward-modulated Bayesian Hebb rule with simple population coding (10) (Bayesian Hebb SP) and generalized preprocessing (12) (Bayesian Hebb GP) to the standard learning model for simple conditioning tasks, the non-Hebbian Rescorla-Wagner rule [12]. The reward-modulated Bayesian Hebb rule learns as fast as the Rescorla-Wagner rule, and achieves in combination with generalized preprocessing a higher performance level. The widely used tabular Q-learning algorithm, in comparison, is slower than the other algorithms, since it does not generalize, but it converges to the optimal policy in the long run.\n\n5.3 A model for the experiment of Yang and Shadlen\n\nIn the experiment by Yang and Shadlen [1], a monkey had to choose between gazing towards a red target R or a green target G. The probability that a reward was received at either choice depended on four visual input stimuli that had been shown at the beginning of the trial. Every stimulus was one shape out of a set of ten possibilities and had an associated weight, which had been defined by the experimenter. The sum of the four weights yielded the log-odd of obtaining a reward at the red target, and a reward for each trial was assigned accordingly to one of the targets. The monkey thus had to combine the evidence from four visual stimuli to optimize its action selection behavior.\n\nIn the model of the task it is sufficient to learn weights only for the action a = R, and select this action whenever the log-odd using the current weights is positive, and G otherwise. A simple population code as in (10) encoded the 4-dimensional visual stimulus into a 40-dimensional binary vector y. 
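The encoding and decision stage of this model can be sketched as follows; the shape weights below are random placeholders for the experimenter-defined values, which are not listed in the text:

```python
import random

random.seed(4)
n_shapes, n_positions = 10, 4
# placeholder weights; in the experiment these were fixed by the experimenters
shape_weight = [random.uniform(-0.9, 0.9) for _ in range(n_shapes)]

def encode(stimulus):
    """Simple population code (10): four shapes -> 40-dimensional binary vector y."""
    y = [0] * (n_positions * n_shapes)
    for pos, shape in enumerate(stimulus):
        y[pos * n_shapes + shape] = 1
    return y

def choose_target(y, w):
    """Gaze towards R iff the accumulated log-odd evidence for reward at R is positive."""
    log_odd = sum(wi * yi for wi, yi in zip(w, y))
    return "R" if log_odd > 0 else "G"

# tiling the shape weights over the four positions makes the decision depend
# only on the sum of the four shape weights, as in the experiment
w = shape_weight * n_positions
stimulus = [2, 7, 7, 0]      # the four shapes shown in one trial
choice = choose_target(encode(stimulus), w)
```

In the learning setting the tiled weights are not given but are acquired per input neuron by the reward-modulated rule (14).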
In our experiments, the reward-modulated Bayesian Hebb rule learns this task as fast and\nwith similar quality as the non-Hebbian Rescorla-Wagner rule. Furthermore Figures 3B and 3C\nshow that it produces after learning similar behavior as that reported for two monkeys in [1].\n\n6 Discussion\n\nWe have shown that the simplest and experimentally best supported local learning mechanism, Heb-\nbian learning, is suf\ufb01cient to learn Bayes optimal decisions. We have introduced and analyzed the\nBayesian Hebb rule, a training method for synaptic weights, which converges fast and robustly to\noptimal log-probability ratios, without requiring any communication between plasticity mechanisms\nfor different synapses. We have shown how the same plasticity mechanism can learn Bayes optimal\ndecisions under different statistical independence assumptions, if it is provided with an appropriately\npreprocessed input. We have demonstrated on a variety of prediction tasks that the Bayesian Hebb\nrule learns very fast, and with an appropriate sparse preprocessing mechanism for groups of statisti-\ncally dependent features its performance converges to the Bayes optimum. Our approach therefore\nsuggests that sparse, redundant codes of input features may simplify synaptic learning processes in\nspite of strong statistical dependencies. 
Finally we have shown that Hebbian learning also suffices for simple instances of reinforcement learning. The Bayesian Hebb rule, modulated by a signal related to rewards, enables fast learning of optimal action selection. Experimental results of [1] on reinforcement learning of probabilistic inference in primates can be partially modeled in this way with regard to resulting behaviors.\n\nFigure 3: A) On 250 4-action conditioning tasks with stochastic rewards, the reward-modulated Bayesian Hebb rule with simple preprocessing (SP) learns similarly as the Rescorla-Wagner rule, and substantially faster than Q-learning. With generalized preprocessing (GP), the rule converges to the optimal action-selection policy. B, C) Action selection policies learned by the reward-modulated Bayesian Hebb rule in the task by Yang and Shadlen [1] after 100 (B) and 1000 (C) trials are qualitatively similar to the policies adopted by monkeys H and J in [1] after learning.\n\nAn attractive feature of the Bayesian Hebb rule is its ability to deal with the addition or removal of input features through the creation or deletion of synaptic connections, since no relearning of weights is required for the other synapses. 
In contrast to discriminative neural learning rules, our approach is generative, which according to [13] leads to faster generalization. Therefore the learning rule may be viewed as a potential building block for models of the brain as a self-organizing and fast adapting probabilistic inference machine.\n\nAcknowledgments\n\nWe would like to thank Martin Bachler, Sophie Deneve, Rodney Douglas, Konrad Koerding, Rajesh Rao, and especially Dan Roth for inspiring discussions. Written under partial support by the Austrian Science Fund FWF, project # P17229-N04, project # S9102-N04, and project # FP6-015879 (FACETS) as well as # FP7-216593 (SECO) of the European Union.\n\nReferences\n\n[1] T. Yang and M. N. Shadlen. Probabilistic reasoning by neurons. Nature, 447:1075–1080, 2007.\n\n[2] R. P. N. Rao. Neural models of Bayesian belief propagation. In K. Doya, S. Ishii, A. Pouget, and R. P. N. Rao, editors, Bayesian Brain, pages 239–267. MIT Press, 2007.\n\n[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer (New York), 2006.\n\n[4] S. Deneve. Bayesian spiking neurons I, II. Neural Computation, 20(1):91–145, 2008.\n\n[5] A. Sandberg, A. Lansner, K. M. Petersson, and Ö. Ekeberg. A Bayesian attractor network with incremental learning. Network: Computation in Neural Systems, 13:179–194, 2002.\n\n[6] D. Roth. Learning in natural language. In Proc. of IJCAI, pages 898–904, 1999.\n\n[7] D. O. Hebb. The Organization of Behavior. Wiley, New York, 1949.\n\n[8] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.\n\n[9] B. Nessler, M. Pfeiffer, and W. Maass. Journal version, in preparation, 2009.\n\n[10] L. P. Sugrue, G. S. Corrado, and W. T. Newsome. Matching behavior and the representation of value in the parietal cortex. Science, 304:1782–1787, 2004.\n\n[11] J. S. Ide and F. G. Cozman. 
Random generation of Bayesian networks. In Proceedings of the 16th Brazilian Symposium on Artificial Intelligence, pages 366–375, 2002.\n\n[12] R. A. Rescorla and A. R. Wagner. A theory of Pavlovian conditioning. In A. H. Black and W. F. Prokasy, editors, Classical Conditioning II, pages 64–99, 1972.\n\n[13] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers. NIPS, 14:841–848, 2002.\n", "award": [], "sourceid": 278, "authors": [{"given_name": "Bernhard", "family_name": "Nessler", "institution": null}, {"given_name": "Michael", "family_name": "Pfeiffer", "institution": null}, {"given_name": "Wolfgang", "family_name": "Maass", "institution": null}]}