{"title": "A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1287, "page_last": 1294, "abstract": "", "full_text": "A Biologically Plausible Algorithm\n\nfor Reinforcement-shaped\nRepresentational Learning\n\nManeesh Sahani\n\nW.M. Keck Foundation Center for Integrative Neuroscience\n\nUniversity of California, San Francisco, CA 94143-0732\n\nmaneesh@phy.ucsf.edu\n\nAbstract\n\nSigni\ufb01cant plasticity in sensory cortical representations can be driven in\nmature animals either by behavioural tasks that pair sensory stimuli with\nreinforcement, or by electrophysiological experiments that pair sensory\ninput with direct stimulation of neuromodulatory nuclei, but usually not\nby sensory stimuli presented alone. Biologically motivated theories of\nrepresentational learning, however, have tended to focus on unsupervised\nmechanisms, which may play a signi\ufb01cant role on evolutionary or devel-\nopmental timescales, but which neglect this essential role of reinforce-\nment in adult plasticity. By contrast, theoretical reinforcement learning\nhas generally dealt with the acquisition of optimal policies for action in\nan uncertain world, rather than with the concurrent shaping of sensory\nrepresentations. This paper develops a framework for representational\nlearning which builds on the relative success of unsupervised generative-\nmodelling accounts of cortical encodings to incorporate the effects of\nreinforcement in a biologically plausible way.\n\n1 Introduction\n\nA remarkable feature of the brain is its ability to adapt to, and learn from, experience.\nThis learning has measurable physiological correlates in terms of changes in the stimulus-\nresponse properties of individual neurons in the sensory systems of the brain (as well as in\nmany other areas). 
While passive exposure to sensory stimuli can have profound effects on the developing sensory cortex, significant plasticity in mature animals tends to be observed only in situations where sensory stimuli are associated with either behavioural or electrical reinforcement. Considerable theoretical attention has been paid to unsupervised learning of representations adapted to natural sensory statistics, and to the learning of optimal policies of action for decision processes; however, relatively little work (particularly of a biological bent) has sought to understand the impact of reinforcement tasks on representation.\n\nTo be complete, an understanding of sensory plasticity must come at two different levels. At a mechanistic level, it is important to understand how synapses are modified, and how synaptic modifications can lead to observed changes in the response properties of cells. Numerous experiments and models have addressed these questions of how sensory plasticity occurs. However, a mechanistic description alone neglects the information-processing aspects of the brain\u2019s function. Measured changes in sensory representation must underlie an adaptive change in neural information processing. If we can understand the processing goals of sensory systems, and therefore understand how changes in representation advance these goals in the face of changing experience, we will have shed light on the question of why sensory plasticity occurs. This is the goal of the current work.\n\nTo approach this goal, we first construct a representational model and associated objective function which together isolate the question of how the reinforcement-related value of a stimulus is learned (the classic problem of reinforcement learning) from the question of how this value impacts the sensory representation. 
We show that the objective\nfunction can be optimised by an expectation-maximisation learning procedure, but suggest\nthat direct optimisation is not biologically plausible, relying as it does on the availability\nof an exact posterior distribution over the cortical representation given both stimulus and\nreinforcement-value. We therefore develop and validate (through simulation) an alternative\noptimisation approach based on the statistical technique of importance sampling.\n\n2 Model\n\nThe standard algorithms of reinforcement learning (RL) deal with an agent that receives\nrewards or penalties as it interacts with a world of known structure and, generally Marko-\nvian, dynamics [1]. The agent passes through a series of \u201cstates\u201d, choosing in each one\nan action which results (perhaps stochastically) in a payoff and in a transition to another\nstate. Associated with each state (or state-action pair) and a given policy of action is a\nvalue, which represents the expected payoff that would be received if the policy were to\nbe followed starting from that initial state (and initial action). Much work in RL has fo-\ncused on learning the value function. Often the state that the agent occupies at each point\nin time is assumed to be directly observable. In other cases, the agent receives only partial\ninformation about the state it occupies, although in almost all studies the basic structure\nof the world is assumed to be known. In these partially observable models, then, the state\ninformation (which might be thought of as a form of sensory input) is used to estimate\nwhich one of a known group of states is currently occupied, and so a natural representation\nemerges in terms of a belief-distribution over states.\n\nIn the general case, however, the state structure of the world, if indeed a division into dis-\ncrete states makes sense at all, is unknown. 
Instead, the agent must simultaneously discover a representation of the sensory inputs suitable for predicting the reinforcement value, and learn the action-contingent value function itself. This general problem is quite difficult. In probabilistic terms, solving it exactly would require coping with a complicated joint distribution over representational structures and value functions. However, using an analogy to the variational inference methods of unsupervised learning [2], we might modularise our approach by factoring this joint into independent distributions over the sensory representation on the one hand and the value function on the other. In this framework approximate estimation might proceed iteratively, using the current value function to tune the sensory representation, and then re-estimating the value function for the revised sensory encoding.\n\nThe present work, being concerned with the way in which reinforcement guides sensory representational learning, focuses exclusively on the first of these two steps. Thus, we take the value associated with the current sensory input to be given. This value might represent a current estimate generated in the course of the iterative procedure described above. In many of the reinforcement schedules used in physiological experiments, however, the value is easily determined. For example, in a classical conditioning paradigm the value is independent of action, and is given by the sum of the current reinforcement and the discounted average reinforcement received. Our problem, then, is to develop a biologically plausible algorithm which is able to find a representation of the sensory input which facilitates prediction of the value.\n\nAlthough our eventual goal clearly fits well in the framework of RL, we find it useful to start from a standard theoretical account of unsupervised representational learning. 
The\nview we adopt \ufb01ts well with a Helmholtzian account of perceptual processing, in which the\nsensory cortex interprets the activities of receptors so as to infer the state of the external\nworld that must have given rise to the observed pattern of activation. Perception, by this\naccount, may be thought of as a form of probabilistic inference in a generative model. The\ngeneral structure of such a model involves a set of latent variables or causes whose values\ndirectly re\ufb02ect the underlying state of the world, along with a parameterisation of effects of\nthese causes on immediate sensory experience. A generative model of visual sensation, for\nexample, might contain a hierarchy of latent variables that, at the top, corresponded to the\nidentities and poses of visual objects or the colour and direction of the illuminating light,\nand at lower levels, represented local consequences of these more basic causes, for example\nthe orientation and contrast of local edges. Taken together, these variables would provide a\ncausal account for observations that correspond to photoreceptor activation. To apply such\na framework as a model for cortical processing, then, we take the sensory cortical activity\nto represent the inferred values of the latent variables.\n\nThus, perceptual inference in this framework involves estimating the values of the causal\nvariables that gave rise to the sensory input, while developmental (unsupervised) learning\ninvolves discovering the correct causal structure from sensory experience. Such a treatment\nhas been used to account for the structure of simple-cell receptive \ufb01elds in visual cortex\n[3, 4], and has been extended to further visual cortical response properties in subsequent\nstudies. In the present work our goal is to consider how such a model might be affected by\nreinforcement. 
Thus, in addition to the latent causes Li that generate a sensory event Si, we consider an associated (possibly action-contingent) value Vi. This value is presumably more parsimoniously associated with the causes underlying the sensory experience, rather than with the details of the receptor activation, and so we take the sensory input and the corresponding value to be conditionally independent given the cortical representation:\n\nP\u03b8(Si, Vi) = \u222b dLi P\u03b8(Si | Li) P\u03b8(Vi | Li) P\u03b8(Li),   (1)\n\nwhere \u03b8 is a general vector of model parameters. Thus, the variables Si, Li and Vi form a Markov chain. In particular, this means that whatever information Si carries about Vi is expressed (if the model is well-fit) in the cortical representation Li, making this structure appropriate for value prediction. The causal variables Li have taken on the r\u00f4le of the \u201cstate\u201d in standard RL.\n\n3 Objective function\n\nThe natural objective in reinforcement learning is to maximise some form of accumulated reward. However, the model of (1) is, by itself, descriptive rather than prescriptive. That is, the parameters modelled (those determining the responses in the sensory cortex, rather than in associative or motor areas) do not directly control actions or policies of action. Instead, these descriptive parameters only influence the animal\u2019s accumulated reinforcement through the accuracy of the description they generate. As a result, even though the ultimate objective may be to maximise total reward, we need to use objective functions that are closer in spirit to the likelihoods common in probabilistic unsupervised learning.\n\nIn particular, we consider functions of the form\n\nL(\u03b8) = \u2211i [ \u03b1(Vi) log P\u03b8(Si) + \u03b2(Vi) log P\u03b8(Vi | Si) ]   (2)\n\nIn this expression, the two log probabilities reflect the accuracy of stimulus representation, and of value prediction, respectively. 
These two terms would appear alone in a straightforward representational model of the joint distribution over sensory stimuli and values. However, in considering a representational subsystem within a reinforcement learning agent, where the overall goal is to maximise accumulated reward, it seems reasonable that the demand for representative or predictive fidelity should depend on the value associated with the stimulus; this dependence is reflected here by a value-based weighting of the log probabilities, which we assume will weight the more valuable cases more heavily.\n\n4 Learning\n\nWhile the objective function (2) does not depend explicitly on the cortical representation variables, it does depend on their distributions, through the marginal likelihoods P\u03b8(Si) = \u222b dLi P\u03b8(Si, Li) and P\u03b8(Vi | Si) = \u222b dLi P\u03b8(Vi, Li | Si). For all but the simplest probabilistic models, optimising these integral expressions directly is computationally prohibitive. However, a standard technique called the Expectation-Maximisation (EM) algorithm can be extended in a straightforward way to facilitate optimisation of functions with the form we consider here.\n\nWe introduce 2N unknown probability distributions over the cortical representation, Q\u03b1(Li) and Q\u03b2(Li). 
Then, using Jensen\u2019s inequality for convex functions, we obtain a lower bound on the objective function:\n\nL(\u03b8) = \u2211i [ \u03b1(Vi) log \u222b dLi Q\u03b1(Li) P\u03b8(Si, Li)/Q\u03b1(Li) + \u03b2(Vi) log \u222b dLi Q\u03b2(Li) P\u03b8(Li, Vi | Si)/Q\u03b2(Li) ]\n\u2265 \u2211i [ \u03b1(Vi) ( \u27e8log P\u03b8(Si, Li)\u27e9Q\u03b1(Li) + H[Q\u03b1(Li)] ) + \u03b2(Vi) ( \u27e8log P\u03b8(Li, Vi | Si)\u27e9Q\u03b2(Li) + H[Q\u03b2(Li)] ) ]\n= F(\u03b8, Q\u03b1(Li), Q\u03b2(Li))\n\nIt can be shown that, provided both functions are continuous and differentiable, local maxima of the \u201cfree-energy\u201d F with respect to all of its arguments correspond, in their optimal values of \u03b8, to local maxima of L [5]. Thus, any hill-climbing technique applied to the free-energy functional can be used to find parameters that maximise the objective. In particular, the usual EM approach alternates maximisations (or just steps in the gradient direction) with respect to each of the arguments of F. In our case, this results in the following on-line learning updates made after observing the ith data point:\n\nQ\u03b1(Li) \u2190 P\u03b8(Li | Si)   (3a)\nQ\u03b2(Li) \u2190 P\u03b8(Li | Vi, Si)   (3b)\n\u03b8 \u2190 \u03b8 + \u03b7 \u2207\u03b8 [ \u03b1(Vi)\u27e8log P\u03b8(Si, Li)\u27e9Q\u03b1(Li) + \u03b2(Vi)\u27e8log P\u03b8(Li, Vi | Si)\u27e9Q\u03b2(Li) ]   (3c)\n\nwhere the first two equations represent exact maximisations, while the third is a gradient step, with learning rate \u03b7. 
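The bound and updates above can be checked numerically. The following sketch is an illustrative toy, not the paper's simulation: it uses a binary latent cause with made-up probability tables to confirm that F lower-bounds L for arbitrary Q\u03b1, Q\u03b2, and that the bound is tight when the Q\u2019s are set to the exact posteriors of (3a) and (3b):

```python
import numpy as np

# Toy discrete world: latent L in {0,1}, binary stimulus S and value V.
# All probability tables below are arbitrary illustrative numbers.
pL = np.array([0.6, 0.4])            # P(L)
pS_L = np.array([[0.8, 0.2],         # P(S|L): row = L, column = S
                 [0.3, 0.7]])
pV_L = np.array([[0.9, 0.1],         # P(V|L): row = L, column = V
                 [0.2, 0.8]])

S, V = 1, 0                          # one observed datum
alpha, beta = 0.7, 1.3               # value-dependent weights

joint_SL = pL * pS_L[:, S]           # P(S, L) as a vector over L
pS = joint_SL.sum()                  # P(S)
post_S = joint_SL / pS               # P(L | S)      -- update (3a)
joint_VL_S = post_S * pV_L[:, V]     # P(L, V | S)
pV_S = joint_VL_S.sum()              # P(V | S)
post_SV = joint_VL_S / pV_S          # P(L | V, S)   -- update (3b)

def entropy(q):
    return -(q * np.log(q)).sum()

def free_energy(q_a, q_b):
    # F = alpha * (<log P(S,L)>_Qa + H[Qa]) + beta * (<log P(L,V|S)>_Qb + H[Qb])
    term_a = (q_a * np.log(joint_SL)).sum() + entropy(q_a)
    term_b = (q_b * np.log(joint_VL_S)).sum() + entropy(q_b)
    return alpha * term_a + beta * term_b

L_obj = alpha * np.log(pS) + beta * np.log(pV_S)

# Any distributions give a lower bound; the exact posteriors make it tight.
q = np.array([0.5, 0.5])
assert free_energy(q, q) <= L_obj + 1e-12
assert abs(free_energy(post_S, post_SV) - L_obj) < 1e-9
```

Holding the Q\u2019s at these posteriors, a gradient step on F with respect to \u03b8 is then exactly the step (3c).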
It will be useful to rewrite (3c) as\n\n\u03b8 \u2190 \u03b8 + \u03b7 [ \u03b1(Vi)\u27e8\u2207\u03b8 log P\u03b8(Si, Li)\u27e9Q\u03b1(Li) + \u03b2(Vi)\u27e8\u2207\u03b8 log P\u03b8(Li | Si)\u27e9Q\u03b2(Li) + \u03b2(Vi)\u27e8\u2207\u03b8 log P\u03b8(Vi | Li)\u27e9Q\u03b2(Li) ]   (3c\u2032)\n\nwhere the conditioning on Si in the final term is not needed due to the Markovian structure of the model.\n\n5 Biologically Plausible Learning\n\nCould something like the updates of (3) underlie the task- or neuromodulator-driven changes that are seen in sensory cortex? Two out of the three steps seem plausible. In (3a), the distribution P\u03b8(Li | Si) represents the animal\u2019s beliefs about the latent causes that led to the current sensory experience, and as such is the usual product of perceptual inference. In (3c\u2032), the various log probabilities involved are similarly natural products of perceptual or predictive computations. However, the calculation of the distribution P\u03b8(Li | Vi, Si) in (3b) is less easily reconciled with biological constraints.\n\nThere are two difficulties. First, the sensory input, Si, and the information needed to assess its associated value, Vi, often arrive at quite different times. However, construction of the posterior distribution in its full detail requires simultaneous knowledge of both Si and Vi, and would therefore only be possible if rich information about the sensory stimulus were to be preserved until the associated value could be determined. The feasibility of such detailed persistence of sensory information is unclear. The second difficulty is an architectural one. The connections from receptor epithelium to sensory areas of cortex are extensive, easily capable of conveying the information needed to estimate P(L | S). 
By contrast, the brain structures that seem to be associated with the evaluation of reinforcement, such as the ventral tegmental area or nucleus basalis, make only sparse projections to early sensory cortex; and these projections are frequently modulatory in character, rather than synaptic. Thus, exact computation of P(Li | Vi) (a component of the full P(Li | Vi, Si)) seems difficult to imagine.\n\nIt might seem at first that the former of these two problems would also apply to the weight \u03b1(Vi) (in the first term of (3c\u2032)), in that execution of this portion of the update would also need to be delayed until this value-dependent weight could be calculated. On closer examination, however, it becomes evident that this difficulty can be avoided. The trick is that, in learning, the weight can be applied to the gradient. Thus, it is sufficient only to remember the gradient, or indeed the corresponding change in synaptic weights. One possible way to do this is to actually carry out an update of the weights when just the sensory stimulus is known, but then consolidate this learning (or not) as indicated by the value-related weight. Such a consolidation signal might easily be carried by a neuromodulatory projection from subcortical nuclei involved in the evaluation of reinforcement.\n\nWe propose to solve the problem posed by P(L | S, V) in essentially the same way, that is by using information about reinforcement-value to guide modulatory reweighting or consolidation of synaptic changes that are initially based on the sensory stimulus alone. Note that the expectations over P(Li | Si, Vi) that appear in (3c\u2032) could, in principle, be replaced by sums over samples drawn from the distribution. Since learning is gradual and on-line, such a stochastic gradient ascent algorithm would still converge (in probability) to the optimum. 
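The delayed-consolidation trick described above \u2013 commit a provisional synaptic change when the stimulus arrives, then rescale it once the value-dependent weight becomes available \u2013 can be sketched as follows. The gradient function and all numbers here are placeholders for illustration, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(3)          # illustrative "synaptic" parameters
eta = 0.1                    # learning rate

def grad_log_p(s, theta):
    # Placeholder for the gradient of <log P_theta(S, L)>; a real model
    # would compute this from the inferred posterior over latent causes.
    return s - theta

# Step 1: a sensory input arrives; the update is applied provisionally,
# and the change itself is remembered.
s = rng.normal(size=3)
pending = eta * grad_log_p(s, theta)
theta += pending

# Step 2: the value information arrives later; a neuromodulatory signal
# rescales (consolidates or undoes) the stored change, so the net update
# is alpha(V_i) times the original gradient step.
alpha_v = 0.25               # value-dependent weight alpha(V_i)
theta += (alpha_v - 1.0) * pending
```

The net effect is `theta_old + alpha_v * eta * gradient`, even though the gradient was computed and applied before the value was known.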
Of course, sampling from this distribution is no more compatible with the foregoing biological constraints than integrating over it. However, consider drawing samples \u02dcLi from P(Li | Si), and then weighting the corresponding terms in the sum by w(\u02dcLi) = P(Vi | \u02dcLi)/P(Vi | Si). Then we have, taking the second term in (3c\u2032) for example,\n\n\u27e8\u2207\u03b8 log P\u03b8(\u02dcLi | Si) w(\u02dcLi)\u27e9\u02dcLi\u223cP(Li|Si) = \u222b d\u02dcLi \u2207\u03b8 log P\u03b8(\u02dcLi | Si) [P(Vi | \u02dcLi)/P(Vi | Si)] P(\u02dcLi | Si)\n= \u222b d\u02dcLi \u2207\u03b8 log P\u03b8(\u02dcLi | Si) P(Vi, \u02dcLi | Si)/P(Vi | Si)\n= \u27e8\u2207\u03b8 log P\u03b8(\u02dcLi | Si)\u27e9\u02dcLi\u223cP(Li|Si,Vi).\n\nThis approach to learning, which exploits the standard statistical technique of importance sampling [6], resolves both of the difficulties discussed above. It implies that reinforcement-related processing and learning in the sensory systems of the brain proceeds in these stages:\n\n1. The sensory input is processed to infer beliefs about the latent causes P\u03b8(Li | Si). One or more samples \u02dcLi are drawn from this distribution.\n\n2. Synaptic weights are updated to follow the gradients \u27e8\u2207\u03b8 log P\u03b8(Si, Li)\u27e9P\u03b8(Li|Si) and \u2207\u03b8 log P\u03b8(\u02dcLi | Si) (corresponding to the first two terms of (3c\u2032)).\n\n3. The associated value is predicted, both on the basis of the full posterior, giving P\u03b8(Vi | Si), and on the basis of the sample(s), giving P\u03b8(Vi | \u02dcLi).\n\n4. The actual value is observed or estimated, facilitating calculation of the weights \u03b1(Vi), \u03b2(Vi), and w(\u02dcLi).\n\n5. 
These weights are conveyed to sensory cortex and used to consolidate (or not) the synaptic changes of step 2.\n\nThis description does not encompass the updates corresponding to the third term of (3c\u2032). Such updates could be undertaken once the associated value became apparent; however, the parameters that represent the explicit dependence of value on the latent variables are unlikely to lie in the sensory cortex itself (instead determining computations in subsequent processing).\n\n5.1 Distributional Sampling\n\nA commonly encountered difficulty with importance sampling has to do with the distribution of importance weights wi. If the range of weights is too extensive, the optimisation will be driven primarily by a few large weights, leading to slow and noisy learning. Fortunately, it is possible to formulate an alternative, in which distributions over the cortical representational variables, rather than samples of the variables themselves, are randomly generated and weighted appropriately.1\n\nLet \u02dcPi(L) be a distribution over the latent causes L, drawn randomly from a functional distribution P(\u02dcPi | Si), such that \u27e8\u02dcPi(L)\u27e9P(\u02dcPi|Si) = P(Li | Si). Then, by analogy with the result above, it can be shown that given importance weights\n\nw(\u02dcPi) = \u222b dL P(Vi | L) \u02dcPi(L) / P(Vi | Si),   (4)\n\nwe have\n\n\u27e8\u27e8\u2207\u03b8 log P\u03b8(\u02dcLi | Si)\u27e9\u02dcLi\u223c\u02dcPi w(\u02dcPi)\u27e9\u02dcPi\u223cP(\u02dcPi|Si) = \u27e8\u2207\u03b8 log P\u03b8(Li | Si)\u27e9Li\u223cP(Li|Si,Vi).\n\nThese distributional samples can thus be used in almost exactly the same manner as the single-valued samples described above.\n\n1 This sampling scheme can also be formalised as standard importance sampling carried out with a cortical representation re-expressed in terms of the parameters determining the distribution \u02dcPi(L).\n\n6 Simulation\n\nA paradigmatic generative model structure is that underlying factor analysis (FA) [7], in which both latent and observed variables are normally distributed:\n\nP\u03b8(Li) = N(0, I) ;   P\u03b8(Si | Li) = N(\u039bS Li, \u03a8S) ;   P\u03b8(Vi | Li) = N(\u039bV Li, \u03a8V ) .   (5)\n\nFigure 1: Generative and learned sensory weights. See text for details.\n\nThe parameters of the FA model (grouped here in \u03b8) comprise two linear weight matrices \u039bS and \u039bV and two diagonal noise covariance matrices \u03a8S and \u03a8V. This model is similar in its linear generative structure to the independent components analysis models that have previously been employed in accounts of unsupervised development of visual cortical properties [3, 4]; the only difference is in the assumed distribution of the latent variables. The unit normal assumption of FA introduces a rotational degeneracy in solutions. This can be resolved in general by constraining the weight matrix \u039b = [\u039bS, \u039bV] to be orthogonal \u2013 giving a version of FA known as principal factor analysis (PFA).\n\nWe used a PFA-based simulation to verify that the distributional importance-weighted sampling procedure described here is indeed able to learn the correct model given sensory and reinforcement-value data. 
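Before turning to the full simulation, the importance-weighting identity of section 5 can be checked in a scalar analogue of (5); all parameter values below are arbitrary illustrative choices. Samples drawn from P(L | S) and weighted by P(V | L) (the normaliser P(V | S) cancels in the self-normalised estimate) reproduce expectations under the biologically unavailable posterior P(L | S, V):

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar analogue of the FA model (5): L ~ N(0,1), S|L and V|L Gaussian.
lam_s, psi_s = 1.5, 0.5      # sensory weight and noise variance
lam_v, psi_v = 0.8, 0.2      # value weight and noise variance
S_obs, V_obs = 1.0, 0.6      # one observed stimulus/value pair

# Exact posterior P(L | S): the "perceptual inference" step.
prec_s = 1.0 + lam_s**2 / psi_s
mu_s = (lam_s * S_obs / psi_s) / prec_s

# Exact posterior P(L | S, V), computed here only for comparison.
prec_sv = prec_s + lam_v**2 / psi_v
mu_sv = (lam_s * S_obs / psi_s + lam_v * V_obs / psi_v) / prec_sv

# Importance sampling: draw from P(L|S), weight each sample by P(V|L).
L = rng.normal(mu_s, np.sqrt(1.0 / prec_s), size=200_000)
w = np.exp(-0.5 * (V_obs - lam_v * L) ** 2 / psi_v)
mu_hat = (w * L).sum() / w.sum()

# The weighted posterior mean matches the exact P(L | S, V) mean.
assert abs(mu_hat - mu_sv) < 0.01
```

The same reweighting applied to the gradient terms of (3c\u2032), rather than to L itself, gives the learning rule used in the simulation below.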
Random vectors representing sensory inputs and associated values were generated according to (5); these were then used as inputs to a learning system. The objective function optimised had both value-dependent weights \u03b1(Vi) and \u03b2(Vi) set to unity; thus the learning system simply attempted to model the joint distribution of sensory and reinforcement data.\n\nThe generative model comprised 11 latent variables, 40 observed sensory variables (which were arranged linearly so as to represent 40 discrete values along a single sensory axis), and a single reinforcement variable. Ten of the latent variables only affected the sensory observations. The weight vectors corresponding to each of these are shown by the solid lines in figure 1a. These \u201ctuning curves\u201d were designed to be orthogonal. The curves shown in figure 1a have been rescaled to have equal maximal amplitude; in fact the amplitudes were randomly varied so that they formed a unique orthogonal basis for the data. These features of the generative weight matrix were essential for PFA to be able to recover the generative model uniquely. The final latent variable affected both reinforcement value and the sensory input at a single point (indicated by the dashed line in figure 1a). Since the output noise matrix in PFA can associate arbitrary variance with each sensory variable, a model fit to only the sensory data would treat the influence of this latent cause as noise. Only when the joint distribution over both sensory input and reinforcement is modelled will this aspect of the sensory data be captured in the model parameters.\n\n[Figure 1, panels a\u2013c: (a) generative weights, (b) unweighted learning, (c) weighted learning; sensory input dimension vs. relative amplitude.]\n\nLearning was carried out by processing data generated by the model described above one sample at a time. The posterior distribution P\u03b8(Li | Si) for the PFA model is Gaussian, with covariance \u03a3L = (I + \u039bS\u1d40\u03a8S\u207b\u00b9\u039bS)\u207b\u00b9 and mean \u00b5L = \u03a3L\u039bS\u1d40\u03a8S\u207b\u00b9Si. The distributional samples \u02dcPi were also taken to be Gaussian. Each had covariance 0.6\u03a3L and mean drawn randomly from N(\u00b5L, 0.4\u03a3L).\n\nTwo simulations were performed. In one case learning proceeded according to the sampled distributions \u02dcPi, with no importance weighting. In the other, learning was modulated by the importance weights given by (4). In all other regards the two simulations were identical. In particular, in both cases the reinforcement predictive weights \u039bV were estimated, and in both cases the orthogonality constraint of PFA was applied to the combined estimated weight matrix [\u039bS, \u039bV]. Figure 1b and c show the sensory weights \u039bS learnt by each of these procedures (again the curves have been rescaled to show relative weights). Both algorithms recovered the basic tuning properties; however, only the importance sampling algorithm was able to model the additional data feature that was linked to the prediction of reinforcement value. The fact that in all other regards the two learning simulations were identical demonstrates that the importance weighting procedure (rather than, say, the orthogonality constraint) was responsible for this difference.\n\n7 Summary\n\nThis paper has presented a framework within which the experimentally observed impact of behavioural reinforcement on sensory plasticity might be understood. This framework rests on a similar foundation to the recent work that has related unsupervised learning to sensory response properties. It extends this foundation to consider prediction of the reinforcement value associated with sensory stimuli. Direct learning by expectation-maximisation within this framework poses difficulties regarding biological plausibility. 
However, these were resolved by the introduction of an importance-sampled approach, along with its extension to distributional sampling. Information about reinforcement is thus carried by a weighting signal that might be identified with the neuromodulatory signals in the brain.\n\nReferences\n\n[1] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.\n\n[2] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Mach. Learning, 37(2):183\u2013233, 1999.\n\n[3] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607\u20139, 1996.\n\n[4] A. J. Bell and T. J. Sejnowski. The \u201cindependent components\u201d of natural scenes are edge filters. Vision Res., 37(23):3327\u20133338, 1997.\n\n[5] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, ed., Learning in Graphical Models, pp. 355\u2013370. Kluwer Academic Press, 1998.\n\n[6] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. CUP, Cambridge, 2nd edition, 1993.\n\n[7] B. S. Everitt. An Introduction to Latent Variable Models. Chapman and Hall, London, 1984.\n", "award": [], "sourceid": 2412, "authors": [{"given_name": "Maneesh", "family_name": "Sahani", "institution": null}]}