{"title": "Statistical Models of Conditioning", "book": "Advances in Neural Information Processing Systems", "page_first": 117, "page_last": 123, "abstract": null, "full_text": "Statistical Models of Conditioning \n\nPeter Dayan* \n\nBrain & Cognitive Sciences \n\nE25-210, MIT \n\nCambridge, MA 02139 \n\nTheresa Long \n\n123 Hunting Cove \n\nWilliamsburg, VA 23185 \n\nAbstract \n\nConditioning experiments probe the ways that animals make predictions about rewards and punishments and use those predictions to control their behavior. One standard model of conditioning paradigms which involve many conditioned stimuli suggests that individual predictions should be added together. Various key results show that this model fails in some circumstances, and motivate an alternative model, in which there is attentional selection between different available stimuli. The new model is a form of mixture of experts, has a close relationship with some other existing psychological suggestions, and is statistically well-founded. \n\n1 Introduction \n\nClassical and instrumental conditioning experiments study the way that animals learn about the causal texture of the world (Dickinson, 1980) and use this information to their advantage. Although it reached a high level of behavioral sophistication, conditioning has long since gone out of fashion as a paradigm for studying learning in animals, partly because of the philosophical stance of many practitioners, that the neurobiological implementation of learning is essentially irrelevant. 
\n\nHowever, more recently it has become possible to study how conditioning phenomena are affected by particular lesions or pharmacological treatments to the brain (eg Gallagher & Holland, 1994), and how particular systems, during simple learning tasks, report information that is consistent with models of conditioning (Gluck & Thompson, 1987; Gabriel & Moore, 1989). \n\nIn particular, we have studied the involvement of the dopamine (DA) system in the ventral tegmental area of vertebrates in reward based learning (Montague et al, 1996; Schultz et al, 1997). The activity of these cells is consistent with a model in which they report a temporal difference (TD) based prediction error for reward (Sutton & Barto, 1981; 1989). This prediction error signal can be used to learn correct predictions and also to learn appropriate actions (Barto, Sutton & Anderson, 1983). The DA system is important since it is crucially involved in normal reward learning, and also in the effects of drugs of addiction, self stimulation, and various neural diseases. \n\n*This work was funded by the Surdna Foundation. \n\nThe TD model is consistent with a whole body of experiments, and has even correctly anticipated new experimental findings. However, like the Rescorla-Wagner (RW; 1972) or delta rule, it embodies a particular additive model for the net prediction made when there are multiple stimuli. Various sophisticated conditioning experiments have challenged this model and found it wanting. The results support competitive rather than additive models. Although ad hoc suggestions have been made to repair the model, none has a sound basis in appropriate prediction. There is a well established statistical theory for competitive models, and it is this that we adopt. 
\n\nIn this paper we review existing evidence and theories, show what constraints a new theory must satisfy, and suggest and demonstrate a credible candidate. Although it is based on behavioral data, it also has direct implications for our neural theory. \n\n2 Data and Existing Models \n\nTable 1 describes some of the key paradigms in conditioning (Dickinson, 1980; Mackintosh, 1983). Although the collection of experiments may seem rather arcane (the standard notation is even more so), in fact it shows exactly the basis behind the key capacity of animals in the world to predict events of consequence. We will extract further biological constraints implied by these and other experiments in the discussion. \n\nIn the table, l (light) and s (tone) are potential predictors (called conditioned stimuli or CSs), of a consequence, r, such as the delivery of a reward (called an unconditioned stimulus or US). Even though we use TD rules in practice, we discuss some of the abstract learning rules without much reference to the detailed time course of trials. The same considerations apply to TD. \n\nIn Pavlovian conditioning, the light acquires a positive association with the reward in a way that can be reasonably well modeled by: \n\nΔw_l(t) = α_l(t) (r(t) - w_l(t)) l(t), (1) \n\nwhere l(t) ∈ {0, 1} represents the presence of the light in trial t (s(t) will similarly represent the presence of a tone), w_l(t) (we will often drop the index t) represents the strength of the expectation about the delivery of reward r(t) in trial t if the light is also delivered, and α_l(t) is the learning rate. This is just the delta rule. It also captures well the probabilistic contingent nature of conditioning - for binary r(t) ∈ {0, 1}, animals seem to assess Δ = P[r(t) | l(t) = 1] - P[r(t) | l(t) = 0], and then only expect reward following the light (in the model, have w_l > 0) if Δ > 0. \n\nPavlovian conditioning is easy to explain under a whole wealth of rules. 
The trouble comes in extending equation 1 to the case of multiple predictors (in this paper we consider just two). The other paradigms in table 1 probe different aspects of this. The one that is most puzzling is (perversely) called downwards unblocking (Holland, 1988). In a first set of trials, an association is established between the light and two presentations of reward separated by a few (u) seconds. In a second set, a tone is included with the light, but the second reward is dropped. The animal amasses less reward in conjunction with the tone. However, when presented with the tone alone, the animal expects the presence rather than the absence of reward. On the face of it, this seems an insurmountable challenge to prediction-based theories. First we describe the existing theories, then we formalise some potential replacements. \n\n# | Name | Set 1 | Set 2 | Test \n1 | Pavlovian | | l → r | l ⇒ r \n2 | Overshadowing | | l+s → r | l ⇒ r½; s ⇒ r½ \n3 | Inhibitory | | l → r; l+s → · | s ⇒ r̄ \n4 | Blocking | l → r | l+s → r | s ⇒ · \n5 | Upwards unblocking | l → r | l+s → r Δ_u r | s ⇒ r \n6 | Downwards unblocking | l → r Δ_u r | l+s → r | s ⇒ ±r \n\nTable 1: Paradigms. Sets 1 and 2 are separate sets of learning trials, which are continued until convergence. Symbols l and s indicate presentation of lights and tones as potential predictors. The ⇒ in the test set indicates that the associations of the predictors are tested, producing the listed results. In overshadowing, association with the reward can be divided between the light and the sound, indicated by r½. In some cases overshadowing favours one stimulus at the complete expense of the other; and at the end of very prolonged training, all effects of overshadowing can disappear. In blocking, the tone makes no prediction of r. In set 2 of inhibitory conditioning, the two types of trials are interleaved and the outcome is that the tone predicts the absence of reward. In upwards and downwards unblocking, the Δ_u indicates that the delivery of two rewards is separated by time u. For downwards unblocking, if u is small, then s is associated with the absence of r; if u is large, then s is associated with the presence of r. \n\nOne theory (called a US-processing theory) is due to Rescorla & Wagner (RW; 1972), and, as pointed out by Sutton & Barto (1981), is just the delta rule. For RW, the animal constructs a net prediction: \n\nV(t) = w_l(t) l(t) + w_s(t) s(t) (2) \n\nfor r(t), and then changes Δw_l(t) = α_l(t) (r(t) - V(t)) l(t) (and similarly for w_s(t)) using the prediction error r(t) - V(t). Its foundation in the delta rule makes it computationally appropriate (Marr, 1982) as a method of making predictions. TD uses the same additive model in equation 2, but uses r(t) + V(t + 1) - V(t) as the prediction error. \n\nRW explains overshadowing, inhibitory conditioning, blocking, and upwards unblocking, but not downwards unblocking. In overshadowing, the terminal association between l and r is weaker if l and s are simultaneously trained - this is expected under RW since learning stops when V(t) = r(t), and w_l and w_s will share the prediction. In inhibitory conditioning, the sound comes to predict the absence of r. The explanation of inhibitory conditioning is actually quite complicated (Konorski, 1967; Mackintosh, 1983); however RW provides the simple account that w_l = r for the l → r trials, forcing w_s = -r for the l+s → · trials. In blocking, the prior association between l and r means that w_l = r in the second set of trials, leading to no learning for the tone (since V(t) - r(t) = 0). In upwards unblocking, w_l = r at the start of set 2. Therefore, r(t) - w_l = r > 0, allowing w_s to share in the prediction. 
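The RW account of these paradigms can be checked numerically. The following is a minimal sketch, not the authors' simulation code: the function name rw_train, the fixed learning rate of 0.2 shared by both stimuli, and the trial counts are assumptions made purely for illustration. Each trial is a tuple (l, s, r), and every presented stimulus is updated from the shared prediction error r - V.

```python
def rw_train(trials, w=None, alpha=0.2):
    '''Run Rescorla-Wagner updates over a list of (l, s, r) trials.

    alpha is a fixed learning rate shared by both stimuli (an assumption
    of this sketch); w optionally supplies the starting weights.
    '''
    w = dict(w) if w else {'l': 0.0, 's': 0.0}
    for l, s, r in trials:
        v = w['l'] * l + w['s'] * s   # net additive prediction (equation 2)
        delta = r - v                 # shared prediction error
        w['l'] += alpha * delta * l   # only presented stimuli change
        w['s'] += alpha * delta * s
    return w


# Blocking: light alone in set 1, then light plus tone in set 2.
set1 = rw_train([(1, 0, 1.0)] * 200)
blocking = rw_train([(1, 1, 1.0)] * 200, w=set1)

# Overshadowing: light and tone trained together from the start.
overshadowing = rw_train([(1, 1, 1.0)] * 200)

# Inhibitory conditioning: l -> r trials interleaved with l+s -> nothing.
inhibitory = rw_train([(1, 0, 1.0), (1, 1, 0.0)] * 500)
```

Under these assumptions the tone weight stays near zero in blocking, the two weights split the reward equally in overshadowing, and the tone weight approaches -r in inhibitory conditioning, matching the verbal account above.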
\n\nAs described above, downwards unblocking is the key thorn in the side of RW. Since the TD rule combines the predictions from different stimuli in a similar way, it also fails to account properly for downwards unblocking. This is one reason why it is incorrect as a model of reward learning. \n\nThe class of theories (called CS-processing theories) that is alternative to RW does not construct a net prediction V(t), but instead uses equation 1 for all the stimuli, only changing the learning rates α_l(t) and α_s(t) as a function of the conditioning history of the stimuli (eg Mackintosh, 1975; Pearce & Hall, 1980; Grossberg, 1982). A standard notion is that there is a competition between different stimuli for a limited capacity learning processor (Broadbent, 1958; Mackintosh, 1975; Pearce & Hall, 1980), translating into competition between the learning rates. In blocking, nothing unexpected happens in the second set of trials and equally, the tone does not predict anything novel. In either case α_s is set to ≈ 0 and so no learning happens. In these models, downwards unblocking now makes qualitative sense: the surprising consequences in set 2 can be enough to set α_s ≫ 0, but then learning according to equation 1 can make w_s > 0. Whereas Mackintosh's (1975) and Pearce and Hall's (1980) models only consider competition between the stimuli for learning, Grossberg's (1982) model incorporates competition during representation, so the net prediction on a trial is affected by competitive interactions between the stimuli. In essence, our model provides a statistical formalisation of this insight. \n\n3 New Models \n\nFrom the previous section, it would seem that we have to abandon the computational basis of the RW and TD models in terms of making collective predictions about the reward. 
The CS-processing models do not construct a net prediction of the reward, or say anything about how possibly conflicting information based on different stimuli should be integrated. This is a key flaw - doing anything other than well-founded prediction is likely to be maladaptive. Even quite successful pre-synaptic models, such as Grossberg (1982), do not justify their predictions. \n\nWe now show that we can take a different, but still statistically-minded approach to combination in which we specify a parameterised probability distribution P[r(t) | s(t), l(t)] and perform a form of maximum likelihood (ML) inference, updating the parameters to maximise this probability over the samples. Consider three natural models of P[r(t) | s(t), l(t)]: \n\nP_G[r(t) | s(t), l(t)] = N[w_l l(t) + w_s s(t), σ²] (3) \n\nP_M[r(t) | s(t), l(t)] = π_l(t) N[w_l, σ²] + π_s(t) N[w_s, σ²] + π̄(t) N[w̄, τ²] (4) \n\nP_J[r(t) | s(t), l(t)] = N[w_l π_l(t) l(t) + w_s π_s(t) s(t), σ²] (5) \n\nwhere N[μ, σ²] is a normal distribution, with mean μ and variance σ². In the latter two cases, 0 ≤ π_l(t) + π_s(t) ≤ 1, implementing a form of competition between the stimuli, and π_*(t) = 0 if stimulus * is not presented. In equation 4, N[w̄, τ²] captures the background expectation if neither the light nor the tone wins, and π̄(t) = 1 - π_l(t) - π_s(t). We will show that the data argue against the first two and support the third of these models. \n\nRescorla-Wagner: P_G[r(t) | s(t), l(t)] \n\nThe RW rule is derived as ML inference based on equation 3. The only difference is the presence of the variance, σ². This is useful for capturing the partial reinforcement effect (see Mackintosh, 1983), in which if r(t) is corrupted by substantial noise (ie σ² ≫ 0), then learning to r is demonstrably slower. 
As we discussed above, downwards unblocking suggests that animals are not using P_G[r(t) | s(t), l(t)] as the basis for their predictions. \n\nCompetitive mixture of experts: P_M[r(t) | s(t), l(t)] \n\nP_M[r(t) | s(t), l(t)] is recognisable as the generative distribution in a mixture of Gaussians model (Nowlan, 1991; Jacobs et al, 1991b). Key in this model are the mixing proportions π_l(t) and π_s(t). Online variants of the E phase of the EM algorithm (Dempster et al, 1977) compute posterior responsibilities as q_l(t) + q_s(t) + q̄(t) = 1, where q_l(t) [...] 0, and, through equation 6, this in turn means that the tone will come to predict the presence of the first r. The time u between the rewards can be important because of temporal discounting. This means that there are sufficiently large values of u for which the inhibitory effect of the absence of the second reward will be dominated. Note also that the expected reward based on l(t) and s(t) is the sum \n\nπ_l(t) w_l + π_s(t) w_s + π̄(t) w̄. (7) \n\nAlthough the net prediction given in equation 7 is indeed based on all the stimuli, it does not directly affect the course of learning. This means that the model has difficulty with inhibitory conditioning. The trouble with inhibitory conditioning is that the model cannot use w_s < 0 to counterbalance w_l > 0 - it can at best set w_s = 0, which is experimentally inaccurate. Note, however, this form of competition bears some interesting similarities with comparator models of conditioning (see Miller & Matzel, 1989). It also has some problems in explaining overshadowing, for similar reasons. \n\nCooperative mixture of experts: P_J[r(t) | s(t), l(t)] \n\nThe final model P_J[r(t) | s(t), l(t)] is just like the mixture model that Jacobs et al (1991a) suggested (see also Bordley, 1982). One statistical formulation of this model considers that, independently, \n\n[...] \n\nwhere ρ_l(t) and ρ_s(t) are inverse variances. 
This makes \n\nσ² = (ρ_l(t) + ρ_s(t))^(-1), π_l(t) = ρ_l(t) σ², π_s(t) = ρ_s(t) σ². \n\nNormative learning rules should emerge from a statistical model of uncertainty in the world. Short of such a model, we used: \n\nΔw_l = α_w (π_l(t) / ρ_l(t)) δ(t) \n\nwhere δ(t) = r(t) - π_l(t) w_l(t) - π_s(t) w_s(t) is the prediction error; the 1/ρ_l(t) term in changing w_l makes learning slower if w_l is more certainly related to r (ie if ρ_l(t) is greater); the 0.1 substitutes for background noise; if δ²(t) is too large, then ρ_l + ρ_s is shared out in proportion of ρ² to capture the insight that there can be dramatic changes to variabilities; and the variabilities are bottom-limited. \n\n[Figure 1, three panels: a) Blocking and unblocking (x-axis: time to 2nd reward); b) Predictive variances: blocking (x-axis: trial); c) Predictive variances: unblocking (x-axis: trial).] \n\nFigure 1: Blocking and downwards unblocking with 5 steps to the first reward; and a variable number to the second. Here, the discount factor γ = 0.9, and α_w = 0.5, α_ρ = 0.02, μ = 0.75. For blocking, the second reward remains; for unblocking it is removed after 500 trials. a) The terminal weight for the sound after learning - for blocking it is always small and positive; for downwards unblocking, it changes from negative at small Δu to positive at large Δu. b,c) Predictive variances ρ_l(t) and ρ_s(t). In blocking, although there is a small change when the sound is introduced because of additivity of the variances, learning to the sound is substantially prevented. In downwards unblocking, the surprise omission of the second reward makes the sound associable and unblocks learning to it. 
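The weighting and weight update of the cooperative model can be illustrated with a single step. This is a minimal sketch under stated assumptions, not the simulation behind Figure 1: the function name cooperative_step and the dictionary layout are hypothetical, and the inverse variances are held fixed because their update rule is not reproduced here. Only the mixing proportions, the joint prediction error, and the weight update given above are implemented.

```python
def cooperative_step(w, rho, present, r, alpha_w=0.5):
    '''One weight update for the cooperative model with fixed rho.

    w, rho and present are dicts keyed by stimulus name: rho holds the
    inverse variances and present holds 0/1 presence indicators.
    '''
    shown = [k for k in w if present[k]]
    total = sum(rho[k] for k in shown)
    # pi_k = rho_k * sigma^2 with sigma^2 = 1 / (sum of presented rho);
    # a stimulus that is not presented gets pi_k = 0.
    pi = {k: (rho[k] / total if k in shown else 0.0) for k in w}
    prediction = sum(pi[k] * w[k] for k in w)
    delta = r - prediction            # joint prediction error
    # Weight change scaled by pi_k / rho_k: more certain stimuli move less.
    new_w = {k: w[k] + alpha_w * (pi[k] / rho[k]) * delta for k in w}
    return new_w, pi, delta


# Example: a well established light (high rho) and a novel tone (low rho).
w = {'l': 1.0, 's': 0.0}
rho = {'l': 3.0, 's': 1.0}
new_w, pi, delta = cooperative_step(w, rho, {'l': 1, 's': 1}, r=1.0)
```

In this example the light takes three quarters of the mixing weight, so the joint prediction is 0.75 and the error of 0.25 is what drives learning for both stimuli.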
\n\nFigure 1 shows the end point and course of learning in blocking and downwards unblocking. Figure 1a confirms that the model captures downwards unblocking, making the terminal value of w_s negative for short separations between the rewards and positive for long separations. By comparison, in the blocking condition, for which both rewards are always presented, w_s is always small and positive. Figures 1b,c show the basis behind this behaviour in terms of ρ_l(t) and ρ_s(t). In particular, the heightened associability of the sound in unblocking following the prediction error when the second reward is removed accounts for the behavior. \n\nAs for the mixture of experts model (and also for comparator models), the presence of π_l(t) and π_s(t) makes the explanation of inhibitory conditioning and overshadowing a little complicated. For instance, if the sound is associable (ρ_s(t) ≫ 0), then it can seem to act as a conditioned inhibitor even if w_s = 0. Nevertheless, unlike the mixture of experts model, the fact that learning is based on the joint prediction makes true inhibitory conditioning possible. \n\n4 Discussion \n\nDownwards unblocking may seem like an extremely abstruse paradigm with which to refute an otherwise successful and computationally sound model. However, it is just the tip of a conditioning iceberg that would otherwise sink TD. Even in other reinforcement learning applications of TD, there is no a priori reason why predictions should be made according to equation 2 - the other statistical models in equations 4 and 5 could also be used. Indeed, it is easy to generate circumstances in which these more competitive models will perform better. For the neurobiology, experiments on the behavior of the DA system in these conditioning tasks will help specify the models further. \n\nThe model is incomplete in various important ways. 
First, it makes no distinction between preparatory and consummatory conditioning (Konorski, 1967). There is evidence that the predictions a CS makes about the affective value of USs fall in a different class from the predictions it makes about the actual USs that appear. For instance, an inhibitory stimulus reporting the absence of expected delivery of food can block learning to the delivery of shock, implying that aversive events form a single class. The affective value forms the preparatory aspect, is likely what is reported by the DA cells, and perhaps controls orienting behavior, the characteristic reaction of animals to the conditioned stimuli that may provide an experimental handle on the attention they are paid. Second, the model does not use opponency (Konorski, 1967; Solomon & Corbit, 1974; Grossberg, 1982) to handle inhibitory conditioning. This is particularly important, since the dynamics of the interaction between the opponent systems may well be responsible for the importance of the delay u in downwards unblocking. Serotonin is an obvious candidate as an opponent system to DA (Montague et al, 1996). We also have not specified a substrate for the associabilities or the attentional competition - the DA system itself may well be involved. Finally, we have not specified an overall model of how the animal might expect the contingency of the world to change over time - which is key to the statistical justification of appropriate learning rules. \n\nReferences \n\n[1] Barto, AG, Sutton, RS & Anderson, CW (1983). IEEE Transactions on Systems, Man, and Cybernetics, 13, pp 834-846. \n[2] Bordley, RF (1982). Journal of the Operational Research Society, 33, 171-174. \n[3] Broadbent, DE (1958). Perception and Communication. London: Pergamon. \n[4] Buhusi, CV & Schmajuk, NA. Hippocampus, 6, 621-642. \n[5] Dempster, AP, Laird, NM & Rubin, DB (1977). 
Journal of the Royal Statistical Society, B-39, 1-38. \n[6] Dickinson, A (1980). Contemporary Animal Learning Theory. Cambridge, England: Cambridge University Press. \n[7] Gabriel, M & Moore, J, editors (1989). Learning and Computational Neuroscience. Cambridge, MA: MIT Press. \n[8] Gallagher, M & Holland, PC (1994). PNAS, 91, 11771-6. \n[9] Gluck, MA & Thompson, RF (1987). Psychological Review, 94, 176-191. \n[10] Grossberg, S (1982). Psychological Review, 89, 529-572. \n[11] Holland, PC (1988). Journal of Experimental Psychology: Animal Behavior Processes, 14, 261-279. \n[12] Jacobs, RA, Jordan, MI & Barto, AG (1991a). Cognitive Science, 15, 219-250. \n[13] Jacobs, RA, Jordan, MI, Nowlan, SJ & Hinton, GE (1991b). Neural Computation, 3, 79-87. \n[14] Konorski, J (1967). Integrative Activity of the Brain. Chicago, IL: Chicago University Press. \n[15] Mackintosh, NJ (1975). Psychological Review, 82, 276-298. \n[16] Mackintosh, NJ (1983). Conditioning and Associative Learning. Oxford, UK: Oxford University Press. \n[17] Marr, D (1982). Vision. New York, NY: Freeman. \n[18] Miller, RR & Matzel, LD (1989). In SB Klein & RR Mowrer, editors, Contemporary Learning Theories: Pavlovian Conditioning and the Status of Traditional Theory. Hillsdale, NJ: Lawrence Erlbaum. \n[19] Montague, PR, Dayan, P & Sejnowski, TJ (1996). Journal of Neuroscience, 16, 1936-1947. \n[20] Nowlan, SJ (1991). Soft Competitive Adaptation: Neural Network Learning Algorithms Based on Fitting Statistical Mixtures. PhD Thesis, Department of Computer Science, Carnegie-Mellon University. \n[21] Pearce, JM & Hall, G (1980). Psychological Review, 87, 532-552. \n[22] Rescorla, RA & Wagner, AR (1972). In AH Black & WF Prokasy, editors, Classical Conditioning II: Current Research and Theory, pp 64-69. New York, NY: Appleton-Century-Crofts. \n[23] Schultz, W, Dayan, P & Montague, PR (1997). 
Science, 275, 1593-1599. \n[24] Solomon, RL & Corbit, JD (1974). Psychological Review, 81, 119-145. \n[25] Sutton, RS & Barto, AG (1981). Psychological Review, 88(2), pp 135-170. \n[26] Sutton, RS & Barto, AG (1989). In Gabriel & Moore (1989). \n", "award": [], "sourceid": 1463, "authors": [{"given_name": "Peter", "family_name": "Dayan", "institution": null}, {"given_name": "Theresa", "family_name": "Long", "institution": null}]}