{"title": "Explaining Away in Weight Space", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 457, "abstract": null, "full_text": "Explaining Away in Weight Space \n\nPeter Dayan \n\nSham Kakade \n\nGatsby Computational Neuroscience Unit, UCL \n\n17 Queen Square London WCIN 3AR \n\nda ya n @ga t sby.u c l. ac . uk \n\nsham@ga t sby.u c l. ac .uk \n\nAbstract \n\nExplaining away has mostly been considered in terms of inference of \nstates in belief networks. We show how it can also arise in a Bayesian \ncontext in inference about the weights governing relationships such as \nthose between stimuli and reinforcers in conditioning experiments such \nas bacA,'Ward blocking. We show how explaining away in weight space \ncan be accounted for using an extension of a Kalman filter model; pro(cid:173)\nvide a new approximate way of looking at the Kalman gain matrix as a \nwhitener for the correlation matrix of the observation process; suggest \na network implementation of this whitener using an architecture due to \nGoodall; and show that the resulting model exhibits backward blocking. \n\n1 Introduction \n\nThe phenomenon of explaining away is commonplace in inference in belief networks. In \nthis, an explanation (a setting of activities of unobserved units) that is consistent with cer(cid:173)\ntain observations is accorded a low posterior probability if another explanation for the same \nobservations is favoured either by the prior or by other observations. Explaining away is \ntypically realized by recurrent inference procedures, such as mean field inference (see Jor(cid:173)\ndan, 1998). \n\nHowever, explaining away is not only important in the space of on-line explanations for \ndata; it is also important in the space of weights. This is a very general problem that we \nillustrate using a phenomenon from animal conditioning called bad 'Ward blocking (Shanks, \n1985; Miller & Matute, 1996). 
Conditioning paradigms are important because they provide a window onto processes of successful natural inference, which are frequently statistically normative. Backward blocking poses a very different problem from standard explaining away, and rather complex theories have been advanced to account for it (eg Wagner & Brandon, 1989). We treat it as a case for Kalman filtering, and suggest a novel network model for Kalman filtering to solve it. Consider three different conditioning paradigms associated with backward blocking: \n\nname | set 1 | set 2 | test \nforward | L→R | L,S→R | S→\u2022 \nbackward | L,S→R | L→R | S→\u2022 \nsharing | L,S→R | - | S→R/2 \n\nThese paradigms involve one or two sets of multiple learning trials (set 1 and set 2), in which stimuli (a light, L, and/or a sound, S) are conditioned to a reward (R), followed by a test phase, in which the strength of association between the sound S and the reward is assessed. This is found to be weak (\u2022) in forward and backward blocking, but stronger (R/2) in the sharing paradigm. The effect that concerns this paper occurs during the second set of trials in backward blocking, in which the association between the sound and the reward is weakened (compared with sharing), even though the sound is not presented during these trials. The apparent association between the sound and the reward established in the first set of trials is explained away in the second set of trials. \n\nThe standard explanation for this (Wagner's SOP model, see Wagner & Brandon, 1989) suggests that during the first set of trials, the light comes to predict the presence of the sound; and that during the second set of trials, the fact that the sound is expected (on the basis of the light, represented by the activation of 'opponent' sound units) but not presented, weakens the association between the sound and the reward. 
Not only does this suggestion lack a statistical basis, but also its network implementation requires that the activation of the opponent sound units weaken the weights from the standard sound units to the reward. It is unclear how this could work. \n\nIn this paper, we first extend the Kalman filter based conditioning theory of Sutton (1992) to the case of backward blocking. Next, we show the close relationship between the key quantity for a Kalman filter, namely the covariance matrix of the uncertainty about the relationship between the stimuli and the reward, and the symmetric whitening matrix for the stimuli. Then we show how the Goodall algorithm for whitening (Goodall, 1960; Atick & Redlich, 1993) makes for an appropriate network implementation of weight updates based on the Kalman filter. The final algorithm is a motivated mixture of unsupervised and reinforcement (or, equivalently in this case, supervised) learning. Last, we demonstrate backward blocking in the full model. \n\n2 The Kalman filter and classical conditioning \n\nSutton (1992) suggested that one can understand classical conditioning in terms of normative statistical inference. The idea is that on trial n there is a set of true weights w_n mediating the relationship between the presentation of stimuli x_n and the amount of reward r_n that is delivered, where \n\nr_n = w_n · x_n + ε_n    (1) \n\nand ε_n ~ N[0, τ²] is zero-mean Gaussian noise, independent from one trial to the next.¹ For the cases above, x_n = (x_n^L, x_n^S) might have two dimensions, one each for light and sound, taking on binary values representing the presence and absence of the stimuli. Similarly, w_n = (w_n^L, w_n^S) also has two dimensions. Crucially, to allow for the possibility (realized in most conditioning experiments) that the true weights might change, the model includes a diffusion term \n\nw_{n+1} = w_n + η_n    (2) \n\nwhere η_n ~ N[0, σ²I] is also Gaussian. 
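As a concrete illustration, the generative model of equations 1 and 2 can be simulated directly. This is a minimal sketch, not the paper's code: the true-weight values, trial count, and random seed are illustrative assumptions, and the values σ = 0.09, τ = 0.35 are borrowed from the parameters quoted later for Figure 2.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, tau = 0.09, 0.35            # drift and observation-noise std (Figure 2 values)
n_trials = 40

w_true = np.array([0.5, 0.5])      # hidden associative weights (w^L, w^S); illustrative
x = np.array([1.0, 1.0])           # light and sound both present on every trial
rewards = []
for n in range(n_trials):
    r = w_true @ x + tau * rng.standard_normal()       # equation 1: noisy reward
    rewards.append(r)
    w_true = w_true + sigma * rng.standard_normal(2)   # equation 2: weight diffusion

print(np.mean(rewards))
```

The animal never sees `w_true`; it only observes the stimulus and reward sequences, from which the filter below recovers a distribution over the weights.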
The task for the animal is to take observations of the stimuli {x_n} and rewards {r_n} and infer a distribution over w_n. Provided that the initial uncertainty can be captured as w_0 ~ N[0, Σ_0] for some covariance matrix Σ_0, inference takes the form of a standard recursive Kalman filter, for which P(w_n | r_1 ... r_{n-1}) ~ N[ŵ_n, Σ_n] and \n\nŵ_{n+1} = ŵ_n + (Σ_n · x_n) (r_n − ŵ_n · x_n) / (x_n · Σ_n · x_n + τ²)    (3) \n\nΣ_{n+1} = Σ_n + σ²I − (Σ_n · [x_n x_n] · Σ_n) / (x_n · Σ_n · x_n + τ²)    (4) \n\n¹ For vectors a, b and matrix C, a · b = Σ_i a_i b_i, a · C · b = Σ_{ij} a_i C_{ij} b_j, and the matrix [a b] is given by [a b]_{ij} = a_i b_j. \n\nIf Σ_n ∝ I, then the update for the mean can be seen as a standard delta rule (Widrow & Stearns, 1985; Rescorla & Wagner, 1972), involving the prediction error (or innovation) δ_n = r_n − ŵ_n · x_n. Note the familiar, but at first sight counterintuitive, result that the update for the covariance matrix does not depend on the innovation or the observed r_n.² \n\nIn backward blocking, in the first set of trials, the off-diagonal terms of the covariance matrix Σ_n become negative. This can either be seen from the form of the update equation for the covariance matrix (since x_n = (1,1)), or, more intuitively, from the fact that these trials imply a constraint only on w_n^L + w_n^S, therefore forcing ŵ_n^L and ŵ_n^S to be negatively correlated. The consequence of this negative correlation in the second set of trials is that the S component of Σ_n · x_n = Σ_n · (1,0) is less than 0, and so, via equation 3, ŵ_n^S reduces. This is exactly the result in backward blocking. Another way of looking at this is in terms of explaining away in weight space. From the first set of trials, the animal infers that w^L + w^S = R > 0; from the second, that the prediction owes to w^L rather than w^S, and so the old value w^S = R/2 is explained away by w^L. 
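The recursion of equations 3 and 4 can be run directly on the backward-blocking schedule. The following is an illustrative sketch rather than the paper's own code: the prior Σ_0 = I is an assumption, and σ and τ are the values quoted for Figure 2.

```python
import numpy as np

def kalman_step(w, S, x, r, sigma2, tau2):
    """One conditioning trial: mean update (equation 3), then covariance update (equation 4)."""
    denom = x @ S @ x + tau2
    w_new = w + (S @ x) * (r - w @ x) / denom                              # equation 3
    S_new = S + sigma2 * np.eye(len(x)) - np.outer(S @ x, S @ x) / denom   # equation 4 (S symmetric)
    return w_new, S_new

# Backward blocking: 20 trials of light+sound -> reward, then 20 of light alone -> reward.
sigma2, tau2 = 0.09 ** 2, 0.35 ** 2    # drift and noise variances (Figure 2 values)
w, S = np.zeros(2), np.eye(2)          # assumed prior w_0 ~ N[0, I]; index 0 = light, 1 = sound
for _ in range(20):
    w, S = kalman_step(w, S, np.array([1.0, 1.0]), 1.0, sigma2, tau2)
S_after_set1 = S.copy()                # off-diagonal term is negative after set 1
for _ in range(20):
    w, S = kalman_step(w, S, np.array([1.0, 0.0]), 1.0, sigma2, tau2)
w_final = w                            # light weight near 1; sound weight explained away
```

After the first set the off-diagonal term of Σ is negative, so the light-alone trials of the second set drive the sound weight back down, as described above.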
Sutton (1992) actually suggested the approximation of forcing the off-diagonal components of the covariance matrix Σ_n to be 0, which, of course, prevents the system from accounting for backward blocking. \n\nWe seek a network account of explaining away in the space of weights by implementing an approximate form of Kalman filtering. \n\n3 Whitening and the Kalman filter \n\nIn conventional applications of the Kalman filter, x_n would typically be constant. That is, the hidden state (w_n) would be observed through a fixed observation process. In cases such as classical conditioning, though, this is not true; we are interested in the case that x_n changes over time, possibly even in a random (though fully observable) way. The plan for this section is to derive an approximate relationship between the average covariance matrix over the weights Σ̄ and a whitening matrix for the stimulus inputs. In the next section, we consider an implementation of a particular whitening algorithm as an unsupervised way of estimating the covariance matrix for the Kalman filter and show how to use it to learn the weights w_n appropriately. \n\nConsider the case that the x_n are random, with correlation matrix ⟨x x⟩ = Q, and consider the mean covariance matrix Σ̄ for the Kalman filter, averaging across the variation in x. Make the approximation that \n\n⟨(Σ · [x x] · Σ) / (x · Σ · x + τ²)⟩ ≈ ⟨Σ · [x x] · Σ⟩ / ⟨x · Σ · x + τ²⟩ \n\nwhich is less drastic than it might first appear, since the denominator is just a scalar. Then, we can solve for the average of the asymptotic value of Σ̄ in the equation for the update of the Kalman filter as \n\nΣ̄ Q Σ̄ ∝ I    (5) \n\nThus Σ̄ is a whitening filter for the correlation matrix Q of the inputs {x}. Symmetric whitening filters (Σ̄ must be symmetric) are generally unique (Atick & Redlich, 1993). This result is very different from the standard relationship between Kalman filtering and whitening. 
The standard Kalman filter is a whitening filter for the innovations process δ_n = r_n − ŵ_n · x_n, ie it extracts all the systematic variation into ŵ_n, leaving only random variation due to the observation noise and the diffusion process. Equation 5 is an additional level of whitening, saying that one can look at the long-run average covariance matrix of the uncertainty in w_n as whitening the input process x_n. This is inherently unsupervised, in that whitening takes place without any reference to the observed rewards (or even the innovation). \n\n² Note also the use of the alternative form of the Kalman filter, in which we perform observation/conditioning followed by drift, rather than drift followed by observation/conditioning. \n\nFigure 1: Whitening. A) The lower curve shows the average maximum off-diagonal element of Σ̄QΣ̄ as a function of v. The upper curve shows the average maximum diagonal element of the same matrix. The off-diagonal components are around an order of magnitude smaller than the on-diagonal components, even in the difficult regime where v is near 0, and thus the matrix Q is nearly singular. B) Network model for Kalman filtering. Identity feedforward weights I map inputs x to a recurrent network y(t) whose output is used to make predictions. Learning of the recurrent weights B is based on Goodall's (1960) rule; learning of the prediction weights is based on the delta rule, only using y(0) to make the predictions and y(∞) to change the weights. \n\nGiven the approximation, we tested whether Σ̄ really whitens Q by generating x_n from a Gaussian distribution, with mean (1,1) and variance v²I, calculating the long-run average value of Σ̄, and assessing whether F = Σ̄QΣ̄ is white. 
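The numerical check just described can be sketched as follows. The trial count, burn-in, input spread v = 0.5, and random seed are illustrative assumptions; σ and τ are again the Figure 2 values.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, tau2 = 0.09 ** 2, 0.35 ** 2    # drift and noise variances (Figure 2 values)
v = 0.5                                 # input standard deviation (illustrative)
mean_x = np.array([1.0, 1.0])

S = np.eye(2)
S_sum = np.zeros((2, 2))
n_trials, burn_in = 5000, 1000
for n in range(n_trials):
    x = mean_x + v * rng.standard_normal(2)
    # Covariance update of equation 4 with random inputs
    S = S + sigma2 * np.eye(2) - np.outer(S @ x, S @ x) / (x @ S @ x + tau2)
    if n >= burn_in:                    # average after a burn-in period
        S_sum += S
S_bar = S_sum / (n_trials - burn_in)

Q = np.outer(mean_x, mean_x) + v ** 2 * np.eye(2)   # correlation matrix <x x> for this input
F = S_bar @ Q @ S_bar
off = abs(F[0, 1])
diag = max(F[0, 0], F[1, 1])
```

With settings like these, the off-diagonal element of F should come out well below the diagonal ones, in line with the relation of equation 5 and figure 1A.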
There is no unique measure for the deviation of F from being diagonal; as an example, figure 1A shows, as a function of v, the largest on- and off-diagonal elements of F. The figure shows that the off-diagonal components are comparatively very small, even when v is very small, for which Q has an eigenvalue very near to 0, making the whitening matrix nearly undefined. Equally, in this case, Σ_n tends to have very large values, since, looking at equation 4, the growth in uncertainty coming from σ²I is not balanced by any observation in the direction (1,−1) that is orthogonal to (1,1). \n\nOf course, only the long-run average covariance matrix Σ̄ whitens Q. We make the further approximation of using an on-line estimate of the symmetric whitening matrix as the on-line estimate of the covariance of the weights Σ_n. \n\n4 A network model \n\nFigure 1B shows a network model in which prediction weights ŵ_n adapt in a manner that is appropriately sensitive to a learned, on-line estimate of the whitening matrix. The network has two components: a mapping from input x to output y(t), via recurrent feedback weights B (the Goodall (1960) whitening filter), and a mapping from y, through a set of prediction weights w, to an estimate of the reward. The second part of the network is most straightforward. The feedforward weights from x to y are just the identity matrix I. Therefore, the initial value in the hidden layer in response to stimulus x_n is y(0) = x_n, and so the prediction of reward is just w · y(0) = w · x_n. \n\nThe first part of the network is a straightforward implementation of Goodall's whitening filter (Goodall, 1960; Atick & Redlich, 1993). The recurrent dynamics in the y-layer are taken as being purely linear. Therefore, in response to input x (propagated through the identity feedforward weights) \n\nτ dy/dt = −y + x + B y \n\nand so y(∞) = (I − B)⁻¹ x, provided that the inverse exists. 
Goodall's algorithm changes the recurrent weights B using local, anti-Hebbian learning, according to \n\nΔB ∝ −[x y] + I − B    (6) \n\nThis rule stabilizes on average when I = (I − B)⁻¹ Q [(I − B)⁻¹], that is, when (I − B)⁻¹ is a whitening filter for the correlation matrix Q of the inputs. If B is symmetric, which can be guaranteed by making B = 0 initially (Atick & Redlich, 1993), then, by convergence, we have (I − B)⁻¹ = Σ̄ and, given input x_n to the network, \n\nΣ̄ x_n = (I − B)⁻¹ x_n = y_n(∞). \n\nTherefore, we can implement a learning rule for the prediction weights akin to the Kalman filter (equation 3) using \n\nΔw_n ∝ y_n(∞) (r_n − w_n · y_n(0))    (7) \n\nThis is the standard delta rule, except that the predictions are based on y_n(0) = x_n, whereas the weight changes are based on y_n(∞) = Σ̄ x_n. The learning rule gets wrong the absolute magnitude of the weight changes (since it lacks the x_n · Σ_n · x_n + τ² term in the denominator), but it gets right the direction of the changes. \n\n5 Results \n\nFigure 2 shows the result of learning in backward blocking. In association with r_n = 1, first stimulus x_n = (1,1) was presented for 20 trials, then stimulus x_n = (1,0) was presented for a further 20 trials. Figure 2A shows the development of the weights w_n^L (solid) and w_n^S (dashed). During the first set of trials, these grow towards 0.5; during the second set, they differentiate sharply, with the weight associated with the light growing towards 1, and that with the sound, which is explained away, falling towards 0. Figure 2B shows the development of two terms in the estimated covariance matrix. The negative covariance between light and sound is evident, and causes the sharp changes in the weights on the 21st trial. Figures 2C & D show the values using the exact Kalman filter, showing qualitatively similar behavior. 
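The full network model (equations 6 and 7) can be simulated on this backward-blocking schedule. This is an illustrative sketch under the assumptions stated in the text: B starts at 0, both learning rates are the 0.125 quoted for Figure 2, and the trial structure is 20 light+sound trials followed by 20 light-alone trials.

```python
import numpy as np

eps = 0.125                  # learning rate for both rules (Figure 2 caption)
I2 = np.eye(2)
B = np.zeros((2, 2))         # recurrent weights initialised at 0
w = np.zeros(2)              # prediction weights (w^L, w^S)

def trial(B, w, x, r):
    y_inf = np.linalg.solve(I2 - B, x)             # settled activity y(inf) = (I - B)^{-1} x
    delta = r - w @ x                              # prediction error uses y(0) = x
    w = w + eps * delta * y_inf                    # equation 7: delta rule along y(inf)
    B = B + eps * (-np.outer(x, y_inf) + I2 - B)   # equation 6: Goodall's anti-Hebbian rule
    return B, w

for _ in range(20):                                # set 1: light + sound -> reward
    B, w = trial(B, w, np.array([1.0, 1.0]), 1.0)
w_after_set1 = w.copy()
for _ in range(20):                                # set 2: light alone -> reward
    B, w = trial(B, w, np.array([1.0, 0.0]), 1.0)
```

On this schedule both weights first grow together towards 0.5; on the 21st trial y(∞) acquires a large negative sound component, so the sound weight is explained away while the light weight heads for 1, mirroring Figure 2A.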
\n\nThe increases in the magnitudes of Σ_n^{LL} and Σ_n^{LS} during the first stage of backward blocking come from the lack of information in the input about w_n^L − w_n^S, despite its continual diffusion (from equation 2). Thus backward blocking is a pathological case. Nevertheless, the on-line method for estimating Σ captures the correct behavior. Figures 2E-H show a non-pathological case with observation noise added. The estimates from the model closely match those of the exact Kalman filter, a result that is also true for other non-pathological cases. \n\n6 Discussion \n\nWe have shown how the standard Kalman filter produces explaining away in the space of weights, and suggested, and proved efficacious, a natural network model for implementing the Kalman filter. The model mixes unsupervised learning of a whitener for the observation process (ie the x_n of equation 1), providing the covariance matrix governing the uncertainty in the weights, with supervised (or equivalently reinforcement) learning of the mean values of the weights. Unsupervised learning is reasonable since the evolution of the covariance matrix of the weights is independent of the innovations. The basic result is an \n\nFigure 2: Backward blocking in the full model. A) The development of w over 20 trials with x_n = (1,1) and 20 with x_n = (1,0). 
B) The development of the estimated covariance of the weight for the light, Σ_n^{LL}, and the cross-covariance between the light and the sound, Σ_n^{LS}. The learning rates in equations 6 and 7 were both 0.125. C & D) The development of w and Σ from the exact Kalman filter, with parameters σ = 0.09 and τ = 0.35. E) The development of w as in A), except with multiplicative Gaussian noise added (ie noise with standard deviation 0.35 is added only to the representations of stimuli that are present). F & G) The comparison of w in the model (solid line) and in the exact Kalman filter (dashed line), using the same parameters for the Kalman filter as in C) and D). H) A comparison of the true covariance Σ_n (dashed line) with the rescaled estimate (I − B)⁻¹ (solid line). \n\napproximation, but one that has been shown to match results quite closely. Further work is needed to understand how to set the parameters of the Goodall learning rule to match σ² and τ² exactly. \n\nHinton (personal communication) has suggested an alternative interpretation of Kalman filtering based on a heteroassociative novelty filter. Here, the idea is to use the recurrent network B only once, rather than to equilibrium, with (as for our model) y_n(0) = x_n, the prediction v = w_n · y_n(0), y_n(1) = B_n · x_n, and \n\nΔw_n ∝ y_n(1) (r_n − w_n · y_n(0)). \n\nThis gives B_n a similar role to Σ_n in learning w_n. For the novelty filter, \n\nΔB_n = − (B_n · [x_n x_n] · B_n) / |B_n · x_n|², \n\nwhich makes the network a perfect heteroassociator between x_n and r_n. If we compare the update for B_n to that for Σ_n (equation 4), we can see that it amounts approximately to assuming neither observation noise nor drift. Thus, whereas our network model approximates the long-run covariance matrix, the novelty filter approximates the instantaneous covariance matrix directly, and could clearly be adapted to take account of noise. 
Unfortunately, there are few quantitatively precise experimental results on backward blocking, so it is hard to choose between different possible rules. \n\nThere is a further alternative. Sutton (1992) suggested an on-line way of estimating the elements of the covariance matrix, observing that \n\nE[δ_n²] = τ² + x_n · Σ_n · x_n    (8) \n\nand so considered using a standard delta rule to fit the squared innovation using a quadratic input representation ((x_n^L)², (x_n^S)², x_n^L x_n^S, 1).³ The weight associated with the last element, ie the bias, should come to be the observation noise τ²; the weights associated with the other elements are just the components of Σ_n. The most critical concern about this is that it is not obvious how to use the resulting covariance matrix to control learning about the mean values of the weights. There is also the more theoretical concern that the covariance matrix should really be independent of the prediction errors, one manifestation of which is that the occurrence of backward blocking in the model of equation 8 is strongly sensitive to initial conditions. \n\n³ Although the x_n^L x_n^S term was omitted from Sutton's diagonal approximation to Σ_n. \n\nAlthough backward blocking is a robust phenomenon, particularly in human conditioning experiments (Shanks, 1985), it is not observed in all animal conditioning paradigms. One possibility for why not is that the anatomical substrate of the cross-modal recurrent network (the B weights in the model) is not ubiquitously available. In its absence, y(∞) = y(0) = x_n in response to an input x_n, and so the network will perform like the standard delta or Rescorla-Wagner (Rescorla & Wagner, 1972) rule. \n\nThe Kalman filter is only one part of a more complicated picture for statistically normative models of conditioning. 
It makes for a particularly clear example of what is incomplete about some of our own learning rules (notably Kakade & Dayan, 2000), which suggest that, at least in some circumstances, learning about the two different stimuli should progress completely independently. We are presently trying to integrate on-line and learned competitive and additive effects using ideas from mixture models and Kalman filters. \n\nAcknowledgements \n\nWe are very grateful to David Shanks, Rich Sutton, Read Montague and Terry Sejnowski for discussions of the Kalman filter model and its relationship to backward blocking, and to Sam Roweis for comments on the paper. This work was funded by the Gatsby Charitable Foundation and the NSF. \n\nReferences \n\nAtick, JJ & Redlich, AN (1993) Convergent algorithm for sensory receptive field development. Neural Computation 5:45-60. \n\nGoodall, MC (1960) Performance of a stochastic net. Nature 185:557-558. \n\nJordan, MI, editor (1998) Learning in Graphical Models. Dordrecht: Kluwer. \n\nKakade, S & Dayan, P (2000) Acquisition in autoshaping. In SA Solla, TK Leen & K-R Müller, editors, Advances in Neural Information Processing Systems, 12. \n\nMiller, RR & Matute, H (1996) Biological significance in forward and backward blocking: Resolution of a discrepancy between animal conditioning and human causal judgment. Journal of Experimental Psychology: General 125:370-386. \n\nRescorla, RA & Wagner, AR (1972) A theory of Pavlovian conditioning: The effectiveness of reinforcement and non-reinforcement. In AH Black & WF Prokasy, editors, Classical Conditioning II: Current Research and Theory. New York: Appleton-Century-Crofts, 64-69. \n\nShanks, DR (1985) Forward and backward blocking in human contingency judgement. Quarterly Journal of Experimental Psychology: Comparative & Physiological Psychology 37:1-21. \n\nSutton, RS (1992) Gain adaptation beats least squares? 
In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems. \n\nWagner, AR & Brandon, SE (1989) Evolution of a structured connectionist model of Pavlovian conditioning (AESOP). In SB Klein & RR Mowrer, editors, Contemporary Learning Theories. Hillsdale, NJ: Erlbaum, 149-189. \n\nWidrow, B & Stearns, SD (1985) Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall. \n", "award": [], "sourceid": 1852, "authors": [{"given_name": "Peter", "family_name": "Dayan", "institution": null}, {"given_name": "Sham", "family_name": "Kakade", "institution": null}]}