{"title": "The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters", "book": "Advances in Neural Information Processing Systems", "page_first": 1585, "page_last": 1592, "abstract": null, "full_text": " The Rescorla-Wagner algorithm and Maximum Likelihood estimation of causal parameters.\n\n
 Alan Yuille\n Department of Statistics\n University of California at Los Angeles\n Los Angeles, CA 90095\n yuille@stat.ucla.edu\n\n
 Abstract\n\n
 This paper analyzes generalizations of the classic Rescorla-Wagner (R-W) learning algorithm and studies their relationship to Maximum Likelihood estimation of causal parameters. We prove that the parameters of two popular causal models, ΔP and Power PC, can be learnt by the same generalized linear Rescorla-Wagner (GLRW) algorithm provided genericity conditions apply. We characterize the fixed points of these GLRW algorithms and calculate the fluctuations about them, assuming that the input is a set of i.i.d. samples from a fixed (unknown) distribution. We describe how to determine convergence conditions and calculate convergence rates for the GLRW algorithms under these conditions.\n\n
1 Introduction\n\n
There has recently been growing interest in models of causal learning formulated as probabilistic inference [1,2,3,4,5]. There has also been considerable interest in relating this work to the Rescorla-Wagner learning model [3,5,6] (also known as the delta rule). In addition, there are studies of the equilibria of the Rescorla-Wagner model [6].\n\n
This paper proves mathematical results about these related topics. In Section (2), we describe two influential models, ΔP and Power PC, for causal inference and show how their parameters can be learnt by maximum likelihood estimation from training data. Section (3) introduces the generalized linear Rescorla-Wagner (GLRW) algorithm, characterizes its fixed points, and quantifies its fluctuations. 
We demonstrate that a simple GLRW can estimate the ML parameters for both the ΔP and Power PC models provided certain genericity conditions are satisfied. But the experimental conditions studied by Cheng [2] require a non-linear generalization of Rescorla-Wagner (Yuille, in preparation). Section (4) gives a way to determine convergence conditions and calculate the convergence rates of GLRW algorithms. Finally, Section (5) sketches how the results in this paper can be extended to allow for an arbitrary number of causes.\n\n
2 Causal Learning and Probabilistic Inference\n\n
The task is to estimate the causal effect of variables. There is an observed event E and two causes C1, C2. Observers are asked to determine the causal power of the two causes. The variables are binary-valued: E = 1 means the event occurs, E = 0 means it does not, and similarly for the causes C1 and C2. Much of the work in this section can be generalized to cases where there is an arbitrary number of causes C1, C2, ..., CN, see Section (5). The training data {(Eμ, C1μ, C2μ)} is assumed to be samples from an unknown distribution Pemp(E, C1, C2).\n\n
Two simple models, ΔP [1] and Power PC [2,3], have been proposed to account for how people estimate causal power. There is also a more recent theory based on model selection [4].\n\n
The ΔP and Power PC theories are equivalent to assuming probability distributions for how the training data is generated. The power of the causes is then given by the maximum likelihood estimates of the distribution parameters ω1, ω2. The two theories correspond to probability distributions PΔP(E|C1, C2; ω1, ω2) and PPC(E|C1, C2; ω1, ω2) given by:\n\n
 PΔP(E = 1|C1, C2; ω1, ω2) = ω1C1 + ω2C2. ΔP model. (1)\n
 PPC(E = 1|C1, C2; ω1, ω2) = ω1C1 + ω2C2 − ω1ω2C1C2. Power PC model. (2)\n\n
The latter is a noisy-or model: the event E = 1 can be caused by C1 = 1 with probability ω1, by C2 = 1 with probability ω2, or by both. 
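The noisy-or generative process can be sampled directly (a minimal sketch; the parameter values ω1 = 0.8, ω2 = 0.4 are illustrative choices, not taken from the paper):

```python
import random

# Noisy-or (Power PC) generative model, eq. (2):
# each present cause independently produces E with its own causal power.
def sample_E(C1, C2, w1=0.8, w2=0.4, rng=random.Random(0)):
    e1 = C1 and rng.random() < w1   # C1 produces E with prob. w1
    e2 = C2 and rng.random() < w2   # C2 produces E with prob. w2
    return int(e1 or e2)            # E = 1 if either cause fires

# Empirically, P(E=1|C1=1,C2=1) = w1 + w2 - w1*w2 = 0.88 for these values
freq = sum(sample_E(1, 1) for _ in range(100000)) / 100000
print(freq)
```

Averaging many samples recovers the probability ω1 + ω2 − ω1ω2 in equation (2), which is the "caused by either or both" reading made explicit.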
The model can be derived by setting PPC(E = 0|C1, C2; ω1, ω2) = (1 − ω1C1)(1 − ω2C2).\n\n
We assume that there is also a distribution P(C1, C2|ψ) on the causes, which the observers also learn from the training data. This is equivalent to maximizing (with respect to ω1, ω2, ψ):\n\n
 P({(Eμ, Cμ)} : ω, ψ) = Πμ P(Eμ, Cμ : ω, ψ) = Πμ P(Eμ|Cμ : ω)P(Cμ : ψ). (3)\n\n
By taking logarithms, we see that the estimation of ω1, ω2 and the estimation of ψ are independent, so we concentrate on estimating ω1, ω2.\n\n
If the training data {Eμ, Cμ} is consistent with the model, i.e. there exist parameters ω1, ω2 such that Pemp(E|C1, C2) = P(E|C1, C2; ω1, ω2), then we can calculate the solution directly.\n\n
For the ΔP model, we have:\n\n
 ω1 = Pemp(E = 1|C1 = 1, C2 = 0), ω2 = Pemp(E = 1|C1 = 0, C2 = 1). (4)\n\n
For the Power PC model, we obtain Cheng's measures of causality [2,3]:\n\n
 ω1 = {Pemp(E = 1|C1 = 1, C2) − Pemp(E = 1|C1 = 0, C2)} / {1 − Pemp(E = 1|C1 = 0, C2)},\n
 ω2 = {Pemp(E = 1|C1, C2 = 1) − Pemp(E = 1|C1, C2 = 0)} / {1 − Pemp(E = 1|C1, C2 = 0)}. (5)\n\n
3 Generalized Linear Rescorla-Wagner\n\n
The Rescorla-Wagner model [7] is an alternative account of human learning. This iterative algorithm specifies an update rule for weights. The weights can measure the strength of a cause, such as the parameters estimated by Maximum Likelihood. Following recent work [3,6], we seek relationships between generalized linear Rescorla-Wagner (GLRW) algorithms and ML estimation.\n\n
3.1 GLRW and two special cases\n\n
The Rescorla-Wagner algorithm updates weights {V} using the training data {Eμ, Cμ}. It is of the form:\n\n
 V^{t+1} = V^t + ΔV^t. (6)\n\n
In this paper we are particularly concerned with two special cases for the choice of the update ΔV:\n\n
 ΔV1 = α1C1(E − C1V1 − C2V2), ΔV2 = α2C2(E − C1V1 − C2V2), basic, (7)\n
 ΔV1 = α1C1(1 − C2)(E − V1), ΔV2 = α2C2(1 − C1)(E − V2), variant. (8)\n\n
The first (7) is the basic RW algorithm. 
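Both updates can be simulated directly. In the sketch below (writing the learning rates as α1, α2 and the true causal powers as ω1, ω2; all numeric values are illustrative assumptions), the data is generated by the noisy-or model:

```python
import numpy as np

# Minimal simulation of the two Rescorla-Wagner updates (7) and (8)
# on i.i.d. data generated by the noisy-or (Power PC) model.
rng = np.random.default_rng(0)
w1, w2 = 0.8, 0.4          # true causal powers
a1, a2 = 0.05, 0.05        # learning rates
T = 20000

Vb = np.zeros(2)           # weights under the basic rule (7)
Vv = np.zeros(2)           # weights under the variant rule (8)
for _ in range(T):
    C1, C2 = rng.random(2) < 0.5            # each cause present with prob. 1/2
    pE = w1*C1 + w2*C2 - w1*w2*C1*C2        # noisy-or: P(E=1|C1,C2)
    E = rng.random() < pE
    # basic rule (7): both weights share one prediction error
    err = E - C1*Vb[0] - C2*Vb[1]
    Vb += [a1*C1*err, a2*C2*err]
    # variant rule (8): V1 updated only when C1=1 and C2=0, and vice versa
    Vv[0] += a1*C1*(1-C2)*(E - Vv[0])
    Vv[1] += a2*C2*(1-C1)*(E - Vv[1])

print("variant rule:", Vv)   # fluctuates about (w1, w2)
print("basic rule:  ", Vb)   # biased away from (w1, w2) on noisy-or data
```

On noisy-or data the variant weights hover near the causal powers while the basic weights do not, which previews the fixed-point analysis below.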
The second (8) is a variant of RW with a natural interpretation: a weight V1 is updated only if one cause is present, C1 = 1, and the other cause is absent, C2 = 0.\n\n
The most general GLRW is of the form:\n\n
 ΔVi^t = ∑_{j=1}^N Vj^t fij(E^t, C^t) + gi(E^t, C^t), ∀i, (9)\n\n
where {fij(.,.) : i, j = 1, ..., N} and {gi(.) : i = 1, ..., N} are functions of the data samples E, C.\n\n
3.2 GLRW and Stochastic Samples\n\n
Our analysis assumes that the data samples {(E^t, C^t)} are independent identically distributed (i.i.d.) samples from an unknown distribution Pemp(E|C)P(C).\n\n
In this case the GLRW becomes stochastic. It defines a distribution over the weights which is updated as follows:\n\n
 P(V^{t+1}|V^t) = ∑_{E^t, C^t} Π_{i=1}^N δ(Vi^{t+1} − Vi^t − ΔVi^t) P(E^t, C^t). (10)\n\n
This defines a Markov chain. If certain conditions are satisfied (see Section (4)), the chain will converge to a fixed distribution P*(V). This distribution can be characterized by its expected mean ⟨V⟩ = ∑_V V P*(V) and its expected covariance Σ = ∑_V (V − ⟨V⟩)(V − ⟨V⟩)^T P*(V). In other words, even after convergence the weights fluctuate about the expected mean ⟨V⟩, and the magnitude of the fluctuations is given by the expected covariance.\n\n
3.3 What Does GLRW Converge to?\n\n
We now compute the means and covariance of the fixed point distribution P*(V). We first do this for the general GLRW, equation (9), and then restrict ourselves to the two special cases, equations (7,8).\n\n
Theorem 1. 
The means V* and the covariance Σ of the fixed point distribution P*(V), using the GLRW equation (9) and any empirical distribution Pemp(E, C), are given by the solutions of the linear equations:\n\n
 ∑_{j=1}^N Vj* ∑_{E,C} fij(E, C)Pemp(E, C) + ∑_{E,C} gi(E, C)Pemp(E, C) = 0, ∀i, (11)\n\n
and, ∀i, k:\n\n
 Σik = ∑_{j,l} Σjl ∑_{E,C} Aij(E, C)Akl(E, C)Pemp(E, C) + ∑_{E,C} Bi(E, C)Bk(E, C)Pemp(E, C), (12)\n\n
where Aij(E, C) = δij + fij(E, C) and Bi(E, C) = ∑_j Vj* fij(E, C) + gi(E, C) (here δij is the Kronecker delta, defined by δij = 1 if i = j and δij = 0 otherwise). The means have a unique solution provided ∑_{E,C} Pemp(E, C)fij(E, C) is an invertible matrix.\n\n
Proof. We derive the formula for the means V* by taking the expectation of the update rule, equation (9), with respect to P*(V) and Pemp(E, C). To calculate the covariances, we express the update rule as:\n\n
 Vi^{t+1} − Vi* = ∑_j (Vj^t − Vj*) Aij(E, C) + Bi(E, C), ∀i, (13)\n\n
with Aij(E, C) and Bi(E, C) defined as above. We then multiply both sides of equation (13) by their transposes (e.g. the left hand side by Vk^{t+1} − Vk*) and take the expectation with respect to P*(V) and Pemp(E, C), making use of the result that the expected value of Vj^t − Vj* tends to zero as t → ∞.\n\n
We can apply these results to study the behaviour of the two special cases, equations (7,8), when the data is generated by either the ΔP or the Power PC model.\n\n
First consider the basic RW algorithm (7) when the data is generated by the ΔP model. We can use Theorem 1 to rederive the result that ⟨V⟩ = ω [3,6], and so basic RW performs ML estimation for the ΔP model. It also follows directly that if the data is generated by the Power PC model, then ⟨V⟩ ≠ ω (although the two are related by a nonlinear equation).\n\n
Now consider the variant RW, equation (8).\n\n
Theorem 2. 
The expected means of the fixed points of the variant RW, equation (8), when the data is generated by the probability model PPC(E|C; ω) or PΔP(E|C; ω), are given by:\n\n
 V1* = ω1, V2* = ω2, (14)\n\n
provided Pemp(C) satisfies the genericity condition ⟨C1(1 − C2)⟩⟨C2(1 − C1)⟩ ≠ 0.\n\n
The expected covariances are given by:\n\n
 Σ11 = ω1(1 − ω1) α1/(2 − α1), Σ22 = ω2(1 − ω2) α2/(2 − α2), Σ12 = Σ21 = 0. (15)\n\n
Proof. This is a direct calculation of the quantities specified in Theorem 1. For example, we calculate the expected values of ΔV1 and ΔV2 first with respect to P(E|C) and then with respect to P*(V). This gives:\n\n
 ⟨ΔV1⟩_{P(E|C)P*(V)} = α1C1(1 − C2)(ω1 − V1*),\n
 ⟨ΔV2⟩_{P(E|C)P*(V)} = α2C2(1 − C1)(ω2 − V2*), (16)\n\n
where we have used ∑_V P*(V)V = V*, ∑_E E PPC(E|C) = ω1C1 + ω2C2 − ω1ω2C1C2, and logical relations to simplify the terms (e.g. C1² = C1, C1(1 − C1) = 0).\n\n
Taking the expectation with respect to P(C) gives:\n\n
 α1ω1⟨C1(1 − C2)⟩_{P(C)} − α1V1*⟨C1(1 − C2)⟩ = 0,\n
 α2ω2⟨C2(1 − C1)⟩_{P(C)} − α2V2*⟨C2(1 − C1)⟩ = 0, (17)\n\n
and the result follows directly, except for the non-generic cases where ⟨C1(1 − C2)⟩ = 0 or ⟨C2(1 − C1)⟩ = 0. These degenerate cases are analyzed separately.\n\n
It is perhaps surprising that the same GLRW algorithm performs ML estimation whether the data is generated by the model PΔP or by PPC (and this can be generalized, see Section (5)). Moreover, the expected covariance is the same for both models. Observe that the covariance decreases if we make the update coefficients α1, α2 of the algorithm small. The convergence rates are given in the next section.\n\n
The non-generic cases include the situation studied in [2], where C1 is a background cause that is assumed to be always present, so ⟨C1⟩ = 1. In this case V1* = ω1, but V2* is unspecified. 
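This degeneracy is visible directly in the update rule (8): when C1 = 1 on every trial, the factor C2(1 − C1) vanishes, so V2 is never touched (a minimal check; the numeric values are illustrative assumptions):

```python
import numpy as np

# Non-generic case: background cause C1 is always present, so <C1> = 1.
# Then C2*(1 - C1) = 0 on every trial and the variant rule never updates V2.
rng = np.random.default_rng(1)
w1, w2, a = 0.3, 0.6, 0.05
V = np.zeros(2)
for _ in range(5000):
    C1, C2 = 1, int(rng.random() < 0.5)
    pE = w1*C1 + w2*C2 - w1*w2*C1*C2      # noisy-or generative model
    E = int(rng.random() < pE)
    V[0] += a*C1*(1-C2)*(E - V[0])        # fires only on C2 = 0 trials
    V[1] += a*C2*(1-C1)*(E - V[1])        # factor (1-C1) is always 0 here

print(V)   # V[0] hovers near w1 = 0.3; V[1] stays exactly at its initial value
```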
It can be shown (Yuille, in preparation) that a nonlinear generalization of RW can perform ML on this problem (and it is easy to check that no GLRW can). An even more ambiguous case occurs when ω1 = 1 (i.e. cause C1 always causes event E); then there is no way to estimate ω2, and Cheng's measure of causality, equation (5), becomes undefined.\n\n
4 Convergence of Rescorla-Wagner\n\n
We now analyze the convergence of the GLRW algorithm. We obtain conditions for the algorithm to converge and give the convergence rates. For simplicity, the results will be illustrated only on the simple models.\n\n
Our results are based on the following theorem for the convergence of the state vector of a stochastic iterative equation. The theorem gives necessary and sufficient conditions for convergence, shows what the expected state vector converges to, and gives the rate of convergence.\n\n
Theorem 3. Let z^{t+1} = A^t z^t be an iterative update equation, where z is a state vector and the update matrices A^t are i.i.d. samples from P(A). The convergence properties as t → ∞ depend on ⟨A⟩ = ∑_A A P(A). If ⟨A⟩ has a unit eigenvalue with eigenvector z* and the next largest eigenvalue has modulus λ < 1, then lim_{t→∞} ⟨z^t⟩ ∝ z* and the rate of convergence is e^{t log λ}. If the moduli of the eigenvalues of ⟨A⟩ are all less than 1, then lim_{t→∞} ⟨z^t⟩ = 0. If ⟨A⟩ has an eigenvalue with modulus greater than 1, then ⟨z^t⟩ diverges as t → ∞.\n\n
Proof. This is a standard result. To obtain it, write z^{t+1} = A^t A^{t−1} ... A^1 z^1, where z^1 is the initial condition. Now take the expectation of z^{t+1} with respect to the samples {A^t}. By the i.i.d. assumption, this gives ⟨z^{t+1}⟩ = ⟨A⟩^t z^1. The result follows by linear algebra: let the eigenvectors and eigenvalues of ⟨A⟩ be {(λi, ei)}, and express the initial condition as z^1 = ∑_i βi ei, where the {βi} are coefficients. Then ⟨z^t⟩ = ∑_i βi λi^t ei, and the result follows.\n\n
We use Theorem 3 to obtain convergence results for the GLRW algorithm. 
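The key identity in Theorem 3, ⟨z^t⟩ = ⟨A⟩^{t−1} z^1, can be checked numerically on a small example (the two matrices below are arbitrary illustrative choices, not taken from the paper):

```python
import numpy as np

# Monte Carlo check of Theorem 3: the expected state of z_{t+1} = A_t z_t
# with i.i.d. A_t equals <A>^t z_1.
rng = np.random.default_rng(2)
A0 = np.array([[0.9, 0.1], [0.0, 0.5]])
A1 = np.array([[0.7, 0.3], [0.2, 0.4]])
p = 0.5                                  # P(A = A0) = P(A = A1) = 1/2
A_mean = p*A0 + (1-p)*A1
z1 = np.array([1.0, 1.0])
t, runs = 10, 50000

Z = np.tile(z1, (runs, 1))               # one sample path per row
for _ in range(t):
    pick = rng.random(runs) < p          # draw A_t independently per path
    Z = np.where(pick[:, None], Z @ A0.T, Z @ A1.T)
mc = Z.mean(axis=0)                      # Monte Carlo estimate of <z_{t+1}>

exact = np.linalg.matrix_power(A_mean, t) @ z1
print(mc, exact)                         # agree up to Monte Carlo error
```

Note that the identity holds for the *expected* state only; individual sample paths spread around it, which is why the fluctuation analysis below tracks second moments as well.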
To ensure convergence, we need both the expected covariance and the expected means to converge; then Markov's lemma can be used to bound the fluctuations. (If we only require the expected means to converge, then the fluctuations of the weights may be infinitely large.) This can be achieved by a suitable choice of the state vector z.\n\n
For simplicity of algebra, we demonstrate this for a GLRW algorithm with a single weight. The update rule is V_{t+1} = a_t V_t + b_t, where a_t, b_t are random samples. We define the state vector to be z_t = (V_t², V_t, 1).\n\n
Theorem 4. Consider the stochastic update rule V_{t+1} = a_t V_t + b_t, where a_t and b_t are samples from distributions Pa(a) and Pb(b). Define γ1 = ∑_a a²P(a), γ2 = ∑_a aP(a), μ1 = ∑_b b²P(b), μ2 = ∑_b bP(b), and η = 2∑_{a,b} abP(a, b). The algorithm converges if, and only if, γ1 < 1 and |γ2| < 1. If so, then lim_{t→∞} ⟨V_t⟩ = ⟨V⟩ = μ2/(1 − γ2) and lim_{t→∞} ⟨(V_t − ⟨V⟩)²⟩ = {μ1(1 − γ2) + ημ2}/{(1 − γ1)(1 − γ2)} − μ2²/(1 − γ2)². The convergence rate is {max{γ1, |γ2|}}^t.\n\n
Proof. Define z_t = (V_t², V_t, 1) and express the update rule in matrix form:\n\n
 (V_{t+1}², V_{t+1}, 1)^T = A_t (V_t², V_t, 1)^T, A_t = [[a_t², 2a_t b_t, b_t²], [0, a_t, b_t], [0, 0, 1]].\n\n
This is of the form analyzed in Theorem 3 provided we set:\n\n
 ⟨A⟩ = [[γ1, η, μ1], [0, γ2, μ2], [0, 0, 1]],\n\n
with γ1, γ2, μ1, μ2 and η as defined above.\n\n
The eigenvalues {λ} and eigenvectors {e} of ⟨A⟩ are:\n\n
 λ1 = 1, e1 ∝ ({μ1(1 − γ2) + ημ2}/{(1 − γ1)(1 − γ2)}, μ2/(1 − γ2), 1),\n
 λ2 = γ1, e2 = (1, 0, 0), λ3 = γ2, e3 ∝ (η/(γ2 − γ1), 1, 0). (18)\n\n
The result follows from Theorem 3.\n\n
Observe that if |γ2| < 1 but γ1 > 1, then ⟨V_t⟩ converges but the expected variance does not: the fluctuations of the GLRW algorithm will be infinitely large.\n\n
We can extend Theorem 4 to the variant of RW, equation (8). Let P = Pemp, and define:\n\n
 Ω12 = ∑_{E,C} P(E|C)P(C)C1(1 − C2), Ω21 = ∑_{E,C} P(E|C)P(C)C2(1 − C1),\n
 Φ12 = ∑_{E,C} P(E|C)P(C)EC1(1 − C2), Φ21 = ∑_{E,C} P(E|C)P(C)EC2(1 − C1). (19)\n\n
If the data is generated by PΔP or PPC, then Ω12, Ω21, Φ12, Φ21 take the same values:\n\n
 Ω12 = ⟨C1(1 − C2)⟩, Ω21 = ⟨(1 − C1)C2⟩,\n
 Φ12 = ω1⟨C1(1 − C2)⟩, Φ21 = ω2⟨(1 − C1)C2⟩. (20)\n\n
Theorem 5. The algorithm specified by equation (8) converges provided λ = max{|λ2|, |λ3|, |λ4|, |λ5|} < 1, where λ2 = 1 − (2α1 − α1²)Ω12, λ3 = 1 − (2α2 − α2²)Ω21, λ4 = 1 − α1Ω12, and λ5 = 1 − α2Ω21. The convergence rate is e^{t log λ}. The expected means and covariances can be calculated from the first eigenvector.\n\n
Proof. We define the state vector z = (V1², V2², V1, V2, 1) and derive the update matrix A from equation (8). The eigenvectors and eigenvalues can be calculated (calculations omitted due to space constraints). The eigenvalues are 1, λ2, λ3, λ4, λ5. The convergence conditions and rates follow from Theorem 3. The expected means and covariances can be calculated from the first eigenvector, which is:\n\n
 e1 = ( α1²Φ12/{(2α1 − α1²)Ω12} + 2α1(1 − α1)Φ12²/{(2α1 − α1²)Ω12²},\n
 α2²Φ21/{(2α2 − α2²)Ω21} + 2α2(1 − α2)Φ21²/{(2α2 − α2²)Ω21²},\n
 Φ12/Ω12, Φ21/Ω21, 1 ), (21)\n\n
and these agree with the calculations given in Theorem 2.\n\n
5 Generalization\n\n
The results of the previous sections can be generalized to cases where there are more than two causes. For example, we can use the generalization of the Power PC model to include multiple generative causes C and preventative causes L [5], extending [2].\n\n
The probability distribution for this generalized Power PC model is:\n\n
 PPC(E = 1|C, L; ω, ρ) = {1 − Π_{i=0}^n (1 − ωiCi)} Π_{j=1}^m (1 − ρjLj), (22)\n\n
where there are n + 1 generative causes {Ci} and m preventative causes {Lj}, specified in terms of parameters {ωi} and {ρj} (constrained to lie between 0 and 1).\n\n
We assume that there is a single background cause C0 which is always on (i.e. 
C0 = 1) and whose strength ω0 is known (for relaxing this constraint, see Yuille, in preparation).\n\n
Then it can be shown that the following GLRW algorithm will converge to the ML estimates of the remaining parameters {ωk : k = 1, ..., n} and {ρl : l = 1, ..., m} of the generalized Power PC model:\n\n
 ΔVk^t = αk Ck {Π_{i=1}^m (1 − Li)} {Π_{j=1, j≠k}^n (1 − Cj)} (E − ω0 − (1 − ω0)Vk^t),\n
 ΔUl^t = εl Ll {Π_{k=1, k≠l}^m (1 − Lk)} {Π_{j=1}^n (1 − Cj)} (E − ω0 − ω0Ul^t), (23)\n\n
where {Vk : k = 1, ..., n} and {Ul : l = 1, ..., m} are the weights.\n\n
The proof is straightforward algebra and is based on the following identity for binary variables: Π_j (1 − Lj)(1 − ρjLj) = Π_j (1 − Lj).\n\n
The GLRW algorithm (23) will also perform ML estimation for data generated by other probability distributions which share the same linear terms as the generalized Power PC model (i.e. the terms linear in the {ωi} and {ρj}). The convergence conditions and the convergence rates can be calculated using the techniques in Section (4).\n\n
These results all assume genericity conditions, so that none of the generative or preventative causes is either always on or always off (i.e. ruling out cases like [2]).\n\n
6 Conclusion\n\n
This paper introduced and studied generalized linear Rescorla-Wagner (GLRW) algorithms. We showed that two influential theories, ΔP and Power PC, for estimating causal effects can be implemented by the same GLRW, see (8). We obtained convergence results for GLRW, including classifying the fixed points, calculating the asymptotic fluctuations, and determining the convergence rates. Our results assume that the inputs to the GLRW are i.i.d. samples from an unknown empirical distribution Pemp(E, C). Observe that the fluctuations of GLRW can be removed by introducing damping coefficients which decrease over time. 
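For instance, replacing the fixed learning rate in the variant rule (8) by a decreasing schedule turns the weight into a running empirical mean, so the fluctuations shrink to zero instead of saturating (a one-cause sketch; ω1 = 0.7 and the schedule α_t = 1/t are illustrative assumptions):

```python
import numpy as np

# Variant Rescorla-Wagner update for a single cause with a damped
# learning rate a_t = 1/n, where n counts the updates so far. The weight
# then equals the running mean of E over the trials where the rule fires,
# so the asymptotic fluctuations vanish rather than saturating.
rng = np.random.default_rng(3)
w1 = 0.7                       # true causal power
V, n = 0.0, 0
for _ in range(200000):
    C1 = rng.random() < 0.5
    if C1:                     # update fires only when the cause is present
        E = rng.random() < w1
        n += 1
        V += (1.0/n) * (E - V)

print(V)   # close to w1, with error shrinking like 1/sqrt(n)
```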
Stochastic approximation theory [8] can then be used to give conditions for convergence.\n\n
More recent work (Yuille, in preparation) clarifies the class of maximum likelihood inference problems that can be "solved" by GLRW and by non-linear GLRW. In particular, we show that a non-linear RW can perform ML estimation for the non-generic case studied by Cheng. We also investigate similarities to Kalman filter models [9].\n\n
Acknowledgements\n\n
I thank Patricia Cheng, Peter Dayan and Yingnian Wu for helpful discussions. Anonymous referees gave useful feedback that has motivated a follow-up paper. This work was partially supported by an NSF SLC catalyst grant "Perceptual Learning and Brain Plasticity" NSF SBE-0350356.\n\n
References\n\n
[1]. B. A. Spellman. "Conditioning Causality". In D.R. Shanks, K.J. Holyoak, and D.L. Medin (eds). Causal Learning: The Psychology of Learning and Motivation, Vol. 34. San Diego, California. Academic Press. pp 167-206. 1996.\n\n
[2]. P. Cheng. "From Covariation to Causation: A Causal Power Theory". Psychological Review, 104, pp 367-405. 1997.\n\n
[3]. M. Buehner and P. Cheng. "Causal Induction: The Power PC Theory versus the Rescorla-Wagner Theory". In Proceedings of the 19th Annual Conference of the Cognitive Science Society. 1997.\n\n
[4]. J.B. Tenenbaum and T.L. Griffiths. "Structure Learning in Human Causal Induction". Advances in Neural Information Processing Systems 13. MIT Press. 2001.\n\n
[5]. D. Danks, T.L. Griffiths, and J.B. Tenenbaum. "Dynamical Causal Learning". Advances in Neural Information Processing Systems 15. 2003.\n\n
[6]. D. Danks. "Equilibria of the Rescorla-Wagner Model". Journal of Mathematical Psychology, Vol. 47, pp 109-121. 2003.\n\n
[7]. R.A. Rescorla and A.R. Wagner. "A Theory of Pavlovian Conditioning: Variations in the Effectiveness of Reinforcement and Nonreinforcement". In A.H. Black and W.F. Prokasy, eds. Classical Conditioning II: Current Research and Theory. 
New York. Appleton-Century-Crofts, pp 64-99. 1972.\n\n
[8]. H.J. Kushner and D.S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York. Springer-Verlag. 1978.\n\n
[9]. P. Dayan and S. Kakade. "Explaining Away in Weight Space". In Advances in Neural Information Processing Systems 13. 2001.\n", "award": [], "sourceid": 2574, "authors": [{"given_name": "Alan", "family_name": "Yuille", "institution": null}]}