{"title": "Structure Learning in Human Causal Induction", "book": "Advances in Neural Information Processing Systems", "page_first": 59, "page_last": 65, "abstract": null, "full_text": "Structure learning in human causal induction \n\nJoshua B. Tenenbaum & Thomas L. Griffiths \n\nDepartment of Psychology \n\nStanford University, Stanford, CA 94305 \n\n{jbt,gruffydd}@psych.stanford.edu \n\nAbstract \n\nWe use graphical models to explore the question of how people learn simple causal relationships from data. The two leading psychological theories can both be seen as estimating the parameters of a fixed graph. We argue that a complete account of causal induction should also consider how people learn the underlying causal graph structure, and we propose to model this inductive process as a Bayesian inference. Our argument is supported through the discussion of three data sets. \n\n1 Introduction \n\nCausality plays a central role in human mental life. Our behavior depends upon our understanding of the causal structure of our environment, and we are remarkably good at inferring causation from mere observation. Constructing formal models of causal induction is currently a major focus of attention in computer science [7], psychology [3,6], and philosophy [5]. This paper attempts to connect these literatures, by framing the debate between two major psychological theories in the computational language of graphical models. We show that existing theories equate human causal induction with maximum likelihood parameter estimation on a fixed graphical structure, and we argue that to fully account for human behavioral data, we must also postulate that people make Bayesian inferences about the underlying causal graph structure itself. 
\n\nPsychological models of causal induction address the question of how people learn associations between causes and effects, such as P(C→E), the probability that some event C causes outcome E. This question might seem trivial at first; why isn't P(C→E) simply P(e+|c+), the conditional probability that E occurs (E = e+ as opposed to e-) given that C occurs? But consider the following scenarios. Three case studies have been done to evaluate the probability that certain chemicals, when injected into rats, cause certain genes to be expressed. In case 1, levels of gene 1 were measured in 100 rats injected with chemical 1, as well as in 100 uninjected rats; cases 2 and 3 were conducted likewise but with different chemicals and genes. In case 1, 40 out of 100 injected rats were found to have expressed the gene, while 0 out of 100 uninjected rats expressed the gene. We will denote these results as {40/100, 0/100}. Case 2 produced the results {7/100, 0/100}, while case 3 yielded {53/100, 46/100}. For each case, we would like to know the probability that the chemical causes the gene to be expressed, P(C→E), where C denotes the chemical and E denotes gene expression. \n\nPeople typically rate P(C→E) highest for case 1, followed by case 2 and then case 3. In an experiment described below, these cases received mean ratings (on a 0-20 scale) of 14.9 ± .8, 8.6 ± .9, and 4.9 ± .7, respectively. Clearly P(C→E) ≠ P(e+|c+), because case 3 has the highest value of P(e+|c+) but receives the lowest rating for P(C→E). \n\nThe two leading psychological models of causal induction elaborate upon this basis in attempting to specify P(C→E). The ΔP model [6] claims that people estimate P(C→E) according to \n\nΔP = P(e+|c+) - P(e+|c-). (1) \n\n(We restrict our attention here to facilitatory causes, in which case ΔP is always between 0 and 1.) 
Equation 1 captures the intuition that C is perceived to cause E to the extent that C's occurrence increases the likelihood of observing E. Recently, Cheng [3] has identified several shortcomings of ΔP and proposed that P(C→E) instead corresponds to causal power, the probability that C produces E in the absence of all other causes. Formally, the power model can be expressed as: \n\npower = ΔP / (1 - P(e+|c-)). (2) \n\nThere are a variety of normative arguments in favor of either of these models [3,7]. Empirically, however, neither model is fully adequate to explain human causal induction. We will present ample evidence for this claim below, but for now, the basic problem can be illustrated with the three scenarios above. While people rate P(C→E) higher for case 2, {7/100, 0/100}, than for case 3, {53/100, 46/100}, ΔP rates them equally and the power model ranks case 3 over case 2. To understand this discrepancy, we have to distinguish between two possible senses of P(C→E): \"the probability that C causes E (on any given trial when C is present)\" versus \"the probability that C is a cause of E (in general, as opposed to being causally independent of E)\". Our claim is that the ΔP and power models concern only the former sense, while people's intuitions about P(C→E) are often concerned with the latter. In our example, while the effect of C on any given trial in case 3 may be equal to (according to ΔP) or stronger than (according to power) its effect in case 2, the general pattern of results seems more likely in case 2 than in case 3 to be due to a genuine causal influence, as opposed to a spurious correlation between random samples of two independent variables. In the following section, we formalize this distinction in terms of parameter estimation versus structure learning on a graphical model. 
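The contrast is easy to check numerically. The following sketch (our own illustration, not code from the paper; function names are ours) computes ΔP (Equation 1) and causal power (Equation 2) for the three gene-expression cases:

```python
def delta_p(p_e_given_c, p_e_given_notc):
    """Delta-P (Eq. 1): P(e+|c+) - P(e+|c-)."""
    return p_e_given_c - p_e_given_notc

def causal_power(p_e_given_c, p_e_given_notc):
    """Causal power (Eq. 2): Delta-P / (1 - P(e+|c-))."""
    return delta_p(p_e_given_c, p_e_given_notc) / (1.0 - p_e_given_notc)

# The three cases, as {P(e+|c+), P(e+|c-)} estimated from the rat samples.
cases = {1: (0.40, 0.00), 2: (0.07, 0.00), 3: (0.53, 0.46)}
for k, (p1, p0) in sorted(cases.items()):
    print(k, round(delta_p(p1, p0), 4), round(causal_power(p1, p0), 4))
```

ΔP assigns cases 2 and 3 the same value (0.07), while power ranks case 3 (about 0.13) above case 2 (0.07); people rank case 2 above case 3, which neither model reproduces.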
Section 3 then compares two variants of our structure learning model with the parameter estimation models (ΔP and power) in light of data from three experiments on human causal induction. \n\n2 Graphical models of causal induction \n\nThe language of causal graphical models provides a useful framework for thinking about people's causal intuitions [5,7]. All the induction models we consider here can be viewed as computations on a simple directed graph (Graph 1 in Figure 1). The effect node E is the child of two binary-valued parent nodes: C, the putative cause, and B, a constant background. Let X = (C1, E1), ..., (CN, EN) denote a sequence of N trials in which C and E are each observed to be present or absent; B is assumed to be present on all trials. (To keep notation concise in this section, we use 1 or 0 in addition to + or - to denote presence or absence of an event, e.g. Ci = 1 if the cause is present on the ith trial.) Each parent node is associated with a parameter, wB or wC, that defines the strength of its effect on E. In the ΔP model, the probability of E occurring is a linear function of C: \n\nQ(e+|c; wB, wC) = wB + wC · c. (3) \n\n(We use Q to denote model probabilities and P for empirical probabilities in the sample X.) In the causal power model, as first shown by Glymour [5], E is a noisy-OR gate: \n\nQ(e+|c; wB, wC) = 1 - (1 - wB)(1 - wC)^c. (4) \n\n2.1 Parameter inferences: ΔP and Causal Power \n\nIn this framework, both the ΔP and power model's predictions for P(C→E) can be seen as maximum likelihood estimates of the causal strength parameter wC in Graph 1, but under different parameterizations. For either model, the log-likelihood of the data is given by \n\nL(X|wB, wC) = Σ_{i=1}^N log [Q(ei|ci)^{ei} (1 - Q(ei|ci))^{1-ei}] (5) \n= Σ_{i=1}^N ei log Q(ei|ci) + (1 - ei) log(1 - Q(ei|ci)), (6) \n\nwhere we have suppressed the dependence of Q(ei|ci) on wB, wC. 
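As a concrete reading of Equations 3-6, the log-likelihood can be computed from trial counts. This is our own sketch (the count representation and function names are ours), assuming binary c and e:

```python
import math

def q_linear(c, w_b, w_c):
    """Eq. 3: linear parameterization of Graph 1."""
    return w_b + w_c * c

def q_noisy_or(c, w_b, w_c):
    """Eq. 4: noisy-OR parameterization of Graph 1."""
    return 1.0 - (1.0 - w_b) * (1.0 - w_c) ** c

def log_likelihood(counts, q, w_b, w_c):
    """Eqs. 5-6, with trials grouped by outcome.

    counts maps (c, e) pairs to the number of trials with that outcome.
    """
    total = 0.0
    for (c, e), n in counts.items():
        if n == 0:
            continue  # skip empty cells so log(0) never arises for them
        p = q(c, w_b, w_c)
        total += n * (math.log(p) if e == 1 else math.log(1.0 - p))
    return total

# Case 1 from Section 1: 40/100 with the chemical, 0/100 without.
case1 = {(1, 1): 40, (1, 0): 60, (0, 1): 0, (0, 0): 100}
```

Maximizing this function over (wB, wC) under either parameterization yields the ΔP and power predictions, as the text shows next.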
Breaking this sum into four parts, one for each possible combination of {c+, c-} and {e+, e-} that could be observed, L(X|wB, wC) can be written as \n\nN P(c+) [P(e+|c+) log Q(e+|c+) + (1 - P(e+|c+)) log(1 - Q(e+|c+))] \n+ N P(c-) [P(e+|c-) log Q(e+|c-) + (1 - P(e+|c-)) log(1 - Q(e+|c-))]. (7) \n\nBy the information inequality [4], Equation 7 is maximized whenever wB and wC can be chosen to make the model probabilities equal to the empirical probabilities: \n\nQ(e+|c+; wB, wC) = P(e+|c+), (8) \nQ(e+|c-; wB, wC) = P(e+|c-). (9) \n\nTo show that the ΔP model's predictions for P(C→E) correspond to maximum likelihood estimates of wC under a linear parameterization of Graph 1, we identify wC in Equation 3 with ΔP (Equation 1), and wB with P(e+|c-). Equation 3 then reduces to P(e+|c+) for the case c = c+ (i.e., c = 1) and to P(e+|c-) for the case c = c- (i.e., c = 0), thus satisfying the sufficient conditions in Equations 8-9 for wB and wC to be maximum likelihood estimates. To show that the causal power model's predictions for P(C→E) correspond to maximum likelihood estimates of wC under a noisy-OR parameterization, we follow the analogous procedure: identify wC in Equation 4 with power (Equation 2), and wB with P(e+|c-). Then Equation 4 reduces to P(e+|c+) for c = c+ and to P(e+|c-) for c = c-, again satisfying the conditions for wB and wC to be maximum likelihood estimates. \n\n2.2 Structural inferences: Causal Support and χ2 \n\nThe central claim of this paper is that people's judgments of P(C→E) reflect something other than estimates of causal strength parameters - the quantities that we have just shown to be computed by ΔP and the power model. Rather, people's judgments may correspond to inferences about the underlying causal structure, such as the probability that C is a direct cause of E. 
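This identification can be checked numerically; the sketch below (our own, using the case-3 contingencies {0.53, 0.46} for illustration) substitutes the closed-form wB and wC into Equations 3 and 4 and confirms that both reproduce the empirical conditionals required by Equations 8-9:

```python
def ml_conditionals(p1, p0):
    """Return (Q(e+|c+), Q(e+|c-)) under both parameterizations, with
    wB = P(e+|c-) and wC set to Delta-P (linear, Eq. 3) or to causal
    power (noisy-OR, Eq. 4)."""
    # Linear parameterization: wC = Delta-P.
    w_b, w_c = p0, p1 - p0
    linear = (w_b + w_c, w_b)
    # Noisy-OR parameterization: wC = power = Delta-P / (1 - P(e+|c-)).
    w_b, w_c = p0, (p1 - p0) / (1.0 - p0)
    noisy_or = (1.0 - (1.0 - w_b) * (1.0 - w_c), w_b)
    return linear, noisy_or

linear, noisy_or = ml_conditionals(0.53, 0.46)
print(linear, noisy_or)  # both recover (0.53, 0.46) up to float rounding
```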
In terms of the graphical model in Figure 1, human causal induction may be focused on trying to distinguish between Graph 1, in which C is a parent of E, and the \"null hypothesis\" of Graph 0, in which C is not. \n\nThis structural inference can be formalized as a Bayesian decision. Let hC be a binary variable indicating whether or not the link C → E exists in the true causal model responsible for generating our observations. We will assume a noisy-OR gate, and thus our model is closely related to causal power. However, we propose to model human estimates of P(C→E) as causal support, the log posterior odds in favor of Graph 1 (hC = 1) over Graph 0 (hC = 0): \n\nsupport = log [P(hC = 1|X) / P(hC = 0|X)]. (10) \n\nVia Bayes' rule, we can express P(hC = 1|X) in terms of the marginal likelihood or evidence, P(X|hC = 1), and the prior probability that C is a cause of E, P(hC = 1): \n\nP(hC = 1|X) ∝ P(X|hC = 1) P(hC = 1). (11) \n\nFor now, we take P(hC = 1) = P(hC = 0) = 1/2. Computing the evidence requires integrating the likelihood P(X|wB, wC) over all possible values of the strength parameters: \n\nP(X|hC = 1) = ∫∫ P(X|wB, wC) p(wB, wC|hC = 1) dwB dwC. (12) \n\nWe take p(wB, wC|hC = 1) to be a uniform density, and we note that P(X|wB, wC) is simply the exponential of L(X|wB, wC) as defined in Equation 5. P(X|hC = 0), the marginal likelihood for Graph 0, is computed similarly, but with the prior p(wB, wC|hC = 1) in Equation 12 replaced by p(wB|hC = 0)δ(wC). We again take p(wB|hC = 0) to be uniform. The Dirac delta distribution on wC = 0 enforces the restriction that the C → E link is absent. By making these assumptions, we eliminate the need for any free numerical parameters in our probabilistic model (in contrast to a similar Bayesian account proposed by Anderson [1]). 
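Causal support has no free parameters but does require the integral in Equation 12. A simple grid approximation is enough for the two-parameter case; this is our own numerical sketch (the grid resolution is an arbitrary choice, not the paper's), with the noisy-OR likelihood of Equation 4:

```python
import math

def likelihood(counts, w_b, w_c):
    """P(X | wB, wC) for the noisy-OR model, from outcome counts.

    counts = (n_c1e1, n_c1e0, n_c0e1, n_c0e0).
    """
    p1 = 1.0 - (1.0 - w_b) * (1.0 - w_c)  # Q(e+|c+), Eq. 4
    p0 = w_b                              # Q(e+|c-)
    n11, n10, n01, n00 = counts

    def bern(p, n_pos, n_neg):
        if (p <= 0.0 and n_pos > 0) or (p >= 1.0 and n_neg > 0):
            return 0.0
        return p ** n_pos * (1.0 - p) ** n_neg

    return bern(p1, n11, n10) * bern(p0, n01, n00)

def causal_support(counts, grid=101):
    """Log posterior odds of Graph 1 over Graph 0 (Eq. 10), with uniform
    priors on the strength parameters and equal prior odds on the graphs."""
    ws = [i / (grid - 1) for i in range(grid)]
    # Graph 1: integrate over both wB and wC (Eq. 12, uniform prior).
    ev1 = sum(likelihood(counts, wb, wc) for wb in ws for wc in ws) / grid ** 2
    # Graph 0: the delta prior pins wC = 0; integrate over wB only.
    ev0 = sum(likelihood(counts, wb, 0.0) for wb in ws) / grid
    return math.log(ev1 / ev0)
```

On the three cases of Section 1 this gives support(case 1) > support(case 2) > support(case 3), matching the ordering of people's ratings.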
\n\nBecause causal support depends on the full likelihood functions for both Graph 1 and Graph 0, we may expect the support model to be modulated by causal power - which is based strictly on the maximum likelihood estimate for Graph 1 - but only in interaction with other factors that determine how much of the posterior probability mass for wC in Graph 1 is bounded away from zero (where it is pinned in Graph 0). In general, evaluating causal support may require fairly involved computations, but in the limit of large N and weak causal strength wC, it can be approximated by the familiar χ2 statistic for independence, N Σ_{c,e} (P(c,e) - P0(c,e))^2 / P0(c,e). Here P0(c,e) = P(c)P(e) is the factorized approximation to P(c,e), which assumes C and E to be independent (as they are in Graph 0). \n\n3 Comparison with experiments \n\nIn this section we examine the strengths and weaknesses of the two parameter inference models, ΔP and causal power, and the two structural inference models, causal support and χ2, as accounts of data from three behavioral experiments, each designed to address different aspects of human causal induction. To compensate for possible nonlinearities in people's use of numerical rating scales on these tasks, all model predictions have been scaled by power-law transformations, f(x) = sign(x)|x|^γ, with γ chosen separately for each model and each data set to maximize their linear correlation. In the figures, predictions are expressed over the same range as the data, with minimum and maximum values aligned. \n\nFigure 2 presents data from a study by Buehner & Cheng [2], designed to contrast the predictions of ΔP and causal power. People judged P(C→E) for hypothetical medical studies much like the gene expression scenarios described above, seeing eight cases in which C occurred and eight in which C did not occur. 
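Before turning to the data, the χ2 approximation from the end of Section 2.2 can be sketched the same way (our own code; it assumes all four marginal probabilities are nonzero):

```python
def chi_square(n11, n10, n01, n00):
    """N * sum_{c,e} (P(c,e) - P0(c,e))^2 / P0(c,e), where
    P0(c,e) = P(c)P(e) assumes C and E independent (Graph 0).
    Assumes nonzero marginals so no P0 cell is zero."""
    n = n11 + n10 + n01 + n00
    p_c = (n11 + n10) / n
    p_e = (n11 + n01) / n
    stat = 0.0
    for n_ce, pc, pe in ((n11, p_c, p_e), (n10, p_c, 1 - p_e),
                         (n01, 1 - p_c, p_e), (n00, 1 - p_c, 1 - p_e)):
        p = n_ce / n          # empirical P(c,e)
        p0 = pc * pe          # factorized approximation P0(c,e)
        stat += n * (p - p0) ** 2 / p0
    return stat
```

For case 2, {7/100, 0/100}, this gives χ2 of roughly 7.25; for a balanced table with P(e+|c+) = P(e+|c-) it is exactly 0, which is where the approximation departs from causal support.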
Some trends in the data are clearly captured by the causal power model but not by ΔP, such as the monotonic decrease in P(C→E) from {1.00, 0.75} to {0.25, 0.00}, as ΔP stays constant but P(e+|c-) (and hence power) decreases (columns 6-9). Other trends are clearly captured by ΔP but not by the power model, like the monotonic increase in P(C→E) as P(e+|c+) stays constant at 1.0 but P(e+|c-) decreases, from {1.00, 1.00} to {1.00, 0.00} (columns 1, 6, 10, 13, 15). However, one of the most salient trends is captured by neither model: the decrease in P(C→E) as ΔP stays constant at 0 but P(e+|c-) decreases (columns 1-5). The causal support model predicts this decrease, as well as the other trends. The intuition behind the model's predictions for ΔP = 0 is that decreasing the base rate P(e+|c-) increases the opportunity to observe the cause's influence and thus increases the statistical force behind the inference that C does not cause E, given ΔP = 0. This effect is most obvious when P(e+|c+) = P(e+|c-) = 1, yielding a ceiling effect with no statistical leverage [3], but also occurs to a lesser extent for P(e+|c) < 1. While χ2 generally approximates the support model rather well, it also fails to explain the cases with P(e+|c+) = P(e+|c-), which always yield χ2 = 0. The superior fit of the support model is reflected in its correlation with the data, giving R2 = 0.95, while the power, ΔP, and χ2 models gave R2 values of 0.81, 0.82, and 0.82, respectively. \n\nFigure 3 shows results from an experiment conducted by Lober and Shanks [6], designed to explore the trend in Buehner and Cheng's experiment that was predicted by ΔP but not by the power model. Columns 4-7 replicated the monotonic increase in P(C→E) when P(e+|c+) remains constant at 1.0 but P(e+|c-) decreases, this time with 28 cases in which C occurred and 28 in which C did not occur. 
Columns 1-3 show a second situation in which the predictions of the power model are constant, but judgments of P(C→E) increase. Columns 8-10 feature three scenarios with equal ΔP, for which the causal power model predicts a decreasing trend. These effects were explored by presenting a total of 60 trials, rather than the 56 used in columns 4-7. For each of these trends the ΔP model outperforms the causal power model, with overall R2 values of 0.95 and 0.35 respectively. However, it is important to note that the responses of the human subjects in columns 8-10 (contingencies {1.00, 0.60}, {0.80, 0.40}, {0.40, 0.00}) are not quite consistent with the predictions of ΔP: they show a slight V-shaped non-linearity, with P(C→E) judged to be smaller for {0.80, 0.40} than for either of the extreme cases. This trend is predicted by the causal support model and its χ2 approximation, however, which both give the slightly better R2 of 0.99. \n\nFigure 4 shows data that we collected in a similar survey, aiming to explore this non-linear effect in greater depth. 35 students in an introductory psychology class completed the survey for partial course credit. They each provided a judgment of P(C→E) in 14 different medical scenarios, where information about P(e+|c+) and P(e+|c-) was provided in terms of how many mice from a sample of 100 expressed a particular gene. Columns 1-3, 5-7, and 9-11 show contingency structures designed to elicit V-shaped trends in P(C→E). Columns 4 and 8 give intermediate values, also consistent with the observed non-linearity. Column 14 attempted to explore the effects of manipulating sample size, with a contingency structure of {7/7, 93/193}. In each case, we observed the predicted nonlinearity: in a set of situations with the same ΔP, the situations involving less extreme probabilities show reduced judgments of P(C→E). 
These non-linearities are not consistent with the ΔP model, but are predicted by both causal support and χ2. ΔP actually achieves a correlation comparable to χ2 (R2 = 0.92 for both models) because the non-linear effects contribute only weakly to the total variance. The support model gives a slightly worse fit than χ2, R2 = 0.80, while the power model gives a poor account of the data, R2 = 0.38. \n\n4 Conclusions and future directions \n\nIn each of the studies above, the structural inference models based on causal support or χ2 consistently outperformed the parameter estimation models, ΔP and causal power. While causal power and ΔP were each capable of capturing certain trends in the data, causal support was the only model capable of predicting all the trends. For the third data set, χ2 provided a significantly better fit to the data than did causal support. This finding merits future investigation in a study designed to tease apart χ2 and causal support; in any case, due to the close relationship between the two models, this result does not undermine our claim that probabilistic structural inferences are central to human causal induction. \n\nOne unique advantage of the Bayesian causal support model is its ability to draw inferences from very few observations. We have begun a line of experiments, inspired by Gopnik, Sobel & Glymour (submitted), to examine how adults revise their causal judgments when given only one or two observations, rather than the large samples used in the above studies. In one study, subjects were faced with a machine that would inform them whether a pencil placed upon it contained \"superlead\" or ordinary lead. Subjects were either given prior knowledge that superlead was rare or that it was common. 
They were then given two pencils, analogous to B and C in Figure 1, and asked to rate how likely these pencils were to have superlead, that is, to cause the detector to activate. Mean responses reflected the induced prior. Next, they were shown that the superlead detector responded when B and C were tested together, and their causal ratings of both B and C increased. Finally, they were shown that B set off the superlead detector on its own, and causal ratings of B increased to ceiling while ratings of C returned to their prior levels. This situation is exactly analogous to that explored in the medical tasks described above, and people were able to perform accurate causal inductions given only one trial of each type. Of the models we have considered, only Bayesian causal support can explain this behavior, by allowing the prior in Equation 11 to adapt depending on whether superlead is rare or common. \n\nWe also hope to look at inferences about more complex causal structures, including those with hidden variables. With just a single cause, causal support and χ2 are highly correlated, but with more complex structures, the Bayesian computation of causal support becomes increasingly intractable while the χ2 approximation becomes less accurate. Through experiments with more complex structures, we hope to discover where and how human causal induction strikes a balance between ponderous rationality and efficient heuristic. \n\nFinally, we should stress that despite the superior performance of the structural inference models here, in many situations estimating causal strength parameters is likely to be just as important as inferring causal structure. Our hope is that by using graphical models to relate and extend upon existing accounts of causal induction, we have provided a framework for exploring the interplay between the different kinds of judgments that people make. \n\nReferences \n\n[1] J. 
Anderson (1990). The adaptive character of thought. Erlbaum. \n[2] M. Buehner & P. Cheng (1997). Causal induction: The power PC theory versus the Rescorla-Wagner theory. In Proceedings of the 19th Annual Conference of the Cognitive Science Society. \n[3] P. Cheng (1997). From covariation to causation: A causal power theory. Psychological Review 104, 367-405. \n[4] T. Cover & J. Thomas (1991). Elements of information theory. Wiley. \n[5] C. Glymour (1998). Learning causes: Psychological explanations of causal explanation. Minds and Machines 8, 39-60. \n[6] K. Lober & D. Shanks (2000). Is causal induction based on causal power? Critique of Cheng (1997). Psychological Review 107, 195-212. \n[7] J. Pearl (2000). Causality. Cambridge University Press. \n\nFigure 1: Different theories of human causal induction expressed as different operations on a simple graphical model. Graph 1 (hC = 1) contains the link C → E; Graph 0 (hC = 0) does not. Each model's form of P(e|b,c) and prediction for P(C→E): ΔP: linear, wC. Power: noisy OR-gate, wC. Support: noisy OR-gate, log[P(hC = 1)/P(hC = 0)]. The ΔP and power models correspond to maximum likelihood parameter estimates on a fixed graph (Graph 1), while the support model corresponds to a (Bayesian) inference about which graph is the true causal structure. \n\nFigure 2 stimulus contingencies (15 columns): P(e+|c+) = 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 1.00 0.75 0.50 1.00 0.75 1.00; P(e+|c-) = 1.00 0.75 0.50 0.25 0.00 0.75 0.50 0.25 0.00 0.50 0.25 0.00 0.25 0.00 0.00. \n\nFigure 3 stimulus contingencies (11 columns): P(e+|c+) = 0.90 0.80 0.70 1.00 1.00 1.00 1.00 1.00 0.80 0.40 0.90; P(e+|c-) = 0.66 0.33 0.00 0.75 0.50 0.25 0.00 0.60 0.40 0.00 0.83. 
\n\nFigure 3: Computational models compared with the performance of human participants from Lober and Shanks [6], Experiments 4-6. \n\nFigure 2: Computational models compared with the performance of human participants from Buehner and Cheng [2], Experiment 1B. Numbers along the top of the figure show stimulus contingencies. \n\nFigure 4: Computational models compared with the performance of human participants on a set of stimuli designed to elicit the non-monotonic trends shown in the data of Lober and Shanks [6]. Stimulus contingencies (14 columns): P(e+|c+) = 0.40 0.70 1.00 0.90 0.07 0.53 1.00 0.74 0.02 0.51 1.00 0.10 1.00 1.00; P(e+|c-) = 0.00 0.30 0.60 0.83 0.00 0.46 0.93 0.72 0.00 0.49 0.98 0.10 1.00 0.48. \n", "award": [], "sourceid": 1845, "authors": [{"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}, {"given_name": "Thomas", "family_name": "Griffiths", "institution": null}]}