{"title": "Characterization and Learning of Causal Graphs with Latent Variables from Soft Interventions", "book": "Advances in Neural Information Processing Systems", "page_first": 14369, "page_last": 14379, "abstract": "The challenge of learning the causal structure underlying a certain phenomenon is undertaken by connecting the set of conditional independences (CIs) readable from the observational data, on the one side, with the set of corresponding constraints implied over the graphical structure, on the other, which are tied through a graphical criterion known as d-separation (Pearl, 1988). In this paper, we investigate the more general scenario where multiple observational and experimental distributions are available. We start with the simple observation that the invariances given by CIs/d-separation are just one special type of a broader set of constraints, which follow from the careful comparison of the different distributions available. Remarkably, these new constraints are intrinsically connected with do-calculus (Pearl, 1995) in the context of soft-interventions. We introduce a novel notion of interventional equivalence class of causal graphs with latent variables based on these invariances, which associates each graphical structure with a set of interventional distributions that respect the do-calculus rules. Given a collection of distributions, two causal graphs are called interventionally equivalent if they are associated with the same family of interventional distributions, where the elements of the family are indistinguishable using the invariances obtained from a direct application of the calculus rules. We introduce a graphical representation that can be used to determine if two causal graphs are interventionally equivalent. We provide a formal graphical characterization of this equivalence. 
Characterization and Learning of Causal Graphs with Latent Variables from Soft Interventions

Murat Kocaoglu* (MIT-IBM Watson AI Lab, IBM Research MA, USA) murat@ibm.com
Karthikeyan Shanmugam* (MIT-IBM Watson AI Lab, IBM Research NY, USA) karthikeyan.shanmugam2@ibm.com
Amin Jaber* (Department of Computer Science, Purdue University, USA) jaber0@purdue.edu
Elias Bareinboim (Department of Computer Science, Columbia University, USA) eb@cs.columbia.edu

Abstract

The challenge of learning the causal structure underlying a certain phenomenon is undertaken by connecting the set of conditional independences (CIs) readable from the observational data, on the one side, with the set of corresponding constraints implied over the graphical structure, on the other, which are tied through a graphical criterion known as d-separation (Pearl, 1988). In this paper, we investigate the more general setting where multiple observational and experimental distributions are available. We start with the simple observation that the invariances given by CIs/d-separation are just one special type of a broader set of constraints, which follow from the careful comparison of the different distributions available. Remarkably, these new constraints are intrinsically connected with do-calculus (Pearl, 1995) in the context of soft interventions. We then introduce a novel notion of interventional equivalence class of causal graphs with latent variables based on these invariances, which associates each graphical structure with a set of interventional distributions that respect the do-calculus rules.
Given a collection of distributions, two causal graphs are called interventionally equivalent if they are associated with the same family of interventional distributions, where the elements of the family are indistinguishable using the invariances obtained from a direct application of the calculus rules. We introduce a graphical representation that can be used to determine if two causal graphs are interventionally equivalent. We provide a formal graphical characterization of this equivalence. Finally, we extend the FCI algorithm, which was originally designed to operate based on CIs, to combine observational and interventional datasets, including new orientation rules particular to this setting.

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Explaining a complex system through its cause and effect relations is one of the fundamental challenges in science. Data is collected and experiments are performed with the intent of understanding how a certain phenomenon comes about, or how the underlying system works, which could be social, biological, or artificial, among others. The study of causal relations can be seen through the lens of learning and inference [16, 21]. The learning component is concerned with discovering the causal structure, which is the very subject of interest in many domains, since it can provide insight about how a complex system works and lead to a better understanding of the phenomenon under investigation. The latter, inference, attempts to leverage the causal structure to compute quantitative claims about the effect of interventions and retrospective counterfactuals, which are critical to assign credit, understand blame and responsibility, and perform judgement about fairness in decision-making.

[Figure 1: three panels over the nodes X, Z, Y and the F-nodes Fx, Fx,z, Fz — (a) Causal Diagram D; (b) Aug(D) for P and Px; (c) Aug(D) for P, Px, and Px,z.]

Figure 1: (a) Causal graph where the bidirected edge represents a latent confounder. (b) Given Px and P, we can use Fx to capture information such as "there is a backdoor path from X to Y" in terms of the m-separation statement Fx ⊥̸⊥ Y | X. (c) Given P, Px, and Px,z, under the controlled experiment assumption, we can add Fz although Pz is not available. This allows us to discover that Z is a cause of Y and that there is no confounder between them. Without adding Fz, this relation cannot be identified.

One of the most popular languages used to encode the invariances needed to reason about causal relations, for both learning and inference, is based on graphical models and appears under the rubric of causal graphs [16, 21, 2]. A causal graph is a directed acyclic graph (DAG) with latent variables, where each edge encodes a causal relationship between its endpoints: X is a direct cause of Y, i.e., X → Y, if, when the remaining factors are held constant, forcing X to take a specific value affects the realization of Y, where X, Y are random variables representing some relevant features of the system. The task of learning the causal structure entails a search over the space of causal graphs that are compatible with the observed data; the collection of these graphs forms what is called an equivalence class. The most popular mark imprinted on the data by the underlying causal structure that is used to delineate an equivalence class is the set of conditional independence (CI) relations. These relations are the most basic type of probabilistic invariances used in the field and have been studied at large in the context of graphical models since, at least, [15] (see also [5]).
While CIs are powerful and have been the driving force behind some of the most prominent structure learning algorithms in the field [16, 21], including PC and FCI, these are constraints specific to one distribution.

In this paper, we start by noting something very simple, albeit powerful, that happens when a combination of observational and experimental distributions is available: there are constraints over the graphical structure that emerge by comparing these different distributions, and which are not of CI-type². Remarkably, and unknown until our work, the converse of the causal calculus developed by Pearl [18] offers a systematic way of reading these constraints and tying them back to the underlying graphical structure. In reference to their connection to the do-calculus rules (or a generalization, as discussed later), we call these constraints the do-constraints. For concreteness, consider the graph in Fig. 1(a), where the dashed bidirected arrow represents hidden variables that generate variations of the two observed variables, X and Y in this case. Suppose the observational (conditional) distribution and an interventional distribution on X are available, written as P(y|x) and P(y|do(x)), respectively. Suppose we contrast these two distributions and the test evaluating the expression P(y|do(x)) = P(y|x) comes out as false. This is called a do-see test, since the experimental (or "do") and observational ("see") distributions are contrasted. Based on the second rule of do-calculus, one can infer that there is an open backdoor path from X to Y, where the edge adjacent to X on this path has an arrowhead into X. In our setting, we do not have access to the true graph, but we leverage this and the other do-constraints to reverse engineer the process and try to learn the structure.
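To make the do-see test concrete, here is a minimal numeric sketch (ours, not the paper's; the binary model and all parameter values are hypothetical) that compares P(y|x) against Px(y|x) under a latent confounder, by exact enumeration:

```python
from itertools import product

# Hypothetical binary SCM for the graph X -> Y with a latent confounder U
# (the dashed bidirected edge of Fig. 1(a)). All parameters are made up.
p_u = {0: 0.5, 1: 0.5}
p_x_given_u = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # P(X=x | U=u)
p_y_given_xu = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.4, 1: 0.6},
                (1, 0): {0: 0.7, 1: 0.3}, (1, 1): {0: 0.2, 1: 0.8}}
# Soft intervention on X: replace P(X|U) by a new mechanism P'(X) that
# ignores U but leaves the edge X -> Y intact.
q_x = {0: 0.5, 1: 0.5}

def joint(px_mech, depends_on_u):
    """Enumerate the joint P(U, X, Y) under the given mechanism for X."""
    table = {}
    for u, x, y in product((0, 1), repeat=3):
        px = px_mech[u][x] if depends_on_u else px_mech[x]
        table[(u, x, y)] = p_u[u] * px * p_y_given_xu[(x, u)][y]
    return table

def cond_y_given_x(table, x, y):
    num = sum(p for (u, xx, yy), p in table.items() if xx == x and yy == y)
    den = sum(p for (u, xx, yy), p in table.items() if xx == x)
    return num / den

obs = joint(p_x_given_u, True)   # observational P
intv = joint(q_x, False)         # interventional P_x (soft)

# Do-see test: does P_x(y | x) = P(y | x)?  Here it fails, revealing an
# open backdoor path into X (the latent confounder).
print(cond_y_given_x(obs, 1, 1), cond_y_given_x(intv, 1, 1))
```

Because X's observational mechanism depends on U while the intervened mechanism ignores it, conditioning on X = 1 induces different posteriors over U, so the two conditionals disagree; for the confounder-free graph X → Y they would coincide.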
Broadly speaking, do-constraints will play a critical role for learning, in the same way that CI/d-separation does when only observational data is available. To the best of our knowledge, this type of constraint appeared first in the very definition of causal Bayesian networks (CBNs) in [1] and was then leveraged to design efficient experiments to learn the causal graph in [12].

We assume throughout this work that interventions are soft. A soft intervention affects the mechanism that generates the variable, while keeping the causal connections intact. Soft interventions are widely employed in biology and medicine, where it is hard to change the underlying system, but possibly easier to perturb it. For our characterization, we utilize an extension of the causal calculus to soft interventions introduced in [4]. Under soft interventions, the do-see test can be written as checking whether Px(y|x) = P(y|x), where Px is the distribution obtained after a soft intervention on X.

The second observation leveraged here follows from another realization by Pearl, namely that interventions can be represented explicitly in the graphical model [17]. He introduced what we call F-nodes, which graphically encode the changes due to an intervention and the corresponding parametrization (see also [16, Sec. 3.2.2]). This is important in our context since the do-calculus tests will be visible more explicitly in the graph. The graph obtained by adding F-nodes to the causal graph is called the augmented graph. The same construct was used more prominently in [6] in the context of inference and identification.

² Recall that a CI represents a constraint readable from one specific distribution saying that the value of Z is irrelevant for computing the likelihood of Y once we know the value of X, i.e., P(Y|X, Z) = P(Y|X), ∀X, Y, Z.
Going back to Fig. 1b, the existence of the backdoor path from X to Y, as detected by rule 2 of the calculus, can be captured by the statement "Fx is not d-separated from Y given X". In the context of structure learning, similar constructions have been leveraged in the literature [13, 24].

We further make a specific assumption throughout the paper about the soft interventions, which we call the controlled experiment setting: each variable is intervened on with the same mechanism change across different interventions. For example, in Fig. 1c, suppose we are given distributions from two controlled experiments, Px and Px,z, along with observational data. We can then use Fz to capture the invariances between Px,z and Px. For example, if Px,z(y) ≠ Px(y) for some y, we can read that Fz ⊥̸⊥ Y | Fx, Fx,z. Accordingly, given a set of interventional distributions, we construct an augmented graph by introducing an F-node for every unique set difference between pairs of controlled intervention sets (more on that later on). Without the controlled experiment assumption, our machinery can still be used if one knows which mechanism changes are identical, by constructing F-nodes to reflect and capture the mechanism difference across two interventions. For simplicity of presentation, however, we restrict ourselves to the controlled experiment setting and do not pursue this route explicitly.

To encapsulate the distributional invariants directly induced by the causal calculus rules³, we call a set of interventional distributions I-Markov to a graph if these distributions respect the causal calculus rules relative to that graph. Note that the notion of I-Markov was first introduced in [9, 10] for causally sufficient systems, without the use of do-constraints⁴. For our characterization, we first extend the causal calculus rules to operate between arbitrary sets of interventions.
We call two causal graphs D1, D2 I-Markov equivalent if the sets of distributions that are I-Markov to D1 and to D2 are the same. Using the augmented graph, we identify a graphical condition that is necessary and sufficient for two CBNs with latents to be I-Markov equivalent. Finally, we propose a sound algorithm for learning the augmented graph from interventional data. Our contributions can be summarized as follows:

• We propose a characterization of I-Markov equivalence between two causal graphs with latent variables for a given intervention set I that is based on a generalization of the do-calculus rules to arbitrary subsets of interventions.
• We show a graphical characterization of I-Markov equivalence of causal graphs with latents.
• We introduce a learning algorithm for inferring the graphical structure using a combination of observational and interventional data and utilizing the corresponding new constraints. This procedure comes with a new set of orientation rules. We formally show its soundness.

2 Background and Related Work

In this section, we introduce the necessary concepts that we use throughout the paper. Upper case letters denote variables and lower case letters denote an assignment. Also, bold letters denote sets.

Causal Bayesian Network (CBN): Let P(v) be a probability distribution over a set of variables V, and let Px(v) denote the distribution resulting from the hard intervention do(X = x), which sets X ⊆ V to constants x. Let P* denote the set of all interventional distributions Px(v), for all X ⊆ V, including P(V). A directed acyclic graph (DAG) over V is said to be a causal Bayesian network compatible with P* if and only if, for all X ⊆ V,

$$P_{x}(v) = \prod_{\{i \mid V_i \notin X\}} P(v_i \mid pa_i),$$

for all v consistent with x, and where Pa_i is the set of parents of V_i [16, 1, pp. 24]. If so, we refer to the DAG as causal.

³ There may be constraints that can be obtained by applying the rules multiple times, which we do not consider here.
⁴ In the causally sufficient case, the name is in reference to both the global and local Markov conditions. In our work, however, the name stems from our observation that the do-constraints correspond to the global Markov conditions in the augmented graph.

Given that a subset of the variables is unmeasured or latent, D(V ∪ L, E) represents the causal graph where V and L denote the measured and latent variables, respectively, and E denotes the edges. A dashed bidirected edge is used instead of ← L →, where L ∈ L, whenever L is a root node with exactly two children. The observed distribution P(v) is obtained by marginalizing L out:

$$P(v) = \sum_{L} \prod_{\{i \mid T_i \in V \cup L\}} P(t_i \mid pa_i)$$

Clearly, the joint distribution over V does not factorize relative to D in a typical fashion, since Markovianity is no longer valid, but it does relative to both V and L. Still, CI relations can be read from the graph using a graphical criterion known as d-separation. Also, two causal graphs are called Markov equivalent whenever they share the same set of conditional independences over V.

Soft Interventions: Another common type of intervention is soft, where the original conditional distributions of the intervened variables X are replaced with new ones, without completely eliminating the causal effect of the parents. Accordingly, the interventional distribution Px(v) becomes as follows, where P′(Xi | Pai) ≠ P(Xi | Pai) is the new conditional distribution set by the intervention:

$$P_{x}(v) = \sum_{L} \prod_{\{i \mid X_i \in X\}} P'(x_i \mid pa_i) \prod_{\{j \mid T_j \notin X\}} P(t_j \mid pa_j)$$

In this work, we assume that all the soft interventions are controlled.
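As an illustrative sketch of this factorization (our toy example; the three-variable model and all parameters are invented), Px(v) is computed by swapping the new mechanism P′ in for the intervened variable while the remaining conditionals, including the latent's, are untouched:

```python
from itertools import product

# Hypothetical graph: latent L -> X -> Y and L -> Y; soft intervention on X.
p_l = {0: 0.6, 1: 0.4}
p_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}            # P(X | L)
p_y = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
       (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.1, 1: 0.9}}  # P(Y | X, L)
q_x = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}            # P'(X | L), new mechanism

def interventional(v_x, v_y):
    # Marginalize the latent L out, with P' substituted only for X.
    return sum(p_l[l] * q_x[l][v_x] * p_y[(v_x, l)][v_y] for l in (0, 1))

px = {(x, y): interventional(x, y) for x, y in product((0, 1), repeat=2)}
assert abs(sum(px.values()) - 1.0) < 1e-12   # P_x is a proper distribution
```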
Concretely, the interventions are controlled if, for any two interventions I, J ⊆ V where Xi ∈ I ∩ J, we have PI(Xi | Pai) = PJ(Xi | Pai).

Ancestral graphs: We now introduce a graphical representation of equivalence classes of causal graphs with latent nodes. A mixed graph can contain directed and bidirected edges. A is an ancestor of B if there is a directed path from A to B. A is a spouse of B if A ↔ B is present. If A is both a spouse and an ancestor of B, this creates an almost directed cycle. An inducing path relative to L is a path on which every non-endpoint node X ∉ L is a collider on the path (i.e., both edges incident to the node are into it) and every collider is an ancestor of an endpoint of the path. A mixed graph is ancestral if it does not contain a directed or almost directed cycle. It is maximal if there is no inducing path (relative to the empty set) between any two non-adjacent nodes. A Maximal Ancestral Graph (MAG) is a graph that is both ancestral and maximal [19]. Given a causal graph D(V, L), a MAG MD over V can be constructed such that both the independence and the ancestral relations among the variables in V are retained; see, for example, [27, p. 6].

A triple ⟨X, Y, Z⟩ is an unshielded triple if X and Y are adjacent, Y and Z are adjacent, and X and Z are not adjacent. If both edges are into Y, the triple is referred to as an unshielded collider. A path between X and Y, p = ⟨X, . . . , W, Z, Y⟩, is a discriminating path for Z if (1) p includes at least three edges; (2) Z is a non-endpoint node on p and is adjacent to Y on p; and (3) X is not adjacent to Y, and every node between X and Z is a collider on p and is a parent of Y.
Two MAGs are Markov equivalent if and only if (1) they have the same adjacencies; (2) they have the same unshielded colliders; and (3) if a path p is a discriminating path for a vertex Z in both graphs, then Z is a collider on the path in one graph if and only if it is a collider on the path in the other. A PAG, which represents a Markov equivalence class of a MAG, is learnable from the independence model over the observed variables, and the FCI algorithm is a standard sound and complete method to learn such an object [28].

Related Work: Learning causal graphs from a combination of observational and interventional data has been studied in the literature [3, 11, 7, 20, 8, 12, 23]. For causally sufficient systems, the notion and characterization of interventional Markov equivalence were introduced in [9, 10]. More recently, [24] showed that the same characterization can be used for both hard and soft interventions. For causally insufficient systems, [22] uses SAT solvers to learn a summary graph over the observed variables given data from different experimental conditions. [13] introduces an algorithm to pool experimental datasets together and runs a modification of FCI to learn an augmented graph; however, they do not consider characterizing an equivalence class.

Notations: For random variables X, Y, Z, the CI relation "X is independent of Y conditioned on Z" is written X ⊥⊥ Y | Z. The d-separation statement "node X is d-separated from Y given Z in graph D" is written (X ⊥⊥ Y | Z)_D. I ⊆ 2^V is reserved for a set of interventions, where 2^V is the power set of V. The symmetric difference is denoted by I△J := (I \ J) ∪ (J \ I). $D_{\overline{X}}$ denotes the graph obtained from D where all the edges incoming to the nodes in X are removed. Similarly, $D_{\underline{X}}$ denotes the removal of the edges outgoing from the nodes in X. We assume that there is no selection bias.
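A minimal sketch (our illustration) of the two mutilation operators, with graphs represented as sets of directed edges:

```python
# D_overline(X) removes edges into X; D_underline(X) removes edges out of X.
# A graph is a set of directed edges (a, b) meaning a -> b; bidirected edges
# are modeled here by explicit latent parents.
def remove_incoming(edges, X):
    """The graph with all incoming edges to nodes in X removed (overline X)."""
    return {(a, b) for (a, b) in edges if b not in X}

def remove_outgoing(edges, X):
    """The graph with all outgoing edges from nodes in X removed (underline X)."""
    return {(a, b) for (a, b) in edges if a not in X}

# Hypothetical example: L is a latent confounder of X and Y, plus Z -> Y.
D = {("X", "Y"), ("Z", "Y"), ("L", "X"), ("L", "Y")}
assert remove_incoming(D, {"X"}) == {("X", "Y"), ("Z", "Y"), ("L", "Y")}
assert remove_outgoing(D, {"X"}) == {("Z", "Y"), ("L", "X"), ("L", "Y")}
```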
A star on an endpoint of an edge (∗–∗) is used as a wildcard to denote circle, arrowhead, or tail.

3 Do-Constraints – Combining Observational and Experimental Distributions

One of the most celebrated results in causal inference comes under the rubric of do-calculus (or causal calculus) [18, 16]. The calculus consists of a set of inference rules that allow one to create a map between distributions generated by a causal graph when certain graphical conditions hold in the graph. The calculus was developed in the context of hard interventions, and recent work presented a generalization of this result to soft interventions [4], which we state next:

Theorem 1 (Special case of Thm. 1 in [4]). Let D = (V ∪ L, E) be a causal graph. Then, the following holds for any strictly positive distribution consistent with D.

Rule 1 (see-see): For any X ⊆ V and disjoint Y, Z, W ⊆ V,
Px(y | w, z) = Px(y | w)  if Y ⊥⊥ Z | W in D.

Rule 2 (do-see): For any disjoint X, Y, Z ⊆ V and W ⊂ V \ (Z ∪ Y),
Px,z(y | z, w) = Px(y | z, w)  if Y ⊥⊥ Z | W in $D_{\underline{Z}}$.

Rule 3 (do-do): For any disjoint X, Y, Z ⊆ V and W ⊂ V \ (Z ∪ Y),
Px,z(y | w) = Px(y | w)  if Y ⊥⊥ Z | W in $D_{\overline{Z(W)}}$,
where Z(W) ⊆ Z are the non-ancestors of W in D.

The first rule of the calculus is a d-separation type of statement relative to a specific interventional distribution Px, which says that Y ⊥⊥ Z | W in D implies the corresponding conditional independence Px(y | w, z) = Px(y | w). Note that the converse of this rule is the workhorse underlying most of the structure learning algorithms found in practice: if some independence holds in P, this implies a corresponding graphical separation (under faithfulness).
In the case just mentioned, this would imply that Y and Z are separated in D, meaning that they have neither a directed nor a bidirected arrow connecting them.

From this understanding, we make a very simple, albeit powerful, observation: the converse of the other two rules should offer insights about the underlying graphical structure as well. To witness, consider the causal graph D = {X → Y, X ↔ Y}, and suppose we have the observational and interventional distributions P(Y, X) and PX(Y, X), respectively. Using the CI tests P(Y, X) ≠ P(Y)·P(X) and PX(Y, X) ≠ PX(Y)·PX(X), we infer that the two variables are dependent and consequently d-connected in the graph, while no claim can be made about the causal relation between them. Given the inequality PX(Y) ≠ P(Y), we infer that the condition for rule 3 does not hold and Y ⊥̸⊥ X in $D_{\overline{X}}$. Hence, X must be a cause of Y – changing the value of X has a downstream effect on Y. Similarly, given the inequality PX(Y|X) ≠ P(Y|X), the condition related to rule 2 does not hold, and Y ⊥̸⊥ X in $D_{\underline{X}}$. The implication in this case is that there is an unblockable backdoor path between X and Y that is into X, i.e., a latent variable. Alternatively, if D = {X → Y}, then PX(Y|X) = P(Y|X) under faithfulness implies the absence of a latent variable by the converse of rule 2.

Broadly speaking, rule 3 allows one to infer causal relations between variables, and consequently directed edges in the causal graph. Since the compared interventional distributions differ by a subset of interventions (Z), we call this the do-do test. On the other hand, rule 2 allows one to infer spurious relations between variables, and consequently latent variables in the causal graph⁵.
The do-see naming of the test stems from the fact that we compare a distribution with an intervention on a subset Z (do) against another which only conditions on Z (see). Naturally, rule 1 is the usual conditional independence test that allows one to detect that neither a directed nor a bidirected arrow exists.

Putting together these rules, we show in Corollary 1 a generalization of rules 2 and 3. Note that rule 2 appears when J ⊂ I and I \ J ⊆ W; similarly, rule 3 can be seen when J ⊂ I and (I \ J) ∩ W = ∅.

Corollary 1 (mixed do-do/do-see). Let D = (V ∪ L, E) be a causal graph. Under the controlled intervention assumption, for any I, J ⊆ V and disjoint Y, W ⊆ V, we have the following:

PI(y | w) = PJ(y | w)  if Y ⊥⊥ K | W \ W_K in $D_{\underline{W_K}, \overline{R(W)}}$,

where K := I△J, W_K := W ∩ K, R := K \ W_K, and R(W) ⊆ R are the non-ancestors of W in D.

⁵ More precisely, rule 2 allows us to detect inducing paths that are into both variables.

[Figure 2: four panels over the nodes Fx, X, Z, W, Y — (a) Aug_I(D); (b) Aug_I(D′); (c) MAG(Aug_I(D)); (d) MAG(Aug_I(D′)).]

Figure 2: Augmented graphs with respect to I = {∅, {X}} and the corresponding augmented MAGs.

In general, the proposed rule is a mixture of rules 2 and 3, as we could be conditioning in W on a subset of the symmetric difference set I△J. For instance, consider the causal graph D = {C ↔ A → B, C ↔ B} and suppose we have the interventional distributions PA,B and PC,B. Since B ⊥⊥ {A, C} in $D_{\underline{A}, \overline{C}}$, we have PA,B(B | A) = PB,C(B | A).
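Statements like the one above can be checked mechanically with the standard moralization criterion for d-separation; the sketch below (our code, not the authors') represents bidirected edges via explicit latent parents L1 and L2 and verifies the example:

```python
from itertools import combinations

def d_separated(edges, X, Y, Z):
    """Moralization test for X _||_ Y | Z in a DAG given as (a, b) = a -> b edges."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    # 1. Keep only the ancestral closure of X, Y, Z.
    anc, stack = set(), list(X | Y | Z)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents.get(n, ()))
    # 2. Moralize: drop directions and marry co-parents.
    und = {frozenset(e) for e in edges if e[0] in anc and e[1] in anc}
    for b in anc:
        for a1, a2 in combinations(sorted(parents.get(b, set()) & anc), 2):
            und.add(frozenset((a1, a2)))
    # 3. X and Y are d-separated given Z iff Z blocks every moral-graph path.
    seen, stack = set(Z), list(X)
    while stack:
        n = stack.pop()
        if n in Y:
            return False
        if n not in seen:
            seen.add(n)
            stack.extend(m for e in und if n in e for m in e if m != n)
    return True

# D = {C <-> A -> B, C <-> B}, bidirected edges via latents L1 (C<->A), L2 (C<->B).
D = {("A", "B"), ("L1", "C"), ("L1", "A"), ("L2", "C"), ("L2", "B")}
# Mutilate: remove edges out of A (underline A) and into C (overline C).
mut = {(a, b) for (a, b) in D if a != "A" and b != "C"}
print(d_separated(mut, {"B"}, {"A", "C"}, set()))   # True: B _||_ {A, C}
```

In the unmutilated D the same query fails (A → B is a direct edge), which matches the intuition that the invariance only holds once A's outgoing and C's incoming edges are cut.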
This generalization will soon play a significant role in the characterization and learning of the interventional equivalence class.

4 Interventional Markov Equivalence under Do-Constraints

In this section, the new do-constraints will be used to define the notion of interventional Markov equivalence. Then, we will characterize when two causal graphs are equivalent according to the proposed definition. We start by defining the interventional Markov property, as shown below.

Definition 1. Consider the tuple of absolutely continuous probability distributions (PI)_{I∈I} over a set of variables V. A tuple (PI)_{I∈I} satisfies the I-Markov property with respect to a graph D = (V ∪ L, E) if the following holds for disjoint Y, Z, W ⊆ V:
(1) For I ∈ I:  PI(y | w, z) = PI(y | w)  if Y ⊥⊥ Z | W in D.
(2) For I, J ∈ I:  PI(y | w) = PJ(y | w)  if Y ⊥⊥ K | W \ W_K in $D_{\underline{W_K}, \overline{R(W)}}$,
where K := I△J, W_K := W ∩ K, R := K \ W_K, and R(W) ⊆ R are the non-ancestors of W in D.
The set of all tuples that satisfy the I-Markov property with respect to D is denoted by P_I(D, V).

The two conditions used in the definition correspond to rule 1 of Theorem 1 and to Corollary 1, respectively. Notice that the traditional Markov definition only considers the first condition over the observational distribution P(V); this case is included in the I-Markov property whenever ∅ ∈ I. Accordingly, two causal graphs are said to be I-Markov equivalent if they license the same set of distribution tuples. This notion is formalized in the following definition.
Definition 2. Given two causal graphs D1 = (V ∪ L1, E1) and D2 = (V ∪ L2, E2), and an intervention set I ⊆ 2^V, D1 and D2 are called I-Markov equivalent if P_I(D1, V) = P_I(D2, V).

One challenge with Definition 1 is that testing for the d-separation statement in condition (2) requires a mutilated graph where we cut some of the edges in D. This makes it harder to represent all the constraints imposed by a causal graph compactly. Accordingly, we use the notion of an augmented graph, introduced below (Definition 3). In words, the construction of the augmented graph goes as follows. First, initialize the augmented graph to the input causal graph. Then, for every distinct symmetric set difference between I, J ∈ I, denoted by Si, introduce a new node Fi and make it a parent of each node in Si, i.e., Fi → S for every S ∈ Si. Note that this type of construction has been used in the literature to model interventions [17, 6]. For example, for I = {∅, {X}}, Figure 2a presents the augmented graph corresponding to the causal graph, which is the induced subgraph over {X, W, Z, Y}. Node Fx is added in accordance with the symmetric difference set (∅ \ {X}) ∪ ({X} \ ∅) = {X}.

Definition 3 (Augmented graph). Consider a causal graph D = (V ∪ L, E) and an intervention set I ⊆ 2^V. Let S = {S1, S2, . . . , Sk} = {S : ∃ I, J ∈ I s.t. I△J = S}. The augmented graph of D with respect to I, denoted Aug_I(D), is the graph constructed as follows: Aug_I(D) = (V ∪ F, E ∪ E′), where F := {Fi}_{i∈[k]} and E′ = {(Fi, j)}_{i∈[k], j∈Si}.

The significance of the augmented graph construction is illustrated by Proposition 1, which provides criteria to test the d-separation statements of Definition 1 equivalently in the corresponding augmented graph of a causal graph.
Back to the example in Figure 2a, the statement Y ⊥⊥ X | Z in $D_{\overline{X}}$ can be equivalently tested by the statement Y ⊥⊥ Fx | Z in the corresponding augmented graph. Similarly, Y ⊥⊥ X in $D_{\underline{X}}$ can be equivalently tested by Y ⊥⊥ Fx | X in Aug_I(D).

Proposition 1. Consider a causal graph D = (V ∪ L, E) and the corresponding augmented graph Aug_I(D) = (V ∪ L ∪ F, E ∪ E′) with respect to an intervention set I, where F = {Fi}_{i∈[k]}. Let Si be the set of nodes adjacent to Fi, ∀i ∈ [k]. We have the following equivalence relations.

For disjoint Y, Z, W ⊆ V:

$$(Y \perp\!\!\perp Z \mid W)_{D} \iff (Y \perp\!\!\perp Z \mid W, F_{[k]})_{\mathrm{Aug}(D)} \tag{1}$$

For disjoint Y, W ⊆ V, where Wi := W ∩ Si and R := Si \ Wi:

$$(Y \perp\!\!\perp S_i \mid W \setminus W_i)_{D_{\underline{W_i}, \overline{R(W)}}} \iff (Y \perp\!\!\perp F_i \mid W, F_{[k] \setminus \{i\}})_{\mathrm{Aug}(D)} \tag{2}$$

In order to characterize causal graphs that are I-Markov equivalent, we draw some insight from the Markov equivalence of causal graphs with latents. Ancestral graphs, and more specifically MAGs, were proposed as a representation that encodes the d-separation statements of a causal graph among the measured variables while not explicitly encoding the latent nodes. The definition below (Def. 4) introduces the augmented MAG, which is constructed over an augmented graph. Since all the constraints in the I-Markov definition can be tested by d-separation statements in the augmented graph, an augmented MAG preserves all those constraints. For example, Figs. 2c and 2d present the augmented MAGs corresponding to the augmented graphs in Figs. 2a and 2b, respectively. Notice that Fx and Y are adjacent in both MAGs since they are not separable by any set in the augmented graphs.
Definition 4 (Augmented MAG). Given a causal graph D = (V ∪ L, E) and an intervention set I, the augmented MAG is the MAG constructed over V from Aug_I(D), i.e., MAG(Aug_I(D)).

Below, we derive a characterization for two causal graphs to be I-Markov equivalent: two causal graphs are I-Markov equivalent if and only if their corresponding augmented MAGs satisfy the three conditions given in Theorem 2. For example, the two augmented MAGs in Figures 2c and 2d satisfy the three conditions; hence, the original causal graphs are in the same I-Markov equivalence class.

Theorem 2. Two causal graphs D1 = (V ∪ L1, E1) and D2 = (V ∪ L2, E2) are I-Markov equivalent for a set of controlled experiments I if and only if, for M1 = MAG(Aug_I(D1)) and M2 = MAG(Aug_I(D2)):

1. M1 and M2 have the same skeleton;
2. M1 and M2 have the same unshielded colliders;
3. if a path p is a discriminating path for a node Y in both M1 and M2, then Y is a collider on the path in one graph if and only if it is a collider on the path in the other.

5 Learning by Combining Observations and Experiments

In this section, we develop an algorithm to learn the augmented graph from a combination of observational and interventional data, which consequently recovers the causal graph. However, similar to the observational case, it is typically impossible to completely determine the causal graph from the available measured data, especially when latents are present. The objective, then, is to learn a class of augmented MAGs consistent with the data. For this, we define an augmented PAG as follows.

Definition 5. Given a causal graph D and an intervention set I, let M = MAG(Aug_I(D)) and let [M] be the set of augmented MAGs corresponding to all the causal graphs that are I-Markov equivalent to D. An Augmented PAG for D, denoted G = PAG(Aug_I(D)), is a graph such that:

1. G has the same adjacencies as M, and as any member of [M]; and
every non-circle mark in G is an invariant mark in [M].

As with any learning algorithm, some faithfulness assumption is needed to infer graphical properties from the corresponding distributional constraints. Hence, we assume that the given interventional distributions are c-faithful to the causal graph D as defined below.

Algorithm 1 Algorithm for Learning Augmented PAG
1: function LearnAugPAG(I, (PI)I∈I, V)
2:   (F, S, σ) ← CreateAugmentedNodes(I, V)
3:   V ← V ∪ F
4:   Phase I: Learn Adjacencies and Separating Sets
5:   Form the complete graph G on V where between every pair of nodes there is an edge ◦−◦.
6:   for every pair X, Y ∈ V do
7:     if X ∈ F ∧ Y ∈ F then
8:       SepSet(X, Y) ← ∅, SepFlag(X, Y) = True
9:     else
10:      (SepSet(X, Y), SepFlag) ← Do-Constraints((PI)I∈I, X, Y, V, F, σ)
11:    if SepFlag = True then
12:      Remove the edge between X, Y in G.
13:  Phase II: Learn Unshielded Colliders
14:  For every unshielded triple ⟨X, Z, Y⟩ in G, orient it as X∗→ Z ←∗Y iff Z ∉ SepSet(X, Y)
15:  Phase III: Apply Orientation Rules
16:  Apply the 7 FCI rules in [28] together with the following 2 additional rules until none applies.
17:  Rule 8: For any Fk ∈ F, orient adjacent edges out of Fk.
18:  Rule 9: For any Fk ∈ F that is adjacent to a node Y ∉ Sk:
19:    If |Sk| = 1, orient X ∗−∗ Y as X → Y for X ∈ Sk.

Algorithm 2 Creating F-nodes.
1: function CreateAugmentedNodes(I, V)
2:   F = ∅, S = ∅, k = 0, σ : ℕ → 2^V × 2^V
3:   for all pairs I, J ∈ I such that I△J ∉ S do
4:     Set k ← k + 1, set Sk = I△J, add Fk to F, add Sk to S, set σ(k) = (I, J).
5:   return F, S, σ

Definition 6. Consider a causal graph D = (V ∪ L, E).
A tuple of distributions (PI)I∈I ∈ P(D, V) is called c-faithful to the graph D if the converse of each of the conditions given in Definition 1 holds.

Algorithm 1 presents a modification of the FCI algorithm to learn augmented PAGs. To explain the algorithm, we first describe FCI which, given an independence model over the measured variables, proceeds in three phases [25]: In phase I, the algorithm initializes a complete graph with circle edges (◦−◦), then removes the edge between any pair of nodes if a separating set between the pair exists, and records the set. In phase II, the algorithm identifies unshielded triples ⟨A, B, C⟩ and orients the edges into B if B is not in the separating set of A and C. Finally, in phase III, FCI applies the orientation rules. Only one of the rules uses separating sets, while the rest use MAG properties and the soundness and completeness of the previous phases – that the skeleton is correct and all the unshielded colliders are discovered. We note that FCI looks for any separating sets, not necessarily minimal ones. We also observe that if two nodes X, Y are separated given Z in AugI(D), they are also separated given Z ∪ F, since the F-nodes are root nodes by construction, i.e., all the edges incident on F-nodes are out of them.

Algorithm 1 follows a flow similar to that of FCI. In phase I, it learns the skeleton of the augmented PAG. Function CreateAugmentedNodes(·) in Alg. 2 creates the F-nodes by computing the set S of unique symmetric-difference sets from all pairs of interventions in I. Sigma (σ) maps every F-node to a source pair of interventions, which is used later on to perform the do-tests. The algorithm starts by creating a complete graph of circle edges between V ∪ F. Then, it removes the edge between any two nodes X and Y if a separating set exists.
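To make the F-node construction concrete, the symmetric-difference step of Alg. 2 can be sketched in Python. This is a minimal illustration, not the authors' implementation; the function name and the representation of intervention targets as Python sets are our own assumptions.

```python
def create_augmented_nodes(interventions):
    """Sketch of Alg. 2: one F-node per unique symmetric difference
    I triangle J over all pairs of intervention target sets."""
    F, S, sigma = [], [], {}
    k = 0
    for i in range(len(interventions)):
        for j in range(i + 1, len(interventions)):
            I, J = interventions[i], interventions[j]
            diff = frozenset(I) ^ frozenset(J)  # symmetric difference I triangle J
            if diff not in S:                   # keep only unique difference sets
                k += 1
                F.append(f"F{k}")               # label of the new F-node
                S.append(diff)                  # its associated target set S_k
                sigma[k] = (I, J)               # source pair, used for the do-tests
    return F, S, sigma

# I = {emptyset, {X}} as in the later Figure 3 example yields a single
# F-node F1 with S_1 = {X} and sigma(1) = (emptyset, {X}).
F, S, sigma = create_augmented_nodes([set(), {"X"}])
```

The list membership test on `frozenset` values is what enforces the "unique symmetric difference sets" condition, so two pairs of interventions that differ in the same targets share one F-node.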
If the two nodes are F-nodes, then they are separated by the empty set by construction. Otherwise, it calls the function Do-Constraints(·) in Alg. 3 to search for a separating set using the corresponding do-constraints. The routine works as follows: if the two nodes are random variables (and not F-nodes), an arbitrary distribution is chosen and we search for a subset W that establishes conditional independence between X and Y (rule 1 of Thm. 1). Else, one of the two nodes is an F-node; without loss of generality, we choose it to be X. The algorithm then looks for a subset W that satisfies the invariance of Corollary 1, i.e., PI(y|w) = PJ(y|w).

Phase II of Alg. 1 is similar to its FCI counterpart. For the edge orientation phase, note that the augmented MAG is indeed a MAG, hence all the FCI orientation rules still apply. Therefore, phase III uses the FCI orientation rules along with the following two new ones.

Algorithm 3 Find m-separation sets via Calculus Tests.
1: function Do-Constraints(I, (PI)I∈I, X, Y, V, F, σ)
2:   SepSet = ∅, SepFlag = False
3:   if X ∉ F ∧ Y ∉ F then
4:     Pick I ∈ I arbitrarily.
5:     for W ⊆ V \ F do
6:       if PI(y|w, x) = PI(y|w) then SepSet = W ∪ F, SepFlag = True
7:   else
8:     Suppose X ∈ F, Y ∉ F and X = Fi without loss of generality.
9:     (I, J) = σ(i)
10:    for W ⊆ V \ (F ∪ Y) do
11:      if PI(y|w) = PJ(y|w) then SepSet = W ∪ F \ {Fi}, SepFlag = True
12:  return (SepSet, SepFlag)

Figure 3: An example of learning the augmented PAG from the distributions P, Px consistent with the given causal graph: (a) causal graph D, (b) MAG(Aug(D)), (c) before rule 9, (d) after rule 9. Rule 9 allows orienting the tail at X◦→ Y.
The algorithm keeps applying the rules until none applies.

Rule 8 (F-node Edges): For any edge adjacent to an F-node, orient the edge out of the F-node.

Rule 9 (Inducing Paths): If Fk ∈ F is adjacent to a node Y ∉ Sk and |Sk| = 1, e.g., Sk = {X}, then orient X ∗−∗ Y out of X, i.e., X → Y. The intuition for this rule is as follows: If Fk is adjacent to a node Y ∉ Sk in G, then there is an inducing path p between Fk and Y in AugI(D), where D is any causal graph in the equivalence class. Since Fk is a root node and by the properties of inducing paths, the subpath of p from X to Y is an inducing path as well, and X is an ancestor of Y in AugI(D). Hence, the edge between X and Y is out of X and into Y in MAG(AugI(D)) and consequently in G.

We give an example to illustrate the steps of the algorithm in Figure 3, where I = {∅, {X}}. Figure 3a shows the augmented causal graph, i.e., AugI(D), and Figure 3b shows the corresponding augmented MAG, i.e., MAG(AugI(D)). Nodes Fx and Z are separable in AugI(D) given the empty set, and this can be tested by the do-constraint P(Z) = PX(Z). Similarly, we can infer the separation of Fx and W by the test P(W|X) = PX(W|X). Figure 3c shows the graph obtained after applying the seven rules of FCI together with Rule 8. Finally, by applying Rule 9, we infer that the edge between X and Y has a tail at X and we obtain the graph in Figure 3d. The soundness of the algorithm is shown next.

Theorem 3. Consider a set of interventional distributions (PI)I∈I c-faithful to a causal graph D = (V ∪ L, E), where I is a set of controlled experiments. Algorithm 1 is sound, i.e., every adjacency and orientation is common to all MAG(Aug(D′)) where D′ is I-Markov equivalent to D.

6 Conclusions

We investigate the problem of learning the causal structure underlying a phenomenon of interest from a combination of observational and experimental data.
We pursue this endeavor by noting that a generalization of the converse of Pearl's do-calculus (Thm. 1) leads to new tests that can be evaluated against data. These tests, in turn, translate into constraints over the structure itself. We then define an interventional equivalence class based on such criteria (Def. 1), and derive a graphical characterization for the equivalence of two causal graphs (Thm. 2). Finally, we develop an algorithm to learn an interventional equivalence class from data, which includes new orientation rules.

Acknowledgements

Bareinboim and Jaber are supported in part by grants from NSF IIS-1704352, IIS-1750807 (CAREER), IBM Research, and Adobe Research. Kocaoglu and Shanmugam are supported by the MIT-IBM Watson AI Lab.

References

[1] Elias Bareinboim, Carlos Brito, and Judea Pearl. Local characterizations of causal Bayesian networks. In Graph Structures for Knowledge Representation and Reasoning (IJCAI), pages 1–17. Springer Berlin Heidelberg, 2012.

[2] Elias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, July 2016.

[3] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.

[4] Juan Correa and Elias Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. Technical Report R-51, Causal Artificial Intelligence Laboratory, Columbia University, New York, 2019.

[5] A. Philip Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B, 41(1):1–31, 1979.

[6] A. Philip Dawid. Influence diagrams for causal modelling and inference. International Statistical Review, 70:161–189, 2002.

[7] Frederick Eberhardt. Causation and Intervention.
PhD thesis, Department of Philosophy, Carnegie Mellon University, 2007.

[8] AmirEmad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Elias Bareinboim. Budgeted experiment design for causal structure learning. In International Conference on Machine Learning (ICML), pages 1719–1728, 2018.

[9] Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13(1):2409–2464, 2012.

[10] Alain Hauser and Peter Bühlmann. Two optimal strategies for active learning of causal networks from interventional data. In Proceedings of the Sixth European Workshop on Probabilistic Graphical Models, 2012.

[11] Antti Hyttinen, Frederick Eberhardt, and Patrik Hoyer. Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071, 2013.

[12] Murat Kocaoglu, Karthikeyan Shanmugam, and Elias Bareinboim. Experimental design for learning causal graphs with latent variables. In Advances in Neural Information Processing Systems, pages 7018–7028, 2017.

[13] Sara Magliacane, Tom Claassen, and Joris M. Mooij. Joint causal inference on observational and experimental datasets. arXiv preprint arXiv:1611.10351, 2016.

[14] Christopher Meek. Strong completeness and faithfulness in Bayesian networks. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 411–418. Morgan Kaufmann Publishers Inc., 1995.

[15] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA, 1988.

[16] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000. 2nd edition, 2009.

[17] Judea Pearl. Aspects of graphical models connected with causality. Proceedings of the 49th Session of the International Statistical Institute, 1(August):399–401, 1993.

[18] Judea Pearl.
Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.

[19] Thomas Richardson and Peter Spirtes. Ancestral graph Markov models. The Annals of Statistics, 30(4):962–1030, 2002.

[20] Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G. Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pages 3195–3203, 2015.

[21] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. A Bradford Book, 2001.

[22] Sofia Triantafillou and Ioannis Tsamardinos. Constraint-based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research, 16(1):2147–2205, January 2015.

[23] Yuhao Wang, Liam Solus, Karren Yang, and Caroline Uhler. Permutation-based causal inference algorithms with interventions. In Advances in Neural Information Processing Systems, pages 5822–5831, 2017.

[24] Karren Yang, Abigail Katoff, and Caroline Uhler. Characterizing and learning equivalence classes of causal DAGs under interventions. In ICML, 2018.

[25] Jiji Zhang. Causal inference and reasoning in causally insufficient systems. PhD thesis, Department of Philosophy, Carnegie Mellon University, 2006.

[26] Jiji Zhang. A characterization of Markov equivalence classes for directed acyclic graphs with latent variables. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, UAI'07, pages 450–457. AUAI Press, 2007.

[27] Jiji Zhang. Causal reasoning with ancestral graphs. Journal of Machine Learning Research, 9(Jul):1437–1474, 2008.

[28] Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias.
Artificial Intelligence, 172(16):1873–1896, 2008.