{"title": "Identifying Causal Effects via Context-specific Independence Relations", "book": "Advances in Neural Information Processing Systems", "page_first": 2804, "page_last": 2814, "abstract": "Causal effect identification considers whether an interventional probability distribution can be uniquely determined from a passively observed distribution in a given causal structure. If the generating system induces context-specific independence (CSI) relations, the existing identification procedures and criteria based on do-calculus are inherently incomplete. We show that deciding causal effect non-identifiability is NP-hard in the presence of CSIs. Motivated by this, we design a calculus and an automated search procedure for identifying causal effects in the presence of CSIs. The approach is provably sound and it includes standard do-calculus as a special case. With the approach we can obtain identifying formulas that were unobtainable previously, and demonstrate that a small number of CSI-relations may be sufficient to turn a previously non-identifiable instance to identifiable.", "full_text": "Identifying Causal Effects\n\nvia Context-speci\ufb01c Independence Relations\n\nSanttu Tikka\n\nDepartment of Mathematics and Statistics\n\nUniversity of Jyvaskyla, Finland\n\nsanttu.tikka@jyu.fi\n\nAntti Hyttinen\n\nHIIT, Department of Computer Science\n\nUniversity of Helsinki, Finland\nantti.hyttinen@helsinki.fi\n\nJuha Karvanen\n\nDepartment of Mathematics and Statistics\n\nUniversity of Jyvaskyla, Finland\n\njuha.t.karvanen@jyu.fi\n\nAbstract\n\nCausal effect identi\ufb01cation considers whether an interventional probability dis-\ntribution can be uniquely determined from a passively observed distribution in\na given causal structure. If the generating system induces context-speci\ufb01c inde-\npendence (CSI) relations, the existing identi\ufb01cation procedures and criteria based\non do-calculus are inherently incomplete. 
We show that deciding causal effect non-identifiability is NP-hard in the presence of CSIs. Motivated by this, we design a calculus and an automated search procedure for identifying causal effects in the presence of CSIs. The approach is provably sound and it includes standard do-calculus as a special case. With the approach we can obtain identifying formulas that were unobtainable previously, and demonstrate that a small number of CSI-relations may be sufficient to turn a previously non-identifiable instance into an identifiable one.

1 Introduction

Statistical independence of random variables is a central concept in any data analysis and prediction task. An important generalization of this concept is context-specific independence (CSI) [26, 6]. For a simple example, consider an antibiotic that normally has a dose–response effect on the number of bacteria. A genetic mutation makes the bacteria resistant to the antibiotic, meaning that in the context of this mutation the dose and the number of bacteria are independent. CSI-relations have been utilized to analyze, for example, gene expression data [2], dynamics of pneumonia [33], prognosis of heart disease [22], proteins [15], parliament elections [22] and occurrence of plants [22]. CSIs have also been used to speed up exact probabilistic inference [8, 12] and to improve structure learning [9, 19]. However, CSIs have received much less attention in causal inference and in particular in causal effect identifiability, despite their great potential in allowing for further identifiability results.

In the structural causal model (SCM) framework, the knowledge about the causal mechanisms under investigation is represented as a directed acyclic graph (DAG). 
When some nodes represent unobserved latent variables, all information can be determined from a corresponding semi-Markovian graph. Assuming the qualitative information given by the graph, the aim in causal effect identification is to determine whether a causal effect P(Y | do(X), Z) can be uniquely determined from the available passively observed distribution. The known causal structure, whichever formalism is used, specifies (generalized) conditional independence properties of the system through d-separation. These warrant the manipulation of interventional distributions with the rules of do-calculus and thus the derivation

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

[Figure 1 here. Panel (b), the CPT P(X | A, L):
            X = 0   X = 1
AL = 00     0.9     0.1
AL = 01     0.9     0.1
AL = 10     0.5     0.5
AL = 11     0.4     0.6
Panel (c) is a decision tree that first branches on A, reaching a single leaf (0.1, 0.9) when A = 0 and, after branching on L when A = 1, the leaves (0.5, 0.5) and (0.6, 0.4). Panel (d) carries the labels A = 0 on L → X and AL = 1* on X → Y; panel (e) additionally has A I_X = 0*, *1 on L → X and I_X L = 1* on A → X.]

Figure 1: (a) L is a latent unobserved variable. (b) CPT for P(X | A, L). (c) Decision tree with P(X | A, L) given in the leaf nodes. (d) Corresponding labeled DAG (LDAG). (e) LDAG with an intervention node added for X.

of identifying formulas [23, 24]. The ID algorithm implements this inference: it can identify the causal effect whenever it can be non-parametrically identified [28, 17, 32].

When we have further information on the generative causal model, the completeness results of the previous approaches no longer apply: more causal effects become identifiable and do-calculus based methods will report false non-identifiability. One such piece of still qualitative information is the set of CSI-relations. One example is shown in Fig. 1(a). The causal effect P(Y | do(X)) is non-identifiable by do-calculus here due to the back-door path through the latent factor L. 
However, if we know the CSIs X ⊥⊥ L | A = 0 and X ⊥⊥ Y | A = 1, L, the causal effect is identifiable (see Eq. 1 in Sec. 4.2).

Accounting for CSIs imposes additional challenges for deciding causal effect identifiability and for the derivation of identifying formulas. Instead of graphical models for conditional independence, we need to employ inherently more complicated graphical models for CSI. As we shall show, the derivation of causal effects requires context-specific reasoning. All this is well worthwhile if it warrants the identifiability of new causal effects.

We formulate the problem of causal effect identifiability in the presence of CSIs for binary variables and show that deciding non-identifiability is NP-hard (Sec. 3). Motivated by this we develop a calculus, and a search procedure over the rules of the calculus (Secs. 4 and 5). To make our search feasible, we eliminate redundant contexts, implement new separation criteria and use a well-motivated heuristic. With these techniques we scale up to network sizes often reported in the literature. Most importantly, we show a host of examples where do-calculus cannot identify a causal effect but our search procedure leveraging CSIs can prove identifiability (Sec. 6). Impact on future research and alternative approaches are discussed in Sec. 7.

2 Preliminaries: Graphical Models for Context-specific Independence

Our starting point is causal effect identification over a DAG G = (V, E). The set W ⊆ V denotes the set of observed variables, marked by circular nodes. Since we also take into account the local structure, we mark any unobserved variables explicitly as rectangular nodes in the graph (as opposed to the semi-Markovian representation with bi-directed edges). The set pa(Y) denotes the parents of a node Y regardless of their observability. 
Notation x is used to denote an assignment to random variables X, and val(X) denotes the set of all possible assignments to X. All variables are assumed to be binary.

There are different ways of representing the local structure in the local conditional probability distribution (CPD) of a node given its parents [20, 9]. One of the most popular ways of modeling the local structure is to set some of the probabilities in the CPDs to be identical. For example, the conditional probability table (CPT) of X in Fig. 1(b) has identical probabilities in the first two rows. One way to model such local structure is to use decision trees as in Fig. 1(c); see Koller et al. [20] for others. Importantly, local structure induces local CSIs of the form Y ⊥⊥ X | pa(Y) \ X = ℓ, denoting that Y is independent of the value of a parent X when the other parents of Y are assigned the values ℓ. The local CPT in Fig. 1(b) implies X ⊥⊥ L | A = 0. The decision tree in Fig. 1(c) also shows this local CSI: once going down the branch with A = 0, the value of X is not influenced by the value of L.

In this paper, we employ the idea of Pensar et al. [25] and mark local CSIs as labels on the edges of the DAG. A DAG (V, E) together with a set of labels L defines a labeled DAG (LDAG) G = (V, E, L), where for each edge X → Y ∈ E there is a label L ∈ L, which is a (possibly empty) set of assignments to pa(Y) \ X, i.e., the other parents of Y. Each assignment in the label encodes a local CSI: if ℓ ∈ L, then Y ⊥⊥ X | pa(Y) \ X = ℓ. The symbol * is used as a shortcut notation for any value. For example, the label AL = 1* on X → Y in Fig. 1(d) implies that X ⊥⊥ Y | A = 1, L. Finally, throughout the paper, we restrict our attention to regular maximal LDAGs. Maximality requires that all labels that follow from other labels are recorded in the edges. 
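This row-comparison view of local CSIs can be made mechanical: a parent Z of X is locally irrelevant in a context ℓ exactly when the two CPT rows that differ only in Z coincide. A minimal sketch of this check (our own illustration; the helper name local_csis is hypothetical), applied to the CPT of Fig. 1(b):

```python
from itertools import product

def local_csis(cpt, parents):
    """Return pairs (Z, context) such that X is independent of its parent Z
    when the remaining parents are assigned as in `context`.
    `cpt` maps full parent assignments (tuples ordered as `parents`)
    to P(X = 1 | parents)."""
    found = []
    for i, z in enumerate(parents):
        others = [p for j, p in enumerate(parents) if j != i]
        for vals in product([0, 1], repeat=len(others)):
            rows = []
            for zval in (0, 1):
                key = list(vals)
                key.insert(i, zval)          # splice Z's value into position i
                rows.append(cpt[tuple(key)])
            if rows[0] == rows[1]:           # identical rows: Z is irrelevant here
                found.append((z, dict(zip(others, vals))))
    return found

# CPT of X from Fig. 1(b): P(X = 1 | A, L); the rows AL = 00 and AL = 01 coincide
cpt = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.6}
print(local_csis(cpt, ['A', 'L']))  # → [('L', {'A': 0})]
```

The single detected pair corresponds exactly to the label A = 0 on the edge L → X of Fig. 1(d).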
Regularity means that edges absent in every context are not included in the graph. See Pensar et al. [25] for details.

Any LDAG can be turned into a context s specific DAG by removing edges that are spurious (i.e., irrelevant) when the variables S have the values s, as follows. The nodes appearing in the label L on some X → Y can be partitioned into two sets A and B: nodes in A are assigned to a by the context s, while nodes in B are not. Then, the edge X → Y ∈ E is not present in the context s specific DAG (i.e., the edge is spurious) if (a, b) ∈ L for all possible assignments b. For example, the context A = 1 specific DAG of Fig. 1(d) is identical to the underlying DAG except for X → Y being absent.

A sufficient condition for a non-local CSI to be implied by an LDAG structure is given by the CSI-separation criterion [6]: if the sets of nodes X and Y are d-separated given C, S in the context s specific DAG of G, then X ⊥⊥ Y | C, s is implied by G. Note that d-separation is the special case where S = ∅. For example, the labeling in Fig. 1(d) implies that X ⊥⊥ L | A = 0 by this criterion, as the edge L → X is absent in the context A = 0 specific DAG.

We assume a positive distribution over the variables V [17]. This makes causal effects well-defined and justifies conditioning on any subset of variables or their particular assignments.

3 Causal Effect Identification for CSI-based Graphical Models

As the first contribution we formalize the causal identifiability problem in the presence of CSIs. Identifiability [24, 29] considers whether a causal effect can be uniquely identified in models with a given fixed structure. 
If an effect is non-identifiable, there are (at least) two models that agree on the observations and have the same given structure but disagree on the causal effect.

We use LDAGs to define identifiability in the presence of CSIs, as LDAGs offer a simple and intuitive visual view of the causal structure and local CSIs. The LDAG is assumed known based on background knowledge of the examined study, similarly as semi-Markovian graphs are standardly drawn for do-calculus. For example, consider (again) the case where an antibiotic A has a dose–response effect on H only if a genetic mutation M has not taken place. Hence, we would mark the label M = 1 on the edge A → H. Thus, the causal effect identification problem can be formulated as:

Input: An LDAG G over V, P(W) for W ⊆ V, a query P(Y | do(X), Z) s.t. Y, X, Z ⊂ W.
Task: Output a formula for P(Y | do(X), Z) over P(W), or decide that it is non-identifiable.

When no labels appear on the edges of an LDAG, the causal structure can be directly cast as a semi-Markovian graph. Thus, the setting of do-calculus is a special case of this one.

3.1 On Computational Complexity

In contrast to causal effect identifiability over semi-Markovian graphs, which has polynomial decision procedures [28, 17], taking local structure and CSIs into account makes the corresponding decision problem NP-hard. (The proofs of all theorems are given in the supplementary material.)

Theorem 1. 
Deciding non-identifiability of a causal effect given an LDAG over V and a passively observed distribution over W ⊆ V is NP-hard.

Rule 1 (Insertion/Deletion of observations):
P(Y | do(X), Z, W) = P(Y | do(X), W) if Y ⊥⊥ Z | X, W || X

Rule 2 (Action/Observation exchange):
P(Y | do(X), do(Z), W) = P(Y | do(X), Z, W) if Y ⊥⊥ I_Z | X, Z, W || X

Rule 3 (Insertion/Deletion of actions):
P(Y | do(X), do(Z), W) = P(Y | do(X), W) if Y ⊥⊥ I_Z | X, W || X

Figure 2: Rules of do-calculus. The sets X, Y, Z and W are disjoint. The notation || X means that the condition is evaluated in a graph in which edges into X are removed. I_Z denotes the intervention nodes of the variables Z (see Sec. 4.1).

The proof of Theorem 1 shows that 3-SAT can be reduced to the identifiability of P(Y | do(X)) from P(X, Y). On an intuitive level, the intricate structure in the local CPDs allows for representing instances of NP-hard decision problems. This result is related to the NP-hardness results for exact inference [10], the implication problem of CSIs [20, 11] and the complexity results for Halpern's actual causation [1]; however, we are not aware of other NP-hardness results for causal effect identifiability.

4 A Calculus for Determining Identifiability

In light of Theorem 1, fast algorithms for determining the identifiability of a causal effect may be generally unobtainable. Thus, we take here an approach similar to [14, 23, 16] and formulate a calculus, called CSI-calculus, which can be used to show identifiability for particular instantiations of the problem. CSI-calculus is an extension of the do-calculus of Fig. 2. In the first subsection we show that due to the versatile graphical model used (LDAG), we only need to consider identification of conditional probabilities (i.e., the do-operation is not needed). 
The second subsection gives the rules of CSI-calculus.

4.1 Reduction to the Identifiability of Conditional Probabilities in LDAGs

Interventions can be encoded naturally with the use of intervention variables and CSIs [6, 23, 13]. Here we show how this can be done for LDAGs. For any LDAG (V, E, L), we can construct an augmented LDAG that has the capacity to represent interventions as follows. Each node X ∈ V is augmented by an intervention node I_X and an edge I_X → X. If I_X = 0, then X is in its passive observational state determined by its parents pa(X). If I_X = 1, then X is intervened on and its value is determined independently of its parents.

For every X ∈ V and every label L_Z ∈ L of every incoming edge Z → X such that Z ≠ I_X, we construct the augmented label L′_Z by including the assignments I_X = *, pa(X) \ (I_X ∪ Z) = ℓ for every ℓ ∈ L_Z, and the assignments I_X = 1, pa(X) \ (I_X ∪ Z) = *. In other words, L′_Z renders the edge Z → X spurious when I_X = 1 or in any context where L_Z would. Fig. 1(e) shows an LDAG that is constructed from the LDAG in Fig. 1(d) by adding an intervention node for X.

Using the above construction, an interventional distribution P(Y | do(X)) is now simply a conditional distribution P(Y | X, I_X = 1). Thus, we can essentially drop the do-operator from the problem definition, and model interventions using intervention nodes and CSIs instead. To simplify the notation, we omit from formulas the intervention nodes of variables that are in their passive observational state. We do still include the do-operator when possible for improved readability.

4.2 Rules of the Calculus

Figure 3 describes the rules of CSI-calculus. In the rules we use terms that apply to all assignments (large letters) and to particular assignments (small letters). 
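The label augmentation above can be sketched concretely. In the sketch below (our own illustration, not the paper's implementation), a label is a list of assignment dictionaries over the other parents of the child node, with the * shortcut expanded into explicit binary assignments:

```python
from itertools import product

def augment(edges, labels, node):
    """Add an intervention node I_node for `node` and extend the labels of its
    incoming edges so that every such edge is spurious when I_node = 1."""
    inode = 'I_' + node
    new_edges = edges + [(inode, node)]
    new_labels = dict(labels)
    parents = [u for (u, v) in edges if v == node]
    for z in parents:
        other = [p for p in parents if p != z]   # other original parents of node
        lab = []
        for l in labels.get((z, node), []):      # old assignments, any I_node value
            for ival in (0, 1):
                lab.append(dict(l, **{inode: ival}))
        for vals in product([0, 1], repeat=len(other)):
            a = dict(zip(other, vals), **{inode: 1})  # I_node = 1, others free
            if a not in lab:
                lab.append(a)
        new_labels[(z, node)] = lab
    return new_edges, new_labels

# Fig. 1(d): labels A = 0 on L -> X and AL = 1* on X -> Y
edges = [('A', 'X'), ('L', 'X'), ('X', 'Y'), ('L', 'Y'), ('A', 'Y')]
labels = {('L', 'X'): [{'A': 0}],
          ('X', 'Y'): [{'A': 1, 'L': 0}, {'A': 1, 'L': 1}]}
new_edges, new_labels = augment(edges, labels, 'X')
```

Augmenting X reproduces the labels of Fig. 1(e): A I_X = 0*, *1 on L → X and I_X L = 1* on A → X.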
We do this in order to make the derivations shorter and the identifying formulas more understandable. A valid calculus can be formed by omitting all large letters, but our experiments (Sec. 6) suggest that such a calculus is far less efficient.

Rule 1 (Insertion/Deletion of observations):
P(Y1, y2 | Z1, z2, X1, x2) = P(Y1, y2 | X1, x2) if Y1, Y2 ⊥⊥ Z1, Z2 | X1, x2

Rule 2 (Marginalization/Sum-rule):
P(Y1, y2 | X1, x2) = Σ_Z P(Y1, y2, Z | X1, x2)

Rule 3 (Conditioning):
P(Y1 | Z1, z2, X1, x2) = P(Y1, Z1, z2 | X1, x2) / Σ_{Y1} P(Y1, Z1, z2 | X1, x2)

Rule 4 (Product-rule):
P(Y1, y2, Z1, z2 | X1, x2) = P(Y1, y2 | Z1, z2, X1, x2) P(Z1, z2 | X1, x2)

Rule 5 (Case-by-general reasoning (a)):
P(Y1, y2, 1 − z | X1, x2) = P(Y1, y2 | X1, x2) − P(Y1, y2, z | X1, x2)

Rule 6 (General-by-case reasoning):
P(Y1, y2, Z | X1, x2) = { P(Y1, y2, Z = 0 | X1, x2) if Z = 0; P(Y1, y2, Z = 1 | X1, x2) if Z = 1 }

Rule 7 (Case-by-case reasoning):
P(Y1, y2, z | X1, x2) = P(Y1, y2, Z | X1, x2) |_{Z=z}

Rule 8 (Case-by-general reasoning (b)):
P(Y1, y2 | X1, x2, z) = P(Y1, y2 | X1, x2, Z) |_{Z=z}

Figure 3: Rules of CSI-calculus. The sets X1, X2, Y1, Y2, Z1 and Z2 are disjoint. We write w as shorthand for the explicit assignment W = w.

Rule 1 is directly the definition of context-specific independence, which includes conditional independence as a special case. Rule 1 can be applied in both directions: when the term on the left is identified, so is the term on the right and vice versa, provided that the separation condition is satisfied. Marginalization, conditioning and factorization from standard probability calculus are operationalized by rules 2–4, respectively. Rule 5 uses the law of total probability to obtain the probability of the complement. 
Rules 2–5 are applied from right to left: when the expressions on the right are identified, then so is the term on the left. Rule 5 is also valid when Y1 and Y2 are empty sets: in this case the rule should be understood as P(1 − z | X1, x2) = 1 − P(z | X1, x2). Rule 6 explicates that if we know the expression for each assignment Z = z, then we also know the expression without a specific assignment to Z. When rules 4–6 are applied, both distributions on the right-hand side must be known. Rules 7 and 8 formulate the fact that if an expression is known for all assignments to Z, it is also known for a specific assignment Z = z. For rules 5–8, it is assumed for convenience that Z is a singleton. This assumption does not restrict identifiability, since operations involving sets can be carried out by applying the rules to each member of the set sequentially. For identifiable queries, the formula in terms of the joint distribution P(W) is easily obtained by backtracking the chain of manipulations that resulted in identification.

Importantly, CSI-calculus includes the standard do-calculus of Fig. 2 as a special case.

Theorem 2. CSI-calculus subsumes do-calculus.

This means that any formula that is derivable with standard do-calculus over a DAG G (with latents) is also derivable using CSI-calculus over the LDAG formed by simply adding intervention nodes and labels as described in Section 4.1. After this augmentation, Rule 1 fully encompasses the three rules of do-calculus [23, 24]; this is shown in the proof of the theorem.

More importantly, the calculus of Fig. 3 can identify causal effects that are not identifiable with standard do-calculus. For the example of Fig. 1, the following formula can be obtained:

P(Y | do(X)) = P(Y | A = 0, X) P(A = 0) + P(Y | A = 1) P(A = 1).   (1)

A simple derivation of this formula using CSI-calculus is shown in Fig. 4. 
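Eq. 1 can be sanity-checked numerically. The sketch below builds a full joint distribution over Fig. 1(a) with P(X | A, L) taken from Fig. 1(b) and a P(Y | X, A, L) that ignores X whenever A = 1 (respecting the label AL = 1*); the values chosen for P(A), P(L) and P(Y | ·) are our own illustrative assumptions. The truncated-factorization value of P(Y | do(X)) then coincides with Eq. 1 evaluated from the observational joint, while the naive back-door adjustment over A does not:

```python
from itertools import product

pA = {0: 0.7, 1: 0.3}                     # illustrative P(A)
pL = {0: 0.6, 1: 0.4}                     # illustrative P(L)
pX1 = {(0, 0): 0.1, (0, 1): 0.1,          # P(X = 1 | A, L), Fig. 1(b)
       (1, 0): 0.5, (1, 1): 0.6}

def pY1(x, a, l):                         # P(Y = 1 | X, A, L); no X-dependence if A = 1
    return 0.2 + 0.5 * l if a == 1 else 0.1 + 0.3 * x + 0.4 * l

def joint(a, l, x, y):                    # observational P(A, L, X, Y)
    px = pX1[(a, l)] if x else 1 - pX1[(a, l)]
    py = pY1(x, a, l) if y else 1 - pY1(x, a, l)
    return pA[a] * pL[l] * px * py

def cond(y, given):                       # P(Y = y | given) from the joint
    sel = [(v, joint(*v)) for v in product([0, 1], repeat=4)
           if all(v['ALXY'.index(k)] == w for k, w in given.items())]
    return sum(p for v, p in sel if v[3] == y) / sum(p for _, p in sel)

x = y = 1
truth = sum(pA[a] * pL[l] * pY1(x, a, l) for a in (0, 1) for l in (0, 1))
eq1 = cond(y, {'A': 0, 'X': x}) * pA[0] + cond(y, {'A': 1}) * pA[1]
backdoor = sum(pA[a] * cond(y, {'A': a, 'X': x}) for a in (0, 1))
print(truth, eq1, backdoor)   # truth == eq1 == 0.512; backdoor differs
```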
Note that the back-door formula P(Y | do(X)) = Σ_A P(A) P(Y | A, X) is not valid here: conditioning on X when A = 1 biases Y through X ← L → Y.

[Figure 4 here: a derivation graph whose nodes are probability terms, starting from P(X, Y, A) and ending at P(Y | X, I_X = 1), with edges annotated by the applied rules R1–R7 and the used CSIs.]

Figure 4: A derivation of P(Y | do(X)) from P(X, Y, A) in the example of Fig. 1. The applied rules and CSIs are marked next to the edges connecting the terms. The identifying formula is Eq. 1.

5 A Search for Causal Effect Identification

In contrast to the setting of standard do-calculus, due to the formidable number of contexts and the causal structure being described by an arguably more complex graph formalism, applying the rules of CSI-calculus by hand is impossible (recall also Theorem 1 on NP-hardness). Hence, we follow the approach of [30, 18] and devise a forward search procedure over the rules of CSI-calculus that is able to automatically output identifying formulas and derivations such as that of Fig. 4.

However, for any instance, there is a vast number of terms that may end up being useful in identifying the query term; in fact, the derivation in Fig. 4 only shows the terms that were actually needed (in hindsight). For applying rule 1 we need to check a coNP-hard separation criterion, in contrast to the polynomial check of d-separation in the standard do-calculus setting. 
Hence, we focus here on how to efficiently evaluate separation criteria (Sec. 5.1), combine contexts (Sec. 5.2) and implement the heuristic search (Sec. 5.3) without weakening the theoretical properties (Sec. 5.4).

5.1 Implementing Separation Criteria

Rule 1 requires the evaluation of possibly non-local CSIs. Recall from Section 2 that CSI-separation is only a sufficient criterion; in practice it misses many of the important independence relations. For a feasible search procedure we need a sufficiently fast way to check a sufficient separation criterion. The following sufficient criterion is implemented in the search for this purpose.

Theorem 3. If there exists a set C such that Y ⊥⊥ Z | X, w, C is implied by an LDAG G and one of the following is also implied by G: (i) Y ⊥⊥ C | X, w, (ii) C ⊥⊥ Z | X, w, (iii) Y ⊥⊥ C | X, Z, w, or (iv) Z ⊥⊥ C | X, Y, w, then also Y ⊥⊥ Z | X, w is implied by G.

When a CSI statement Y ⊥⊥ Z | X, w is encountered by the search, the following procedure is applied. First, we verify whether the CSI is directly encoded in a label. If it is, we can stop, and if it is not, we continue by applying the CSI-separation criterion. If the CSI-separation criterion does not hold, we continue by attempting to find a set C that satisfies Y ⊥⊥ Z | X, w, c for all c ∈ val(C). Theorem 3 is then applied recursively to verify whether any of the required CSIs Y ⊥⊥ C | X, w, C ⊥⊥ Z | X, w, Y ⊥⊥ C | X, Z, w or Z ⊥⊥ C | X, Y, w holds in G. To guarantee that the recursion terminates, each variable can appear only once in each branch of the recursion. We further reduce the number of evaluated CSIs by caching them during the search.

5.2 Eliminating Redundant Contexts

The number of possible contexts increases exponentially with the number of variables. 
It is therefore important to determine which contexts should be considered when CSIs are evaluated. Different contexts often share the same context-specific DAG. We define the equivalence relation ~s as follows: s1 ~s s2 if and only if the context s1 specific DAG is the same as the context s2 specific DAG, where s1, s2 ∈ val(S). When evaluating the CSI Y ⊥⊥ Z | X, w, C of Theorem 3, we do not have to determine d-separation for every c ∈ val(C) and every w, c specific DAG. It suffices to restrict our attention to the context-specific DAGs given by the representatives of val(C)/~s.

Theorem 4. Let R be a set of representatives of val(C)/~s. If Y is CSI-separated from Z by X in the context w, c in G for all c ∈ R, then Y is CSI-separated from Z by X in the context w, c in G for all c ∈ val(C).

Algorithm 1
Input: Target Q = P(Y | do(X), Z), LDAG G and input I = {P(W)}.
Output: A formula F for Q in terms of P(W) or NA.
1: let U be the set of unexpanded terms, initially U := I.
2: for P′ ∈ U:
3:   let I* be the set of all distributions derived from P′ using the rules of Section 4.
4:   for each new candidate distribution P* ∈ I*, do
5:     if an additional input is required that is not in I, then continue.
6:     if the CSI relation of the current rule is not satisfied by G, then continue.
7:     if P* = Q, then derive a formula F for Q by backtracking and return F.
8:     Add P* to I, add P* to U.
9:   Mark P′ as expanded: remove P′ from U.
10: return NA.

The definition of the intervention nodes can also be used in this way. In general, an arbitrary context S = s can render a number of edges spurious in the LDAG. 
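The quotient val(C)/~s of Theorem 4 can be computed by grouping contexts according to the edge set of their context-specific DAG. A self-contained sketch (the label representation and the function names are our own illustration; the example LDAG is that of Fig. 1(d)):

```python
from itertools import product

def spurious(label, context):
    """The edge is spurious in `context` if every completion of the label
    variables left free by the context yields an assignment in the label."""
    if not label:
        return False
    lvars = sorted(label[0])
    free = [v for v in lvars if v not in context]
    for vals in product([0, 1], repeat=len(free)):
        a = {v: context[v] for v in lvars if v in context}
        a.update(zip(free, vals))
        if a not in label:
            return False
    return True

def cs_dag(edges, labels, context):
    """Edge set of the context-specific DAG."""
    return frozenset(e for e in edges
                     if not spurious(labels.get(e, []), context))

def representatives(cvars, edges, labels):
    """One representative context per equivalence class of val(C)/~s."""
    classes = {}
    for vals in product([0, 1], repeat=len(cvars)):
        c = dict(zip(cvars, vals))
        classes.setdefault(cs_dag(edges, labels, c), c)
    return list(classes.values())

# LDAG of Fig. 1(d)
edges = [('A', 'X'), ('L', 'X'), ('X', 'Y'), ('L', 'Y'), ('A', 'Y')]
labels = {('L', 'X'): [{'A': 0}],
          ('X', 'Y'): [{'A': 1, 'L': 0}, {'A': 1, 'L': 1}]}
reps = representatives(['A', 'L'], edges, labels)  # 4 contexts, 2 classes
```

On this LDAG, the four contexts over A, L collapse into two classes (A = 0 removes L → X, A = 1 removes X → Y), so d-separation needs to be checked in only two context-specific DAGs.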
However, if the context contains the assignment I_X = 1 for any node X, we know that every incoming edge of X except I_X → X is rendered spurious by definition, without requiring any further verification.

5.3 Implementing the Search

Algorithm 1 shows the pseudo-code which implements the calculus of Section 4 and is capable of solving problems that fall under the formulation of Section 3, through the use of a search heuristic and the elimination of redundant contexts. A single distribution is called a term, which is considered expanded once every valid manipulation has been performed on it.

The input distribution is marked as unexpanded on line 1 and iteration over the unexpanded terms begins on line 2. In order to guide the search to identify the most promising terms, we relate the identified distributions to the target Q through a heuristic proximity function and always expand the closest term in U first. Note that if we were to expand only the term closest to the target greedily, several identifiable instances would be left non-identified, because the identifying formulas and derivations are highly non-trivial. More details about the proximity function are given in the supplementary material. If multiple terms share the maximal value of the proximity function, the term that was identified first is selected. Next, the rules of Section 4 are applied to P′ and the derived candidate distributions are added to the set I* on line 3. Note that not every distribution in I* is necessarily identified at this point.

Iteration over the set I* begins on line 4. Here the candidate terms P* in I* that can be identified are added to the set I. Previously identified terms are not identified again. Line 5 verifies that both required terms are identified for rules 4–6. Line 6 applies Theorem 3 to check the required CSI relation for rule 1. 
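The control flow of Algorithm 1 can be sketched as a generic best-first search in which the rule machinery and the proximity heuristic are abstracted into callables; the names and the toy domain below are illustrative only, not the paper's implementation:

```python
import heapq

def forward_search(inputs, target, expand, proximity):
    """Best-first forward search: expand the identified term closest to the
    target, add newly derived terms, and backtrack a derivation on success."""
    identified = set(inputs)
    frontier = [(-proximity(t), i, t) for i, t in enumerate(inputs)]
    heapq.heapify(frontier)
    parent, tick = {}, len(inputs)
    while frontier:                                  # U in Algorithm 1
        _, _, term = heapq.heappop(frontier)
        for new in expand(term, identified):         # candidate set I*
            if new in identified:                    # already identified
                continue
            identified.add(new)
            parent[new] = term
            if new == target:                        # backtrack the derivation
                chain = [new]
                while chain[-1] in parent:
                    chain.append(parent[chain[-1]])
                return chain[::-1]
            heapq.heappush(frontier, (-proximity(new), tick, new))
            tick += 1                                # tie-break: first identified first
    return None                                     # NA: target not derivable

# Toy domain: "terms" are integers, the "rules" add 1 or double, target is 12.
deriv = forward_search(
    inputs=[1], target=12,
    expand=lambda t, known: [x for x in (t + 1, 2 * t) if x <= 12],
    proximity=lambda t: -abs(12 - t))
```

In the real search, the expand step applies rules 1–8 (checking the CSI condition of rule 1 via Theorem 3), and returning None corresponds to returning NA on line 10.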
Tests for d-separation are carried out via relevant path separation [7]. If all requirements are met, P* is identified either as the target on line 7 or as a new unexpanded distribution on line 8. Once all candidate distributions are processed, we mark P′ as expanded on line 9. Note that P′ can still appear as the second required term on line 5 when another term is being expanded. Finally, if the target was not identified and the set of unexpanded distributions was exhausted, we deem the target non-identifiable by the search and return NA on line 10.

5.4 Theoretical Properties

The formulated search is sound in the following sense.

Theorem 5 (Soundness). Algorithm 1 always terminates; if it returns an expression, it is correct.

In the setting of standard do-calculus, where no labels are present (in addition to those defining interventions), the search is complete for (conditional) causal effect identifiability. This is because the separation condition is general enough to capture all conditional independences used by do-calculus, as shown by Theorem 2.

[Figure 5 here.]

Figure 5: (a) Running times of Algorithm 1. Full CS is a naive version which does not combine contexts. (b) Time usage of each rule, with error bars showing the standard error.

6 Experiments and Examples

We implemented the search in C++ and the code is available in the R package dosearch on CRAN [31]. First we present a simulation study on the search and then show a host of examples where identifiability can be shown with our approach. The experiments were performed on a modern desktop computer (single thread, Intel Core i7-4790, 3.4 GHz).

We considered DAGs with n = 7, 8, 9 nodes, with 100 DAGs for each n. The edges of the DAGs were sampled randomly with an average degree of 3. We sampled labels on the edges (local CSIs) with probability 0.5. 
Two of the nodes were considered latent and the aim was to determine whether P(Y | do(X)) can be identified. Fig. 5(a) shows the running times of Algorithm 1 with a 30 minute timeout. The search times when all contexts are considered separate (i.e., when the terms have fixed assigned values for all variables) are included as a baseline (Full CS). Using terms that combine assignments, as formulated in CSI-calculus, considerably speeds up the execution times.

In the same simulation, we examined the effect of applying the individual rules on the total running times, as shown in Fig. 5(b). Rules 1 and 4 dominate the running time. For rule 1, considerable time is spent on checking whether the conditional independence constraints hold (recall that this step is also (co)NP-hard). Rule 4 combines two previously identified terms, and therefore a single term may help to identify further terms in a large number of ways.

Importantly, the search implementing CSI-calculus can prove the identifiability of P(Y | do(X)) for the LDAGs in Fig. 6, which would be non-identifiable otherwise via standard do-calculus. Non-identifiability can be verified by running ID on the underlying DAGs without labels, or by noting that each graph includes a hedge. In Fig. 6(a), P(Y | do(X)) = P(Y | X, W = 1). Intuitively, the node W acts similarly to an intervention node and hence conditioning on W = 1 eliminates the back-door path. In Fig. 6(b), P(Y | do(X)) = P(Y), because X and Y are independent when X is intervened on, due to the labels. In Fig. 6(c), P(Y | do(X)) = P(Y | Z = 0, X) P(Z = 0) + P(Y | Z = 1) P(Z = 1); adjusting for Z is needed, which opens up a new d-connecting path through H and Q. Fortunately, when Z = 0 there is no confounding path, and when Z = 1 there is a confounding path but no directed path from Z. In Fig. 
6(d), the causal effect is identifiable and the output by Algorithm 1 is:

P(Y | do(X)) = P(A = 1) Σ_W P(Y | X, W, A = 1) P(W | A = 1)
             + P(A = 0) Σ_Z P(Z | X, A = 0) Σ_X′ P(Y | X′, Z, A = 0) P(X′ | A = 0).

When A = 1, the first term resembles the back-door formula, adjusting for W. When A = 0, the second term resembles the front-door formula through Z. Since A ⊥⊥ X, I_X in the LDAG, we are able to combine the formulas. In Fig. 6(e), when A = 0, the confounding path from Y to I_X vanishes, allowing for a back-door type formula P(Y | do(X)) = Σ_Z P(Z | A = 0) P(Y | X, Z, A = 0).

Figure 6: LDAGs such that P(Y | do(X)) is identifiable using CSIs, but not with standard do-calculus.

7 Discussion and Conclusion

In this paper, we considered causal effect identifiability in the presence of context-specific independence relations, which commonly arise from causal mechanisms over discrete variables. We formalized the problem employing LDAGs, showed that deciding causal effect non-identifiability is NP-hard when CSIs are present, developed a calculus, and designed a readily usable automatic search procedure for finding identifying formulas. 
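Identities of the kind derived above for Fig. 6(a), P(Y | do(X)) = P(Y | X, W = 1), can be sanity-checked numerically on a toy binary model. The sketch below is our own construction, not taken from the paper: the CPT values are arbitrary, and only the CSI pattern matters, namely that P(X | Z, W = 1) does not depend on Z while Z confounds X and Y.

```python
from itertools import product

# Toy binary SCM matching the pattern of Fig. 6(a): Z confounds X and Y,
# W is a parent of X, and in the context W = 1 the dependence of X on Z
# vanishes (a local CSI). All CPT values are arbitrary choices.
p_z = {0: 0.6, 1: 0.4}
p_w = {0: 0.7, 1: 0.3}
p_x = {  # p_x[(z, w)][x] = P(X = x | Z = z, W = w)
    (0, 0): {0: 0.9, 1: 0.1},
    (1, 0): {0: 0.2, 1: 0.8},
    (0, 1): {0: 0.5, 1: 0.5},  # the two rows for w = 1 coincide:
    (1, 1): {0: 0.5, 1: 0.5},  # X is independent of Z given W = 1 (the CSI)
}
p_y = {  # p_y[(x, z)][y] = P(Y = y | X = x, Z = z)
    (0, 0): {0: 0.8, 1: 0.2},
    (0, 1): {0: 0.3, 1: 0.7},
    (1, 0): {0: 0.6, 1: 0.4},
    (1, 1): {0: 0.1, 1: 0.9},
}

def joint(z, w, x, y):
    # observational joint via the factorization of the DAG
    return p_z[z] * p_w[w] * p_x[(z, w)][x] * p_y[(x, z)][y]

def p_y_do_x(y, x):
    # interventional distribution by truncated factorization:
    # drop the term P(x | z, w) and marginalize over z
    return sum(p_z[z] * p_y[(x, z)][y] for z in (0, 1))

def p_y_given_x_w1(y, x):
    # plain conditional P(Y = y | X = x, W = 1) from the observational joint
    num = sum(joint(z, 1, x, y) for z in (0, 1))
    den = sum(joint(z, 1, x, yy) for z in (0, 1) for yy in (0, 1))
    return num / den

# The CSI-based identification formula holds: P(Y | do(X)) = P(Y | X, W = 1)
for x, y in product((0, 1), repeat=2):
    assert abs(p_y_do_x(y, x) - p_y_given_x_w1(y, x)) < 1e-12
```

Conditioning on W = 1 works here exactly because, in that context, X no longer depends on the confounder Z, so W = 1 plays the role of an intervention node.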
We showed that with only a few additional CSIs, our approach may enable identifiability in previously non-identifiable cases.

Currently, we are at the level of a calculus and a search procedure over the calculus. Although the presented rules and the search are sound, completeness results are harder to obtain. Even though the general decision problem is NP-hard, one could consider applying the polynomial-time ID algorithm over context-specific DAGs and then combining the results in order to obtain a complete decision procedure. However, the following theorem shows that identifiability in context-specific DAGs is not a direct indicator of general identifiability.

Theorem 6. Causal effect P(Y | do(X)) may be non-identifiable from P(W) even if P(Y | do(X)) is identifiable in the context-s-specific DAGs for every s ∈ val(S), or if P(Y | do(X), s) is identifiable in the context-s-specific DAGs for every s ∈ val(S), where S contains only observed variables.

Hence, further research is needed to establish, if possible, a theory similar to that of do-calculus, which resulted in completeness proofs through hedges and the ID and IDC algorithms [28, 17, 27]. The generalization to categorical variables is mostly immediate, but designing a feasible search procedure is certainly an additional challenge. As such, the presented approach can already leverage interventional distributions [3] by modifying the set of inputs I of Algorithm 1.

We believe our approach using CSIs will have an impact on a variety of related problems. We would like to use our approach to solve cases of transportability, selection bias and missing data problems [4, 5, 21]. 
The methodology presented is likely to render causal effects and distributions identifiable also in these problems, provided that CSI relations are present.

Acknowledgments

ST was supported by Academy of Finland grant 311877 (Decision analytics utilizing causal models and multiobjective optimization). AH was supported by Academy of Finland grant 295673.

References

[1] G. Aleksandrowicz, H. Chockler, J. Y. Halpern, and A. Ivrii. The computational complexity of structure-based causality. Journal of Artificial Intelligence Research, 58:431–451, 2017.

[2] Y. Barash and N. Friedman. Context-specific Bayesian clustering for gene expression data. Journal of Computational Biology, 9(2):169–191, 2002.

[3] E. Bareinboim and J. Pearl. Causal inference by surrogate experiments: z-identifiability. In N. de Freitas and K. Murphy, editors, Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pages 113–120. AUAI Press, 2012.

[4] E. Bareinboim and J. Pearl. Transportability from multiple environments with limited experiments: Completeness results. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pages 280–288, 2014.

[5] E. Bareinboim and J. Tian. Recovering causal effects from selection bias. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 3475–3481, 2015.

[6] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Proceedings of the 12th International Conference on Uncertainty in Artificial Intelligence, pages 115–123. Morgan Kaufmann, 1996.

[7] C. J. Butz, A. E. dos Santos, and J. S. Oliveira. Relevant path separation: A faster method for testing independencies in Bayesian networks. In 8th International Conference on Probabilistic Graphical Models, pages 74–85, 2016.

[8] M. Chavira and A. Darwiche. 
On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6-7):772–799, 2008.

[9] D. M. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. In 13th International Conference on Uncertainty in Artificial Intelligence, pages 80–89. Morgan Kaufmann, 1997.

[10] G. F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2):393–405, 1990.

[11] J. Corander, A. Hyttinen, J. Kontinen, J. Pensar, and J. Väänänen. A logical approach to context-specific independence. Annals of Pure and Applied Logic, 170(9):975–992, 2019.

[12] G. H. Dal, A. W. Laarman, and P. J. F. Lucas. Parallel probabilistic inference by weighted model counting. In Proceedings of Machine Learning Research – Volume 72, pages 97–108. PMLR, 2018.

[13] A. P. Dawid. Influence diagrams for causal modelling and inference. International Statistical Review, 70(2):161–189, 2002.

[14] D. Galles and J. Pearl. Testing identifiability of causal effects. In Proceedings of the 11th Annual Conference on Uncertainty in Artificial Intelligence, pages 185–195. Morgan Kaufmann, 1995.

[15] B. Georgi, J. Schultz, and A. Schliep. Context-specific independence mixture modelling for protein families. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 79–90. Springer, 2007.

[16] J. Y. Halpern. Axiomatizing causal reasoning. Journal of Artificial Intelligence Research, 12:317–337, 2000.

[17] Y. Huang and M. Valtorta. Identifiability in causal Bayesian networks: a sound and complete algorithm. In Proceedings of the 21st National Conference on Artificial Intelligence – Volume 2, pages 1149–1154. AAAI Press, 2006.

[18] A. Hyttinen, F. Eberhardt, and M. Järvisalo. 
Do-calculus when the true graph is unknown. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, pages 395–404. AUAI Press, 2015.

[19] A. Hyttinen, J. Pensar, J. Kontinen, and J. Corander. Structure learning for Bayesian networks over labeled DAGs. In Proceedings of Machine Learning Research – Volume 72, pages 133–144. PMLR, 2018.

[20] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[21] K. Mohan, J. Pearl, and J. Tian. Graphical models for inference with missing data. In Advances in Neural Information Processing Systems, volume 26, pages 1277–1285, 2013.

[22] H. Nyman, J. Pensar, T. Koski, and J. Corander. Stratified graphical models – context-specific independence in graphical models. Bayesian Analysis, 9(4):883–908, 2014.

[23] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.

[24] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, second edition, 2009.

[25] J. Pensar, H. J. Nyman, T. Koski, and J. Corander. Labeled directed acyclic graphs: a generalization of context-specific independence in directed graphical models. Data Mining and Knowledge Discovery, 29(2):503–533, 2015.

[26] S. E. Shimony. Explanation, irrelevance, and statistical independence. In Proceedings of the 9th National Conference on Artificial Intelligence – Volume 1, pages 482–487. AAAI Press, 1991.

[27] I. Shpitser and J. Pearl. Identification of conditional interventional distributions. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pages 437–444. AUAI Press, 2006.

[28] I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. 
In Proceedings of the 21st National Conference on Artificial Intelligence – Volume 2, pages 1219–1226. AAAI Press, 2006.

[29] I. Shpitser and J. Pearl. Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9:1941–1979, 2008.

[30] S. Tikka, A. Hyttinen, and J. Karvanen. Causal effect identification from multiple incomplete data sources: A general search-based approach. https://arxiv.org/abs/1902.01073, 2019.

[31] S. Tikka, A. Hyttinen, and J. Karvanen. dosearch: Causal Effect Identification from Multiple Incomplete Data Sources, 2019. R package version 1.0.3.

[32] S. Tikka and J. Karvanen. Identifying causal effects with the R package causaleffect. Journal of Statistical Software, 76(12):1–30, 2017.

[33] S. Visscher, P. Lucas, I. Flesch, and K. Schurink. Using temporal context-specific independence information in the exploratory analysis of disease processes. In Conference on Artificial Intelligence in Medicine in Europe, pages 87–96. Springer, 2007.