{"title": "Causal Inference through a Witness Protection Program", "book": "Advances in Neural Information Processing Systems", "page_first": 298, "page_last": 306, "abstract": "One of the most fundamental problems in causal inference is the estimation of a causal effect when variables are confounded. This is difficult in an observational study because one has no direct evidence that all confounders have been adjusted for. We introduce a novel approach for estimating causal effects that exploits observational conditional independencies to suggest ``weak'' paths in a unknown causal graph. The widely used faithfulness condition of Spirtes et al. is relaxed to allow for varying degrees of ``path cancellations'' that will imply conditional independencies but do not rule out the existence of confounding causal paths. The outcome is a posterior distribution over bounds on the average causal effect via a linear programming approach and Bayesian inference. We claim this approach should be used in regular practice to complement other default tools in observational studies.", "full_text": "Causal Inference through a\nWitness Protection Program\n\nRicardo Silva\n\nDepartment of Statistical Science and CSML\n\nUniversity College London\n\nricardo@stats.ucl.ac.uk\n\nRobin Evans\n\nDepartment of Statistics\n\nUniversity of Oxford\n\nevans@stats.ox.ac.uk\n\nAbstract\n\nOne of the most fundamental problems in causal inference is the estimation of a\ncausal effect when variables are confounded. This is dif\ufb01cult in an observational\nstudy because one has no direct evidence that all confounders have been adjusted\nfor. We introduce a novel approach for estimating causal effects that exploits\nobservational conditional independencies to suggest \u201cweak\u201d paths in a unknown\ncausal graph. The widely used faithfulness condition of Spirtes et al. 
is relaxed to allow for varying degrees of "path cancellations" that will imply conditional independencies but do not rule out the existence of confounding causal paths. The outcome is a posterior distribution over bounds on the average causal effect via a linear programming approach and Bayesian inference. We claim this approach should be used in regular practice to complement other default tools in observational studies.

1 Contribution

We provide a new methodology to bound the average causal effect (ACE) of a variable X on a variable Y. For binary variables, the ACE is defined as

E[Y | do(X = 1)] − E[Y | do(X = 0)] = P(Y = 1 | do(X = 1)) − P(Y = 1 | do(X = 0)),  (1)

where do(·) is the operator of Pearl [14], denoting distributions where a set of variables has been intervened upon by an external agent. In the interest of space, we assume the reader is familiar with the concept of causal graphs, the basics of the do operator, and the basics of causal discovery algorithms such as the PC algorithm of Spirtes et al. [22]. We provide a short summary for context in Section 2.

The ACE is in general not identifiable from observational data. We obtain upper and lower bounds on the ACE by exploiting a set of (binary) covariates, which we also assume are not effects of X or Y (justified by temporal ordering or other background assumptions). Such covariate sets are often found in real-world problems, and form the basis of most observational studies done in practice [21]. However, it is not obvious how to obtain the ACE as a function of the covariates. Our contribution modifies the results of Entner et al. [6], who exploit conditional independence constraints to obtain point estimates of the ACE, but give point estimates relying on assumptions that might be unstable in practice.
Our modification provides a different interpretation of their search procedure, which we use to generate candidate instrumental variables [11]. The linear programming approach of Dawid [5] and Ramsahai [16] is then modified to generate bounds on the ACE by introducing constraints on some causal paths, motivated as relaxations of [6]. The new setup can be computationally expensive, so we introduce further relaxations to the linear program to generate novel symbolic bounds, and a fast algorithm that sidesteps the full linear programming optimization with some simple, message-passing-like, steps.

Figure 1: (a) A generic causal graph where X and Y are confounded by some U. (b) The same system in (a) where X is intervened upon by an external agent. (c) A system where W and Y are independent given X. (d) A system where it is possible to use faithfulness to discover that U is sufficient to block all back-door paths between X and Y. (e) Here, U itself is not sufficient.

Section 2 introduces the background of the problem and Section 3 our methodology. Section 4 discusses an analytical approximation of the main results, and a way by which this provides scaling-up possibilities for the approach. Section 5 contains experiments with synthetic and real data.

2 Background: Instrumental Variables, Witnesses and Admissible Sets

Assuming X is a potential cause of Y, but not the opposite, a cartoon of the causal system containing X and Y is shown in Figure 1(a). U represents the universe of common causes of X and Y.
In control and policy-making problems, we would like to know what happens to the system when the distribution of X is overridden by some external agent (e.g., a doctor, a robot or an economist). The resulting modified system is depicted in Figure 1(b), and represents the family of distributions indexed by do(X = x): the graph in (a) has undergone a "surgery" that wipes out edges, as originally discussed by [22] in the context of graphical models. Notice that if U is observed in the dataset, then we can obtain the distribution P(Y = y | do(X = x)) by simply calculating Σ_u P(Y = y | X = x, U = u)P(U = u) [22]. This was popularized by [14] as the back-door adjustment. In general P(Y = y | do(X = x)) can be vastly different from P(Y = y | X = x).

The ACE is simple to estimate in a randomized trial: this follows from estimating the conditional distribution of Y given X under data generated as in Figure 1(b). In contrast, in an observational study [21] we obtain data generated by the system in Figure 1(a). If one believes all relevant confounders U have been recorded in the data then back-door adjustment can be used, though such completeness is uncommon. By postulating knowledge of the causal graph relating components of U, one can infer whether a measured subset of the causes of X and Y is enough [14, 23, 15]. Without knowledge of the causal graph, assumptions such as faithfulness [22] are used to infer it. The faithfulness assumption states that a conditional independence constraint in the observed distribution exists if and only if a corresponding structural independence exists in the underlying causal graph. For instance, observing the independence W ⊥⊥ Y | X, and assuming faithfulness and the causal order, we can infer the causal graph of Figure 1(c); in all the other graphs this conditional independence is not implied.
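To make the back-door adjustment concrete, here is a minimal sketch (ours, with made-up numbers, not from the paper) contrasting the back-door formula with the naive conditional for binary U, X, Y:

```python
# Back-door adjustment vs. the naive conditional, for binary U, X, Y.
# p[(u, x, y)] = P(U=u, X=x, Y=y); U is assumed to block all back-door paths.
# All numbers are hypothetical.

p = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.10, (0, 1, 0): 0.05, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.05, (1, 1, 0): 0.10, (1, 1, 1): 0.40,
}

def p_u(u):
    return sum(p[(u, x, y)] for x in (0, 1) for y in (0, 1))

def p_y_given_xu(y, x, u):
    return p[(u, x, y)] / (p[(u, x, 0)] + p[(u, x, 1)])

def p_y_do_x(y, x):
    # Back-door formula: sum_u P(Y=y | X=x, U=u) P(U=u).
    return sum(p_y_given_xu(y, x, u) * p_u(u) for u in (0, 1))

def p_y_given_x(y, x):
    # Naive conditional P(Y=y | X=x), biased here because U confounds X and Y.
    return sum(p[(u, x, y)] for u in (0, 1)) / \
           sum(p[(u, x, yy)] for u in (0, 1) for yy in (0, 1))

ace = p_y_do_x(1, 1) - p_y_do_x(1, 0)          # about 0.247
naive = p_y_given_x(1, 1) - p_y_given_x(1, 0)  # 0.375: confounding inflates it
```

In this toy joint the naive contrast overstates the causal effect, which is the gap the adjustment is designed to close.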
We deduce that no unmeasured confounders between X and Y exist. This simple procedure for identifying chains W → X → Y is useful in exploratory data analysis [4], where a large number of possible causal relations X → Y are unquantified but can be screened using observational data before experiments are performed. The idea of using faithfulness is to be able to sometimes identify such quantities.

Entner et al. [6] generalize the discovery of chain models to situations where a non-empty set of covariates is necessary to block all back-doors. Suppose W is a set of covariates which are known not to be effects of either X or Y, and we want to find an admissible set contained in W: a set of observed variables which we can use for back-door adjustment to get P(Y = y | do(X = x)). Entner's "Rule 1" states the following:

Rule 1: If there exists a variable W ∈ W and a set Z ⊆ W\{W} such that:

(i) W ⊥̸⊥ Y | Z    (ii) W ⊥⊥ Y | Z ∪ {X},

then infer that Z is an admissible set.

A point estimate of the ACE can then be found using Z. Given that (W, Z) satisfies¹ Rule 1, we call W a witness for the admissible set Z. The model in Figure 1(c) can be identified with Rule 1, where W is the witness and Z = ∅. In this case, a so-called Naïve Estimator² P(Y = 1 | X = 1) − P(Y = 1 | X = 0) will provide the correct ACE. If U is observable in Figure 1(d), then it can be identified as an admissible set for witness W. Notice that in Figure 1(a), taking U as a scalar, it is not possible to find a witness since there are no remaining variables. Also, if in Figure 1(e) our covariate set W is {W, U}, then no witness can be found since U′ cannot be blocked.
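As a toy illustration (ours, not the paper's implementation), Rule 1 with Z = ∅ can be checked mechanically on the chain W → X → Y of Figure 1(c), using exact probabilities with hypothetical parameters; a finite-sample version would replace the exact checks with conditional independence tests:

```python
# Checking Entner's Rule 1 with Z = {} on the chain W -> X -> Y of Figure 1(c),
# using exact probabilities (hypothetical parameter values) instead of a
# finite-sample independence test.

P_W1 = 0.5
P_X1_GIVEN_W = {0: 0.2, 1: 0.8}   # P(X=1 | W=w)
P_Y1_GIVEN_X = {0: 0.3, 1: 0.9}   # P(Y=1 | X=x)

def joint(w, x, y):
    pw = P_W1 if w == 1 else 1 - P_W1
    px = P_X1_GIVEN_W[w] if x == 1 else 1 - P_X1_GIVEN_W[w]
    py = P_Y1_GIVEN_X[x] if y == 1 else 1 - P_Y1_GIVEN_X[x]
    return pw * px * py

def w_dep_y():
    # Condition (i): W and Y are marginally dependent.
    pw1 = sum(joint(1, x, y) for x in (0, 1) for y in (0, 1))
    py1 = sum(joint(w, x, 1) for w in (0, 1) for x in (0, 1))
    pw1y1 = sum(joint(1, x, 1) for x in (0, 1))
    return abs(pw1y1 - pw1 * py1) > 1e-12

def w_indep_y_given_x():
    # Condition (ii): W independent of Y given X, checked cell by cell.
    for x in (0, 1):
        px = sum(joint(w, x, y) for w in (0, 1) for y in (0, 1))
        for w in (0, 1):
            for y in (0, 1):
                lhs = joint(w, x, y) / px
                rhs = (sum(joint(w, x, yy) for yy in (0, 1)) / px) * \
                      (sum(joint(ww, x, y) for ww in (0, 1)) / px)
                if abs(lhs - rhs) > 1e-12:
                    return False
    return True

# Both premises hold, so Rule 1 declares Z = {} admissible (W is the witness).
rule1_fires = w_dep_y() and w_indep_y_given_x()
```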
Hence, it is possible for a procedure based on Rule 1 to answer "I don't know whether an admissible set exists" even when a back-door adjustment would be possible if one knew the causal graph. However, using the faithfulness assumption alone one cannot do better: Rule 1 is complete for non-zero effects without more information [6].

Despite its appeal, the faithfulness assumption is not without difficulties. Even if unfaithful distributions can be ruled out as pathological under seemingly reasonable conditions [13], distributions which lie close to (but not on) a simpler model may in practice be indistinguishable from distributions within that simpler model at finite sample sizes. To appreciate these complications, consider the structure in Figure 1(d) with U unobservable. Here W is randomized but X is not, and we would like to know the ACE of X on Y.³ W is sometimes known as an instrumental variable (IV), and we call Figure 1(d) the standard IV structure; if this structure is known, optimal bounds L_IV ≤ ACE ≤ U_IV can be obtained without further assumptions, using only observational data over the binary variables W, X and Y [1]. There exist distributions faithful to the IV structure but which at finite sample sizes may appear to satisfy the Markov property for the structure W → X → Y; in practice this can occur at any finite sample size [20]. The true average causal effect may lie anywhere in the interval [L_IV, U_IV] (which can be rather wide), and may differ considerably from the naïve estimate appropriate for the simpler structure.
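For intuition on what interval bounds look like, the following sketch (ours, with hypothetical numbers) computes the simple assumption-free bounds P(Y=1, X=x | z) ≤ P(Y=1 | do(X=x), z) ≤ P(Y=1, X=x | z) + P(X ≠ x | z) within each covariate stratum and averages them over P(Z). These are not the tighter IV bounds of [1], but the condition-then-average pattern is the same one used later in the paper:

```python
# Sketch of the "condition on each z, bound, then average over P(Z)" pattern,
# using the simple assumption-free bounds on P(Y=1 | do(X=x), z) as a stand-in
# for the IV bounds of [1]. All numbers are hypothetical.

def ace_bounds_within_stratum(pyx):
    # pyx[(y, x)] = P(Y=y, X=x | Z=z); returns (lower, upper) on the ACE at z.
    px1 = pyx[(0, 1)] + pyx[(1, 1)]
    px0 = 1.0 - px1
    low1, up1 = pyx[(1, 1)], pyx[(1, 1)] + px0   # bounds on P(Y=1 | do(X=1), z)
    low0, up0 = pyx[(1, 0)], pyx[(1, 0)] + px1   # bounds on P(Y=1 | do(X=0), z)
    return low1 - up0, up1 - low0

def averaged_bounds(strata):
    # strata: list of (P(Z=z), pyx) pairs; average per-stratum bounds over P(Z).
    lo = sum(pz * ace_bounds_within_stratum(pyx)[0] for pz, pyx in strata)
    hi = sum(pz * ace_bounds_within_stratum(pyx)[1] for pz, pyx in strata)
    return lo, hi

# Hypothetical observed tables for two strata of Z:
strata = [
    (0.6, {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}),
    (0.4, {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.5}),
]
lo, hi = averaged_bounds(strata)   # an interval, not a point estimate
```

These assumption-free bounds always have width 1, which is why the extra structure of an instrument (or of the relaxations introduced below) is needed to say anything sharper.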
While we emphasize that this is a 'worst-case scenario' analysis and by itself should not rule out faithfulness as a useful assumption, it is desirable to provide a method that gives greater control over violations of faithfulness.

3 Methodology: the Witness Protection Program

The core of our idea is (i) to invert the usage of Entner's Rule 1, so that pairs (W, Z) should provide an instrumental variable bounding method instead of a back-door adjustment; (ii) to express violations of faithfulness as bounded violations of local independence; and (iii) to find bounds on the ACE using a linear programming formulation.

Let (W, Z) be any pair found by a search procedure that decides when Rule 1 holds. W will play the role of an instrumental variable, instead of being discarded. A standard IV bounding procedure such as [1] can be used conditional on each individual value z of Z, then averaged over P(Z). The lack of an edge W → Y given Z can be justified by faithfulness (as W ⊥⊥ Y | {X, Z}). For the same reason, there might be no (conditional) dependence between W and a possible unmeasured common parent of X and Y. However, assuming faithfulness itself is not interesting, as a back-door adjustment could be directly obtained. Allowing unconstrained dependencies induced by edges W → Y and (W, U) (in any direction) is also a non-starter, as all bounds will be vacuous [16].

Consider instead the (partial) parameterization in Table 1 of the joint distribution of {W, X, Y, U}, where U is latent and not necessarily a scalar. For simplicity of presentation, assume we are conditioning everywhere on a particular value z of Z, which we suppress from our notation as this will not be crucial to developments in this section.
Under this notation, the ACE is given by

η_{11}P(W = 1) + η_{10}P(W = 0) − η_{01}P(W = 1) − η_{00}P(W = 0).  (2)

¹The work in [6] also aims at identifying zero effects with a "Rule 2". For simplicity we assume that the effect of interest was already identified as non-zero.

²Sometimes we use the word "estimator" to mean a functional of the probability distribution instead of a statistical estimator that is a function of samples of this distribution. Context should make it clear when we refer to an actual statistic or a functional.

³A classical example is in non-compliance: suppose W is the assignment of a patient to either drug or placebo, X is whether the patient actually took the medicine or not, and Y is a measure of health status. The doctor controls W but not X. This problem is discussed by [14] and [5].

ζ*_{yx.w} ≡ P(Y = y, X = x | W = w, U)
ζ_{yx.w} ≡ Σ_U P(Y = y, X = x | W = w, U)P(U | W = w) = P(Y = y, X = x | W = w)

η*_{xw} ≡ P(Y = 1 | X = x, W = w, U)
η_{xw} ≡ Σ_U P(Y = 1 | X = x, W = w, U)P(U | W = w) = P(Y = 1 | do(X = x), W = w)

δ*_w ≡ P(X = 1 | W = w, U)
δ_w ≡ Σ_U P(X = 1 | W = w, U)P(U | W = w) = P(X = 1 | W = w).

Table 1: A partial parameterization of a causal DAG model over some {U, W, X, Y}. Notice that such parameters cannot be functionally independent, and this is precisely what we will exploit.

We now introduce the following assumptions,

|η*_{x1} − η*_{x0}| ≤ ε_w  (3)
|η*_{xw} − P(Y = 1 | X = x, W = w)| ≤ ε_y  (4)
|δ*_w − P(X = 1 | W = w)| ≤ ε_x  (5)
βP(U) ≤ P(U | W = w) ≤ β̄P(U).  (6)

Setting ε_w = 0, β = β̄ = 1 recovers the standard IV structure.
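A small helper (ours, with made-up observed conditionals) showing how (4) and (5) act as box constraints: each latent-level parameter is confined to an interval of radius ε around its observable counterpart, clipped to [0, 1]:

```python
# How the relaxations (4) and (5) become box constraints: each latent-conditional
# parameter must lie within eps of its observable counterpart, clipped to [0, 1].
# Observed conditionals below are hypothetical.

def box(p_obs, eps):
    # Interval of allowed values for a parameter tied to the observable p_obs.
    return max(0.0, p_obs - eps), min(1.0, p_obs + eps)

# Observed P(Y=1 | X=x, W=w) and P(X=1 | W=w) (made-up numbers):
p_y1 = {(0, 0): 0.30, (0, 1): 0.35, (1, 0): 0.70, (1, 1): 0.75}  # keyed by (x, w)
p_x1 = {0: 0.40, 1: 0.60}                                        # keyed by w

eps_y, eps_x = 0.2, 0.2
eta_box = {xw: box(p, eps_y) for xw, p in p_y1.items()}    # constraint (4)
delta_box = {w: box(p, eps_x) for w, p in p_x1.items()}    # constraint (5)
```

With eps = 1 each box is the whole of [0, 1], recovering the unconstrained case; shrinking eps shrinks the feasible region that the linear program below has to search.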
Further assuming ε_y = ε_x = 0 recovers the chain structure W → X → Y. Deviation from these values corresponds to a violation of faithfulness, as the premises of Rule 1 can only be satisfied by enforcing functional relationships among the conditional probability tables of each vertex. Using this parameterization in the case ε_y = ε_x = 1, β = β̄ = 1, Ramsahai [16], extending [5], used the following linear programming formulation to obtain bounds on the ACE (for now, assume that ζ_{yx.w} and P(W = w) are known constants):

1. There is a 4-dimensional polytope where the parameters {η*_{xw}} can take values: for ε_w = ε_y = 1, this is the unit hypercube [0, 1]⁴. Find the extreme points of this polytope (up to 12 points for the case where ε_w > 0). Do the same for {δ*_w}.

2. Find the extreme points of the joint space {ζ*_{yx.w}} by mapping them from the points in {δ*_w} × {η*_{xw}}, since ζ*_{yx.w} = (δ*_w)^x (1 − δ*_w)^{(1−x)} η*_{xw}.

3. Using the extreme points of the 12-dimensional joint space {ζ*_{yx.w}} × {η*_{xw}}, find the dual polytope of this space in terms of linear inequalities. Points in this polytope are convex combinations of {ζ*_{yx.w}} × {η*_{xw}}, shown by [5] to correspond to the marginalization over some arbitrary P(U). This results in constraints over {ζ_{yx.w}} × {η_{xw}}.

4.
Maximize/minimize (2) with respect to {η_{xw}} subject to the constraints found in Step 3 to obtain upper/lower bounds on the ACE.

Allowing for the case where ε_x < 1 or ε_y < 1 is just a matter of changing the first step, where box constraints are set on each individual parameter as a function of the known P(Y = y, X = x | W = w), prior to the mapping in Step 2. The resulting constraints are now implicitly non-linear in P(Y = y, X = x | W = w), but at this stage this does not matter as they are treated as constants. To allow for the case β < 1 < β̄, use exactly the same procedure, but substitute every occurrence of ζ_{yx.w} in the constraints by κ_{yx.w} ≡ Σ_U ζ*_{yx.w}P(U); notice the difference between κ_{yx.w} and ζ_{yx.w}. Likewise, substitute every occurrence of η_{xw} in the constraints by ω_{xw} ≡ Σ_U η*_{xw}P(U). Instead of plugging in constants for the values of κ_{yx.w} and turning the crank of a linear programming solver, we first treat {κ_{yx.w}} (and {ω_{xw}}) as unknowns, linking them to observables and η_{xw} by the constraints ζ_{yx.w}/β̄ ≤ κ_{yx.w} ≤ ζ_{yx.w}/β, Σ_{yx} κ_{yx.w} = 1 and η_{xw}/β̄ ≤ ω_{xw} ≤ η_{xw}/β. Finally, the method can be easily implemented using a package such as Polymake (http://www.polymake.org) or SCDD for R. More details are given in the Supplemental Material.

In this paper, we will not discuss in detail how to choose the free parameters of the relaxation.
Any choice of ε_w ≥ 0, ε_y ≥ 0, ε_x ≥ 0, 0 ≤ β ≤ 1 ≤ β̄ is guaranteed to provide bounds that are at least as conservative as the back-door adjusted point estimator of [6], which is always covered by the bounds. Background knowledge, after a user is suggested a witness and admissible set, can be used here. In Section 5 we experiment with a few choices of default parameters. To keep focus, in what follows we will discuss only computational aspects. We develop a framework for choosing relaxation parameters in the Supplemental, and expect to extend it in follow-up publications.

input: Binary data matrix D; set of relaxation parameters θ; covariate index set W; cause-effect indices X and Y
output: A list of pairs (witness, admissible set) contained in W
L ← ∅;
for each W ∈ W do
    for every admissible set Z ⊆ W\{W} identified by W and θ given D do
        B ← posterior over upper/lower bounds on the ACE as given by (W, Z, X, Y, D, θ);
        if there is no evidence in B to falsify the (W, Z, θ) model then
            L ← L ∪ {B};
        end
    end
end
return L

Algorithm 1: The outline of the Witness Protection Program algorithm.

As the approach provides the witness a degree of protection against faithfulness violations, using a linear program, we call this framework the Witness Protection Program (WPP).

3.1 Bayesian Learning

The previous section treated ζ_{yx.w} and P(W = w) as known. A common practice is to replace them by plug-in estimators (and in the case of a non-empty admissible set Z, an estimate of P(Z) is also necessary). Such models can also be falsified, as the constraints generated are typically only supported by a strict subset of the probability simplex.
In principle, one could fit parameters without constraints, and test the model by a direct check of satisfiability of the inequalities using the plug-in values. However, this does not take into account the uncertainty in the estimation. For the standard IV model, [17] discuss a proper way of testing such models in a frequentist sense.

Our models can be considerably more complicated. Recall that the constraints will depend on the extreme points of the {ζ*_{yx.w}} parameters. As implied by (4) and (5), extreme points will be functions of ζ_{yx.w}. Writing the constraints fully in terms of the observed distribution will reveal non-linear relationships. We approach the problem in a Bayesian way. We will assume first that the dimensionality of Z is modest (say, 10 or less), as this is the case in most applications of faithfulness to causal discovery. We parameterize P(Y, X, W | Z) as a full 2 × 2 × 2 contingency table.⁴

Given that the dimensionality of the problem is modest, we assign to each trivariate distribution P(Y, X, W | Z = z) an independent Dirichlet prior for every possible assignment of Z, constrained by the inequalities implied by the corresponding polytopes. The posterior is also an 8-dimensional constrained Dirichlet distribution, where we use rejection sampling to obtain a posterior sample by proposing from the unconstrained Dirichlet. A Dirichlet prior can also be assigned to P(Z). Using a sample from the posterior of P(Z) and a sample (for each possible value z) from the posterior of P(Y, X, W | Z = z), we obtain a sample upper and lower bound for the ACE.

The full algorithm is shown in Algorithm 1. The search procedure is left unspecified, as different existing approaches can be plugged into this step. See [6] for a discussion. In Section 5 we deal with small dimensional problems only, using the brute-force approach of performing an exhaustive search for Z.
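The constrained-Dirichlet rejection sampler just described can be sketched as follows; the feasibility predicate here is a trivial stand-in for the real test, which checks the linear inequalities of Section 3 at the proposed table:

```python
# Rejection sampling from a Dirichlet posterior restricted to a feasible region:
# propose from the unconstrained Dirichlet, keep samples passing the constraint.
# The feasibility predicate below is an illustrative stand-in, not WPP's polytope.
import random

def sample_dirichlet(alpha, rng):
    # Standard Dirichlet draw via normalized Gamma variates.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def rejection_sample(alpha, feasible, n, rng, max_tries=100000):
    kept, tried = [], 0
    while len(kept) < n and tried < max_tries:
        theta = sample_dirichlet(alpha, rng)
        tried += 1
        if feasible(theta):
            kept.append(theta)
    return kept, tried

rng = random.Random(0)
alpha = [2.0] * 8  # posterior counts for the 2x2x2 table P(Y, X, W | Z=z)
feasible = lambda theta: theta[0] < 0.3   # hypothetical constraint
samples, tried = rejection_sample(alpha, feasible, n=100, rng=rng)
```

The acceptance rate of this sampler is exactly what the falsification test below monitors: a model whose feasible region carries almost no posterior mass rejects nearly every proposal.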
In practice, brute-force can still be valuable by using a method such as discrete PCA [3] to reduce W\{W} to a small set of binary variables. To decide whether the premises in Rule 1 hold, we merely perform Bayesian model selection with the BDeu score [2] between the full graph {W → X, W → Y, X → Y} (conditional on Z) and the graph with the edge W → Y removed.

⁴That is, we allow for dependence between W and Y given {X, Z}, interpreting the decision of independence used in Rule 1 as being only an indicator of approximate independence.

ω_{xw} ≥ κ_{1x.w} + L^{YU}_{xw}(κ_{0x′.w} + κ_{1x′.w})  (7)
ω_{xw} ≤ 1 − (κ_{0x.w′} − ε_w(κ_{0x.w′} + κ_{1x.w′}))/U^{XU}_{xw′}  (8)
ω_{xw} − ω_{xw′}U^{XU}_{x′w} ≤ κ_{1x.w} + ε_w(κ_{0x′.w} + κ_{1x′.w})  (9)
ω_{xw} + ω_{x′w} − ω_{x′w′} ≥ κ_{1x′.w} + κ_{1x.w} − κ_{1x′.w′} + κ_{1x.w′} − χ_{xw′}(Ū + L + 2ε_w) + L  (10)

Table 2: Some of the algebraic bounds found by symbolic manipulation of linear inequalities. Notation: x, w ∈ {0, 1}, x′ = 1 − x and w′ = 1 − w are the complementary values. L^{YU}_{xw} ≡ max(0, P(Y = 1 | X = x, W = w) − ε_y), U^{YU}_{xw} ≡ min(1, P(Y = 1 | X = x, W = w) + ε_y); L^{XU}_{xw} ≡ max(0, P(X = x | W = w) − ε_x), with U^{XU}_{xw} defined accordingly. Finally, Ū ≡ max{U^{YU}_{xw}}, L ≡ min{L^{YU}_{xw}} and χ_{xw} ≡ κ_{1x.w} + κ_{0x.w}.
The full set of bounds, with proofs, can be found in the Supplementary Material.

Our "falsification test" in Step 5 is a simple and pragmatic one: our initial trial of rejection sampling proposes M samples, and if more than 95% of them are rejected, we take this as an indication that the proposed model provides a bad fit. The final result is a set of posterior distributions over bounds, possibly contradictory, which should be summarized as appropriate. Section 5 provides an example.

4 Algebraic Bounds and the Back-substitution Algorithm

Posterior sampling is expensive within the context of Bayesian WPP: constructing the dual polytope for possibly millions of instantiations of the problem is time consuming, even if each problem is small. Moreover, the numerical procedure described in Section 3 does not provide any insight into how the different free parameters {ε_w, ε_y, ε_x, β, β̄} interact to produce bounds, unlike the analytical bounds available in the standard IV case. [16] derives analytical bounds under (3) given a fixed, numerical value of ε_w. We know of no previous analytical bounds as an algebraic function of ε_w.

In the Supplementary Material, we provide a series of algebraic bounds as a function of our free parameters. Due to limited space, we show only some of the bounds in Table 2. They illustrate qualitative aspects of our free parameters. For instance, if ε_y = 1 and β = β̄ = 1, then L^{YU}_{xw} = 0 and (7) collapses to η_{xw} ≥ ζ_{1x.w}, one of the original relations found by [1] for the standard IV model. Decreasing ε_y will linearly increase L^{YU}_{xw}, tightening the corresponding lower bound in (7). If also ε_w = 0 and ε_x = 1, from (8) it follows that η_{xw} ≤ 1 − ζ_{0x.w′}.
Equation (3) implies ω_{x′w} − ω_{x′w′} ≤ ε_w, and as such by setting ε_w = 0 we have that (10) implies η_{xw} ≥ ζ_{1x.w} + ζ_{1x.w′} − ζ_{1x′.w′} − ζ_{0x.w′}, one of the most complex relationships in [1]. Further geometric intuition about the structure of the binary standard IV model is given by [19].

These bounds are not tight, in the sense that we opted not to fully exploit all possible algebraic combinations for some results, such as (10): there we use L ≤ η*_{xw} ≤ Ū and 0 ≤ δ*_w ≤ 1 instead of all possible combinations resulting from (4) and (5). The proof idea in the Supplementary Material can be further refined, at the expense of clarity. Because our derivation is a further relaxation, the implied bounds are more conservative (i.e., wider).

Besides providing insight into the structure of the problem, this gives a very efficient way of checking whether a proposed parameter vector {ζ*_{yx.w}} is valid, as well as finding the bounds: use back-substitution on the symbolic set of constraints to find box constraints L_{xw} ≤ ω_{xw} ≤ U_{xw}. The proposed parameter will be rejected whenever an upper bound is smaller than a lower bound, and (2) can be trivially optimized conditioning only on the box constraints; this is yet another relaxation, added on top of the ones used to generate the algebraic inequalities. We initialize by intersecting all algebraic box constraints (of which (7) and (8) are examples); next we refine these by scanning relations ±ω_{xw} − aω_{xw′} ≤ c such as (9) in lexicographical order, and tightening the bounds of ω_{xw} using the current upper and lower bounds on ω_{xw′} where possible.
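The pairwise tightening step just described can be sketched generically (a simplified version of ours that handles only box and pairwise-difference constraints, not the three-variable relations such as (10)):

```python
# Generic sketch of back-substitution: tighten box intervals [lo[k], hi[k]] using
# difference constraints  dlo <= w_a - w_b <= dhi,  iterating to a fixed point.
# Bounds only ever shrink, so the loop terminates (assuming a feasible system).

def tighten(lo, hi, diff):
    changed = True
    while changed:
        changed = False
        for (a, b), (dlo, dhi) in diff.items():
            # From dlo <= w_a - w_b <= dhi:
            #   w_a >= lo[b] + dlo,  w_a <= hi[b] + dhi,
            #   w_b >= lo[a] - dhi,  w_b <= hi[a] - dlo.
            new_lo_a = max(lo[a], lo[b] + dlo)
            new_hi_a = min(hi[a], hi[b] + dhi)
            new_lo_b = max(lo[b], lo[a] - dhi)
            new_hi_b = min(hi[b], hi[a] - dlo)
            if (new_lo_a, new_hi_a, new_lo_b, new_hi_b) != \
               (lo[a], hi[a], lo[b], hi[b]):
                lo[a], hi[a], lo[b], hi[b] = new_lo_a, new_hi_a, new_lo_b, new_hi_b
                changed = True
    return lo, hi

# Two parameters whose difference is bounded by 0.1 (as eps_w does in (3)):
lo = {"w00": 0.5, "w01": 0.0}
hi = {"w00": 0.9, "w01": 0.4}
diff = {("w00", "w01"): (-0.1, 0.1)}
lo, hi = tighten(lo, hi, diff)
```

In this toy run the difference constraint collapses both intervals to single points; an empty interval (lower bound above upper bound) would signal an infeasible proposal, which is exactly the rejection criterion used in the testing stage.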
We then identify constraints L_{xww′} ≤ ω_{xw} − ω_{xw′} ≤ U_{xww′} starting from −ε_w ≤ ω_{xw} − ω_{xw′} ≤ ε_w and the existing bounds, and plug these into relations ±ω_{xw} + ω_{x′w} − ω_{x′w′} ≤ c (as exemplified by (10)) to get refined bounds on ω_{xw} as functions of (L_{x′ww′}, U_{x′ww′}). We iterate this until convergence, which is guaranteed since bounds never widen at any iteration. This back-substitution of inequalities follows the spirit of message-passing and it is an order of magnitude more efficient than the fully numerical solution, while not increasing the width of the bounds by too much. In the Supplementary Material we provide evidence for this claim. In our experiments in Section 5, the back-substitution method was used in the testing stage of WPP. After collecting posterior samples, we calculated the posterior expected value of the contingency tables and ran the numerical procedure to obtain the final tight bound.⁵

5 Experiments

We describe a set of synthetic studies, followed by one study with the influenza data discussed by [9, 18]. In the synthetic study setup, we compare our method against NE1 and NE2, two naïve point estimators defined by back-door adjustment on the whole of W and on the empty set, respectively. The former is widely used in practice, even when there is no causal basis for doing so [15]. The point estimator of [6], based solely on the faithfulness assumption, is also assessed.

We generate problems where conditioning on the whole set W is guaranteed to give incorrect estimates.⁶ Here, |W| = 8.
We analyze two variations: one where it is guaranteed that at least one valid witness × admissible set pair exists; in the other, latent variables in the graph are common parents also of X and Y, so no valid witness exists. We divide each variation into two subcases: in the first, "hard" subcase, parameters are chosen (by rejection sampling) so that NE1 has a bias of at least 0.1 in the population; in the second, no such selection exists, and as such our exchangeable parameter sampling scheme makes the problem relatively easy. We summarize each WPP bound by the posterior expected value of the lower and upper bounds. In general WPP returns more than one bound: we select the upper/lower bound corresponding to the (W, Z) pair where the sum of BDeu scores for W ⊥̸⊥ Y | Z and W ⊥⊥ Y | Z ∪ {X} is highest.

Our main evaluation metric for an estimate is the Euclidean distance (henceforth, "error") between the true ACE and the closest point in the given estimate, whether the estimate is a point or an interval. For methods that provide point estimates (NE1, NE2, and faithfulness), this means just the absolute value of the difference between the true ACE and the estimated ACE. For WPP, the error of the interval [L, U] is zero if the true ACE lies in this interval. We report the error average and the error tail mass at 0.1, the latter meaning the proportion of cases where the error exceeds 0.1. The comparison is not straightforward, since the trivial interval [−1, 1] will always have zero error according to this definition. This is a trade-off, to be set according to an agreed level of information loss, measured by the width of the resulting intervals. This is discussed in the Supplemental. We run simulations at two levels of parameters: β = 0.9, β̄ = 1.1, and the same configuration except for β = β̄ = 1. The former gives somewhat wide intervals.
As Manski emphasizes [11], this is the price for making fewer assumptions. For the cases where no witness exists, Entner's Rule 1 should theoretically report no solution. In [6], stringent thresholds for accepting the two conditions of Rule 1 are adopted. Instead we take a more relaxed approach, using a uniform prior on the hypothesis of independence, and a BDeu prior with an effective sample size of 10. As such, due to the nature of our parameter randomization, almost always (typically > 90%) the method will propose at least one witness. Given this theoretical failure, for the problems where no exact solution exists, we assess how sensitive the methods are given conclusions taken from "approximate independencies" instead of exact ones.

We simulate 100 datasets for each one of the four cases (hard case/easy case, with theoretical solution/without theoretical solution), 5000 points per dataset, 1000 Monte Carlo samples per decision. Results are summarized in Table 3 for the case ε_w = ε_x = ε_y = 0.2, β = 0.9, β̄ = 1.1. Notice

⁵Sometimes, however, the expected contingency table given by the back-substitution method would fall outside the feasible region of the fully specified linear program; this is expected to happen from time to time, as the analytical bounds are looser. In such a situation, we report the bounds given by the back-substitution samples.

⁶In detail: we generate graphs where W ≡ {Z1, Z2, . . . , Z8}. Four independent latent variables L1, . . . , L4 are added as parents of each of {Z5, . . . , Z8}; L1 is also a parent of X, and L2 a parent of Y. L3 and L4 are each randomly assigned to be a parent of either X or Y, but not both. {Z5, . . . , Z8} have no other parents. The graph over Z1, . . . , Z4 is chosen by adding edges uniformly at random according to the lexicographic order.
In consequence, using the full set W for back-door adjustment is always incorrect, as at least four paths X ← L1 → Zi ← L2 → Y are active for i = 5, 6, 7, 8. The conditional probabilities of a vertex given its parents are generated by a logistic regression model with pairwise interactions, whose parameters are sampled from a zero-mean Gaussian with standard deviation 10 / (number of parents). Parameter values are truncated so that all conditional probabilities lie between 0.025 and 0.975.

Case (β = 1, β̄ = 1)    NE1           NE2           Faith.        WPP           Width
Hard/Solvable           (0.12, 1.00)  (0.02, 0.03)  (0.05, 0.05)  (0.01, 0.01)  0.24
Easy/Solvable           (0.01, 0.01)  (0.07, 0.24)  (0.02, 0.01)  (0.00, 0.00)  0.24
Hard/Unsolvable         (0.16, 1.00)  (0.20, 0.88)  (0.19, 0.95)  (0.07, 0.25)  0.24
Easy/Unsolvable         (0.09, 0.32)  (0.14, 0.56)  (0.12, 0.53)  (0.03, 0.08)  0.23

Table 3: Summary of the outcomes of the synthetic studies. Each entry for a particular method is a pair (error average, error tail mass at 0.1), as explained in the main text. The last column is the median width of the WPP interval. In a similar experiment with β = 0.9, β̄ = 1.1, WPP achieves nearly zero error, with interval widths around 0.50. A much more detailed table covering many other cases is provided in the Supplementary Material.

that WPP is quite stable, while the other methods have strengths and weaknesses depending on the setup. For the unsolvable cases, we average over the approximately 99% of cases where some solution was reported: in theory, no conditional independences hold and no solution should be reported, but WPP shows empirical robustness for the true ACE in these cases.
Our empirical study concerns the effect of influenza vaccination on a patient being hospitalized later on with chest problems. X = 1 means the patient got a flu shot, and Y = 1 indicates that the patient was hospitalized.
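The error metric used in Table 3 above can be stated compactly as code; this is a trivial but clarifying sketch, and the function name is ours.

```python
def interval_error(true_ace, lower, upper):
    """Distance from the true ACE to the closest point of the reported
    interval [lower, upper]; zero when the true ACE is covered.  A point
    estimate is the special case lower == upper."""
    if lower <= true_ace <= upper:
        return 0.0
    return min(abs(true_ace - lower), abs(true_ace - upper))
```

Under this metric the trivial interval [−1, 1] has zero error for any true ACE, which is why the median interval width is reported alongside.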
A negative ACE therefore suggests a desirable vaccine. The study was originally discussed by [12]. Shots were not randomized, but doctors were randomly assigned to receive a reminder letter encouraging them to inoculate their patients, recorded as GRP. This suggests the standard IV model in Figure 1(d), with W = GRP and U unobservable. Using the bounds of [1] and the observed frequencies gives an interval of [−0.23, 0.64] for the ACE. WPP could not validate GRP as a witness, instead returning as the highest-scoring pair the witness DM (patient had a history of diabetes prior to vaccination) with an admissible set composed of AGE (dichotomized at 60 years) and SEX. Here, we excluded GRP as a possible member of an admissible set, under the assumption that it cannot be a common cause of X and Y. Choosing ε_w = ε_y = ε_x = 0.2 and β = 0.9, β̄ = 1.1, we obtain the posterior expected interval [−0.10, 0.17]. This does not mean the vaccine is more likely to be bad (positive ACE) than good: the posterior distribution is over bounds, not over points, and is completely agnostic about the distribution within the bounds. Notice that even though we allow for full dependence between all of our variables, the bounds are considerably stricter than in the standard IV model, due to the weakening of hidden confounder effects postulated after observing conditional independences. Posterior plots and a sensitivity analysis are included in the Supplementary Material; for further discussion see [18, 9].

6 Conclusion

Our model provides a novel compromise between point estimators given by faithfulness assumptions and bounds based on instrumental variables. We believe such an approach should become a standard item in the toolbox of anyone who needs to perform an observational study. R code is available at http://www.homepages.ucl.ac.uk/~ucgtrbd/wpp.
Unlike risky Bayesian approaches that put priors directly on the parameters of the unidentifiable latent variable model P(Y, X, W, U | Z), the constrained Dirichlet prior does not suffer from massive sensitivity to the choice of hyperparameters, as discussed at length by [18] and in the Supplementary Material. By focusing on bounds, WPP keeps inference more honest, providing a compromise between a method purely based on faithfulness and purely theory-driven analyses that overlook competing models suggested by independence constraints. As future work, we will look at generalizing the procedure beyond relaxations of chain structures W → X → Y. Much of the machinery developed here, including Entner's Rules, can be adapted to the case where the causal ordering is unknown: the search for "Y-structures" [10] generalizes the chain structure search to this case. We will also look into ways of suggesting plausible values for the relaxation parameters, already touched upon in the Supplementary Material. Finally, the techniques used to derive the symbolic bounds in Section 4 may prove useful in a more general context, complementing other methods for finding subsets of useful constraints such as the information-theoretic approach of [8] and the graphical approach of [7].
Acknowledgements. We thank McDonald, Hiu and Tierney for their flu vaccine data, and the anonymous reviewers for their valuable feedback.

References
[1] A. Balke and J. Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, pages 1171–1176, 1997.
[2] W. Buntine. Theory refinement on Bayesian networks. Proceedings of the 7th Conference on Uncertainty in Artificial Intelligence (UAI1991), pages 52–60, 1991.
[3] W. Buntine and A. Jakulin. Applying discrete PCA in data analysis.
Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI2004), pages 59–66, 2004.
[4] L. Chen, F. Emmert-Streib, and J. D. Storey. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology, 8:R219, 2007.
[5] A. P. Dawid. Causal inference using influence diagrams: the problem of partial compliance. In P. J. Green, N. L. Hjort, and S. Richardson, editors, Highly Structured Stochastic Systems, pages 45–65. Oxford University Press, 2003.
[6] D. Entner, P. Hoyer, and P. Spirtes. Data-driven covariate selection for nonparametric estimation of causal effects. JMLR W&CP: AISTATS 2013, 31:256–264, 2013.
[7] R. Evans. Graphical methods for inequality constraints in marginalized DAGs. Proceedings of the 22nd Workshop on Machine Learning and Signal Processing, 2012.
[8] P. Geiger, D. Janzing, and B. Schölkopf. Estimating causal effects by bounding confounding. Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pages 240–249, 2014.
[9] K. Hirano, G. Imbens, D. Rubin, and X.-H. Zhou. Assessing the effect of an influenza vaccine in an encouragement design. Biostatistics, 1:69–88, 2000.
[10] S. Mani, G. Cooper, and P. Spirtes. A theoretical study of Y structures for causal discovery. Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI2006), pages 314–323, 2006.
[11] C. Manski. Identification for Prediction and Decision. Harvard University Press, 2007.
[12] C. McDonald, S. Hiu, and W. Tierney. Effects of computer reminders for influenza vaccination on morbidity during influenza epidemics. MD Computing, 9:304–312, 1992.
[13] C. Meek. Strong completeness and faithfulness in Bayesian networks. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI1995), pages 411–418, 1995.
[14] J. Pearl.
Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
[15] J. Pearl. Myth, confusion, and science in causal analysis. UCLA Cognitive Systems Laboratory, Technical Report (R-348), 2009.
[16] R. Ramsahai. Causal bounds and observable constraints for non-deterministic models. Journal of Machine Learning Research, pages 829–848, 2012.
[17] R. Ramsahai and S. Lauritzen. Likelihood analysis of the binary instrumental variable model. Biometrika, 98:987–994, 2011.
[18] T. Richardson, R. Evans, and J. Robins. Transparent parameterizations of models for potential outcomes. In J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, editors, Bayesian Statistics 9, pages 569–610. Oxford University Press, 2011.
[19] T. Richardson and J. Robins. Analysis of the binary instrumental variable model. In R. Dechter, H. Geffner, and J. Y. Halpern, editors, Heuristics, Probability and Causality: A Tribute to Judea Pearl, pages 415–444. College Publications, 2010.
[20] J. Robins, R. Scheines, P. Spirtes, and L. Wasserman. Uniform consistency in causal inference. Biometrika, 90:491–515, 2003.
[21] P. Rosenbaum. Observational Studies. Springer-Verlag, 2002.
[22] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Cambridge University Press, 2000.
[23] T. VanderWeele and I. Shpitser. A new criterion for confounder selection. Biometrics, 64:1406–1413, 2011.