{"title": "BACKSHIFT: Learning causal cyclic graphs from unknown shift interventions", "book": "Advances in Neural Information Processing Systems", "page_first": 1513, "page_last": 1521, "abstract": "We propose a simple method to learn linear causal cyclic models in the presence of latent variables. The method relies on equilibrium data of the model recorded under a specific kind of interventions (``shift interventions''). The location and strength of these interventions do not have to be known and can be estimated from the data. Our method, called BACKSHIFT, only uses second moments of the data and performs simple joint matrix diagonalization, applied to differences between covariance matrices. We give a sufficient and necessary condition for identifiability of the system, which is fulfilled almost surely under some quite general assumptions if and only if there are at least three distinct experimental settings, one of which can be pure observational data. We demonstrate the performance on some simulated data and applications in flow cytometry and financial time series.", "full_text": "BACKSHIFT: Learning causal cyclic graphs from\n\nunknown shift interventions\n\nDominik Rothenh\u00a8ausler\u21e4\n\nSeminar f\u00a8ur Statistik\n\nETH Z\u00a8urich, Switzerland\n\nrothenhaeusler@stat.math.ethz.ch\n\nChristina Heinze\u21e4\nSeminar f\u00a8ur Statistik\n\nETH Z\u00a8urich, Switzerland\n\nheinze@stat.math.ethz.ch\n\nJonas Peters\n\njonas.peters@tuebingen.mpg.de\n\nMax Planck Institute for Intelligent Systems\n\nT\u00a8ubingen, Germany\n\nNicolai Meinshausen\nSeminar f\u00a8ur Statistik\n\nETH Z\u00a8urich, Switzerland\n\nmeinshausen@stat.math.ethz.ch\n\nAbstract\n\nWe propose a simple method to learn linear causal cyclic models in the presence\nof latent variables. The method relies on equilibrium data of the model recorded\nunder a speci\ufb01c kind of interventions (\u201cshift interventions\u201d). 
The location and strength of these interventions do not have to be known and can be estimated from the data. Our method, called BACKSHIFT, only uses second moments of the data and performs simple joint matrix diagonalization, applied to differences between covariance matrices. We give a sufficient and necessary condition for identifiability of the system, which is fulfilled almost surely under some quite general assumptions if and only if there are at least three distinct experimental settings, one of which can be pure observational data. We demonstrate the performance on some simulated data and applications in flow cytometry and financial time series.

1 Introduction

Discovering causal effects is a fundamentally important yet very challenging task in various disciplines, from public health research, sociological studies and economics to many applications in the life sciences. There has been much progress on learning acyclic graphs in the context of structural equation models [1], including methods that learn from observational data alone under a faithfulness assumption [2, 3, 4, 5], exploiting non-Gaussianity of the data [6, 7] or non-linearities [8]. Feedbacks are prevalent in most applications, and we are interested in the setting of [9], where we observe the equilibrium data of a model that is characterized by a set of linear relations

x = Bx + e,    (1)

where x ∈ R^p is a random vector and B ∈ R^{p×p} is the connectivity matrix with zeros on the diagonal (no self-loops). Allowing for self-loops would lead to an identifiability problem, independent of the method. See Section B in the Appendix for more details on this setting. The graph corresponding to B has p nodes and an edge from node j to node i if and only if B_{i,j} ≠ 0. The error terms e are p-dimensional random variables with mean 0 and positive semi-definite covariance matrix Σ_e = E(ee^T).
We do not assume that Σ_e is a diagonal matrix, which allows for the existence of latent variables.

The solutions to (1) can be thought of as the deterministic equilibrium solutions (conditional on the noise term) of a dynamic model governed by first-order difference equations with matrix B in the sense of [10].

*Authors contributed equally.

For well-defined equilibrium solutions of (1), we need that I − B is invertible. Usually we also want (1) to converge to an equilibrium when iterating as x(new) ← B x(old) + e or, in other words, lim_{m→∞} B^m = 0. This condition is equivalent to the spectral radius of B being strictly smaller than one [11]. We will make an assumption on cyclic graphs that restricts the strength of the feedback. Specifically, let a cycle of length ℓ be given by (m_1, ..., m_{ℓ+1} = m_1) ∈ {1, ..., p}^{1+ℓ} with m_k ≠ m_l for 1 ≤ k < l ≤ ℓ. We define the cycle-product CP(B) of a matrix B to be the maximum over cycles of all lengths 1 < ℓ ≤ p of the path-products,

CP(B) := max over cycles (m_1, ..., m_ℓ, m_{ℓ+1}) with 1 < ℓ ≤ p of ∏_{1 ≤ k ≤ ℓ} B_{m_{k+1}, m_k}.    (2)

The cycle-product CP(B) is clearly zero for acyclic graphs. We will assume the cycle-product to be strictly smaller than one for identifiability results, see Assumption (A) below. The most interesting graphs are those for which CP(B) < 1 and for which the spectral radius of B is strictly smaller than one. Note that these two conditions are identical as long as the cycles in the graph do not intersect, i.e., there is no node that is part of two cycles (for example if there is at most one cycle in the graph). If cycles do intersect, we can have models for which either (i) CP(B) < 1 but the spectral radius is larger than one or (ii) CP(B) > 1 but the spectral radius is strictly smaller than one.
Models in situation (ii) are not stable in the sense that the iterations will not converge under interventions. We can for example block all but one cycle. If this one single unblocked cycle has a cycle-product larger than 1 (and there is such a cycle in the graph if CP(B) > 1), then the solutions of the iteration are not stable¹. Models in situation (i) are not stable either, even in the absence of interventions. We can still in theory obtain the now unstable equilibrium solutions to (1) as (I − B)^{-1} e, and the theory below applies to these unstable equilibrium solutions. However, such unstable equilibrium solutions are arguably of little practical interest. In summary: all interesting feedback models that are stable under interventions satisfy CP(B) < 1 and have a spectral radius strictly smaller than one. We will just assume CP(B) < 1 for the following results.

It is impossible to learn the structure B of this model from observational data alone without making further assumptions. The LINGAM approach has been extended in [11] to cyclic models, exploiting a possible non-Gaussianity of the data. Using both experimental and interventional data, [12, 9] could show identifiability of the connectivity matrix B under a learning mechanism that relies on data under so-called "surgical" or "perfect" interventions. In their framework, a variable becomes independent of all its parents if it is being intervened on, and all incoming contributions are thus effectively removed under the intervention (also called do-interventions in the classical sense of [13]). The learning mechanism then makes use of the knowledge of where these "surgical" interventions occurred.
[14] also allow for "changing" the incoming arrows for variables that are intervened on; but again, [14] requires the location of the interventions while we do not assume such knowledge. [15] consider a target variable and allow for arbitrary interventions on all other nodes. They permit neither hidden variables nor cycles.

Here, we are interested in a setting where we have either no or just very limited knowledge about the exact location and strength of the interventions, as is often the case for data observed under different environments (see the example on financial time series further below) or for biological data [16, 17]. These interventions have been called "fat-hand" or "uncertain" interventions [18]. While [18] assume acyclicity and model the structure explicitly in a Bayesian setting, we assume that the data in environment j are equilibrium observations of the model

x_j = B x_j + c_j + e_j,    (3)

where the random intervention shift c_j has a mean and covariance Σ_{c,j}. The locations of these interventions (or simply the intervened variables) are those components of c_j that are not zero with probability one. Given these locations, the interventions simply shift the variables by a value determined by c_j; they are therefore not "surgical" but can be seen as a special case of what is called an "imperfect", "parametric" [19] or "dependent" intervention [20] or "mechanism change" [21]. The matrix B and the error distribution of e_j are assumed to be identical in all environments. In contrast to the covariance matrix for the noise term e_j, we do assume that Σ_{c,j} is a diagonal matrix, which is equivalent to demanding that interventions at different variables are uncorrelated. This is a key assumption necessary to identify the model using experimental data. Furthermore, we will discuss in Section 4.2 how a violation of the model assumption (3) can be detected and used to estimate the location of the interventions.

¹The blocking of all but one cycle can be achieved by do-interventions on appropriate variables under the following condition: for every pair of cycles in the graph, the variables in one cycle cannot be a subset of the variables in the other cycle. Otherwise the blocking could be achieved by deletion of appropriate edges.

In Section 2 we show how to leverage observations under different environments with different interventional distributions to learn the structure of the connectivity matrix B in model (3). The method rests on a simple joint matrix diagonalization. We prove necessary and sufficient conditions for identifiability in Section 3. Numerical results for simulated data and applications in flow cytometry and financial data are shown in Section 4.

2 Method

2.1 Grouping of data

Let J be the set of experimental conditions under which we observe equilibrium data from model (3). These different experimental conditions can arise in two ways: (a) a controlled experiment was conducted where the external input or the external imperfect interventions have been deliberately changed from one member of J to the next. An example is given by the flow cytometry data [22] discussed in Section 4.2. (b) The data are recorded over time. It is assumed that the external input is changing over time but not in an explicitly controlled way. The data are grouped into consecutive blocks j ∈ J of observations, see Section 4.3 for an example.

2.2 Notation

Assume we have n_j observations in each setting j ∈ J. Let X_j be the (n_j × p)-matrix of observations from model (3).
For general random variables a_j ∈ R^p, the population covariance matrix in setting j ∈ J is denoted by Σ_{a,j} = Cov(a_j), where the covariance is under the setting j ∈ J. Furthermore, the covariance over all settings except setting j ∈ J is defined as the average over all environments except for the j-th environment, (|J| − 1) Σ̄_{a,−j} := Σ_{j' ∈ J\{j}} Σ_{a,j'}. The population Gram matrix is defined as G_{a,j} = E(a_j a_j^T). Let the (p × p)-dimensional Σ̂_{a,j} be the empirical covariance matrix of the observations A_j ∈ R^{n_j × p} of variable a_j in setting j ∈ J. More precisely, let Ã_j be the column-wise mean-centered version of A_j. Then Σ̂_{a,j} := (n_j − 1)^{-1} Ã_j^T Ã_j. The empirical Gram matrix is denoted by Ĝ_{a,j} := n_j^{-1} A_j^T A_j.

2.3 Assumptions

The main assumptions have been stated already but we give a summary below.

(A) The data are equilibrium observations of model (3). The matrix I − B is invertible and the solutions to (3) are thus well defined. The cycle-product (2), CP(B), is strictly smaller than one. The diagonal entries of B are zero.

(B) The distribution of the noise e_j (which includes the influence of latent variables) and the connectivity matrix B are identical across all settings j ∈ J. In each setting j ∈ J, the intervention shift c_j and the noise e_j are uncorrelated.

(C) Interventions at different variables in the same setting are uncorrelated, that is, Σ_{c,j} is an (unknown) diagonal matrix for all j ∈ J.

We will discuss a stricter version of (C) in Section D in the Appendix that allows the use of Gram matrices instead of covariance matrices. The conditions above imply that the environments are characterized by different intervention strengths, as measured by the variance of the shift c in each setting.
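The empirical quantities of Section 2.2 are straightforward to compute. The following Python/NumPy sketch (our own illustration, with a made-up data matrix standing in for X_j) computes both the empirical covariance and the empirical Gram matrix for one setting:

```python
import numpy as np

rng = np.random.default_rng(0)
n_j, p = 100, 4
X_j = rng.normal(size=(n_j, p))   # stand-in for the (n_j x p) observations in setting j

# Empirical covariance: column-wise mean-centering, then (n_j - 1)^{-1} * Xc^T Xc.
Xc = X_j - X_j.mean(axis=0)
Sigma_hat = Xc.T @ Xc / (n_j - 1)

# Empirical Gram matrix: n_j^{-1} * X^T X, without centering.
G_hat = X_j.T @ X_j / n_j

# The covariance definition agrees with NumPy's built-in (ddof = 1 by default).
assert np.allclose(Sigma_hat, np.cov(X_j, rowvar=False))
```

The two matrices coincide when a_j has mean zero in every setting, which is the situation in which the Gram-matrix variant of Section D applies.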
We aim to reconstruct the connectivity matrix B from observations in different environments, and also to reconstruct the a-priori unknown intervention strength and location in each environment. Additionally, we will show examples where we can detect violations of the model assumptions and use these to reconstruct the location of interventions.

2.4 Population method

The main idea is very simple. Looking at the model (3), we can rewrite

(I − B) x_j = c_j + e_j.    (4)

The population covariance of the transformed observations is then for all settings j ∈ J given by

(I − B) Σ_{x,j} (I − B)^T = Σ_{c,j} + Σ_e.    (5)

The last term Σ_e is constant across all settings j ∈ J (but not necessarily diagonal as we allow hidden variables). Any change of the matrix on the left-hand side thus stems from a shift in the covariance matrix Σ_{c,j} of the interventions. Let us define the differences between the covariance in setting j and the average over the other settings, for c and x respectively, as

ΔΣ_{c,j} := Σ_{c,j} − Σ̄_{c,−j}, and ΔΣ_{x,j} := Σ_{x,j} − Σ̄_{x,−j}.    (6)

Assumption (B) together with (5) implies that

(I − B) ΔΣ_{x,j} (I − B)^T = ΔΣ_{c,j}  for all j ∈ J.    (7)

Using assumption (C), the random intervention shifts at different variables are uncorrelated and the right-hand side in (7) is thus a diagonal matrix for all j ∈ J. Let D ⊂ R^{p×p} be the set of all invertible matrices. We also define a more restricted space D_cp which only includes those members of D that have entries all equal to one on the diagonal and have a cycle-product less than one,

D := {D ∈ R^{p×p} : D invertible},    (8)
D_cp := {D ∈ R^{p×p} : D ∈ D and diag(D) ≡ 1 and CP(I − D) < 1}.    (9)

Under Assumption (A), I − B ∈ D_cp. Motivated by (7), we now consider the minimizer

D* = argmin_{D' ∈ D_cp} Σ_{j ∈ J} L(D' ΔΣ_{x,j} D'^T), where L(A) := Σ_{k ≠ l} A_{k,l}²    (10)

is the loss L for any matrix A, defined as the sum of the squared off-diagonal elements. In Section 3, we present necessary and sufficient conditions on the interventions under which D* = I − B is the unique minimizer of (10). In this case, exact joint diagonalization is possible so that L(D* ΔΣ_{x,j} D*^T) = 0 for all environments j ∈ J. We discuss an alternative that replaces covariance with Gram matrices throughout in Section D in the Appendix. We now give a finite-sample version.

2.5 Finite-sample estimate of the connectivity matrix

In practice, we estimate B by minimizing the empirical counterpart of (10) in two steps. First, the solution of the optimization is only constrained to matrices in D. Subsequently, we enforce the constraint that the solution be a member of D_cp. The BACKSHIFT algorithm is presented in Algorithm 1 and we describe the important steps in more detail below.

Algorithm 1 BACKSHIFT
Input: X_j for all j ∈ J
1: Compute ΔΣ̂_{x,j} for all j ∈ J
2: D̃ = FFDIAG(ΔΣ̂_{x,j})
3: D̂ = PermuteAndScale(D̃)
4: B̂ = I − D̂
Output: B̂

Steps 1 & 2. First, we minimize the following empirical, less constrained variant of (10),

D̃ := argmin_{D' ∈ D} Σ_{j ∈ J} L(D' ΔΣ̂_{x,j} D'^T),    (11)

where the population differences between covariance matrices are replaced with their empirical counterparts and the only constraint on the solution is that it is invertible, i.e. D̃ ∈ D. For the optimization we use the joint approximate matrix diagonalization algorithm FFDIAG [23].

Step 3. The constraint on the cycle product and the diagonal elements of D is enforced by (a) permuting and (b) scaling the rows of D̃. Part (b) simply scales the rows so that the diagonal elements of the resulting matrix D̂ are all equal to one. The more challenging first step (a) consists of finding a permutation such that, under this permutation, the scaled matrix from part (b) has a cycle product as small as possible (as follows from Theorem 3, at most one permutation can lead to a cycle product less than one).
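To make the permute-and-scale step concrete, here is an illustrative Python/NumPy sketch (our own simplification, not the paper's R implementation): it enumerates permutations by brute force instead of solving the linear assignment problem of Theorem 3, picks the row permutation with the largest absolute diagonal product as a heuristic stand-in, rescales rows to unit diagonal, and then runs the final cycle-product check of definition (2), also by brute force:

```python
import itertools
import numpy as np

def cycle_product(B):
    # Brute-force cycle-product (2): maximum over all simple cycles of
    # length > 1 of the product of edge weights B[m_{k+1}, m_k].
    # Exponential in p -- for illustration only; the paper notes an
    # O(p^3) dynamic-programming alternative.
    p = B.shape[0]
    best = 0.0  # zero for acyclic graphs, as in the paper
    for length in range(2, p + 1):
        for nodes in itertools.permutations(range(p), length):
            cyc = list(nodes) + [nodes[0]]
            prod = 1.0
            for k in range(length):
                prod *= B[cyc[k + 1], cyc[k]]
            best = max(best, prod)
    return best

def permute_and_scale(D_tilde):
    # (a) Choose the row permutation with the largest absolute diagonal
    #     product (our heuristic stand-in for the LAP step), then
    # (b) rescale each row so the diagonal becomes one.
    p = D_tilde.shape[0]
    perm = max(itertools.permutations(range(p)),
               key=lambda s: abs(np.prod([D_tilde[s[i], i] for i in range(p)])))
    D_hat = D_tilde[list(perm), :]
    D_hat = D_hat / np.diag(D_hat)[:, None]
    if cycle_product(np.eye(p) - D_hat) >= 1:
        raise ValueError("model assumptions not met: cycle product >= 1")
    return D_hat
```

For example, row-permuting and row-scaling a matrix with unit diagonal and then applying `permute_and_scale` recovers the original matrix, since rescaled rows cancel in the diagonal-scaling step.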
This optimization problem seems computationally challenging at first, but we show that it can be solved by a variant of the linear assignment problem (LAP) (see e.g. [24]), as proven in Theorem 3 in the Appendix. As a last step, we check whether the cycle product of D̂ is less than one, in which case we have found the solution. Otherwise, no solution satisfying the model assumptions exists and we return a warning that the model assumptions are not met. See Appendix B for more details.

Computational cost. The computational complexity of BACKSHIFT is O(|J|·n·p²), as computing the covariance matrices costs O(|J|·n·p²), FFDIAG has a computational cost of O(|J|·p²), and both the linear assignment problem and computing the cycle product can be solved in O(p³) time. For instance, this complexity is achieved when using the Hungarian algorithm for the linear assignment problem (see e.g. [24]); the cycle product can be computed with a simple dynamic programming approach.

2.6 Estimating the intervention variances

One additional benefit of BACKSHIFT is that the location and strength of the interventions can be estimated from the data. The empirical, plug-in version of Eq. (7) is given by

(I − B̂) ΔΣ̂_{x,j} (I − B̂)^T = ΔΣ̂_{c,j}  for all j ∈ J,    (12)

where ΔΣ̂_{c,j} := Σ̂_{c,j} minus the average of the Σ̂_{c,j'} over the other environments. So the element (ΔΣ̂_{c,j})_{kk} is an estimate of the difference between the variance of the intervention at variable k in environment j, namely (Σ_{c,j})_{kk}, and the average over all other environments, (Σ̄_{c,−j})_{kk}. From these differences we can compute the intervention variance for all environments up to an offset. By convention, we set the minimal intervention variance across all environments equal to zero.
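The identity (7), and hence the variance differences estimated in this section, can be sanity-checked on population covariances. In the following Python/NumPy sketch, B, Σ_e and the per-environment intervention variances are made-up toy values (our own illustration): the transformed covariance differences come out exactly diagonal, with diagonal entries equal to the intervention-variance differences.

```python
import numpy as np

# Toy connectivity matrix: zero diagonal, cycle product 0.2 < 1 (assumed values).
B = np.array([[0.0, 0.4, 0.0],
              [0.5, 0.0, 0.0],
              [0.3, 0.0, 0.0]])
I = np.eye(3)
A = np.linalg.inv(I - B)              # x = (I - B)^{-1} (c + e)

Sigma_e = np.array([[1.0, 0.3, 0.0],  # non-diagonal: latent confounding allowed
                    [0.3, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

# Diagonal intervention covariances for three environments, cf. assumption (C);
# the first environment is purely observational.
Sigma_c = [np.diag(v) for v in ([0.0, 0.0, 0.0], [2.0, 0.5, 1.0], [0.3, 1.5, 0.7])]

# Population covariance of x in each environment, from (3)/(5).
Sigma_x = [A @ (Sc + Sigma_e) @ A.T for Sc in Sigma_c]

def delta(mats, j):
    """Difference between environment j and the average of the others, cf. (6)."""
    others = [m for i, m in enumerate(mats) if i != j]
    return mats[j] - sum(others) / len(others)

for j in range(3):
    lhs = (I - B) @ delta(Sigma_x, j) @ (I - B).T
    # lhs equals Delta Sigma_{c,j}: Sigma_e cancels in the difference, cf. (7).
    assert np.allclose(lhs, np.diag(np.diag(lhs)))
    assert np.allclose(np.diag(lhs), np.diag(delta(Sigma_c, j)))
```

The same computation with B̂ and empirical covariances in place of the population quantities is the plug-in estimate (12).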
Alternatively, one can let observational data, if available, serve as a baseline against which the intervention variances are measured.

3 Identifiability

For simplicity of notation, let

η_{j,k} := (ΔΣ_{c,j})_{kk}

be the variance of the random intervention shifts c_j at node k in environment j ∈ J, as per the definition of ΔΣ_{c,j} in (6). We then have the following identifiability result (the proof is provided in Appendix A).

Theorem 1. Under assumptions (A), (B) and (C), the solution to (10) is unique if and only if for all k, l ∈ {1, ..., p} there exist j, j' ∈ J such that

η_{j,k} η_{j',l} ≠ η_{j,l} η_{j',k}.    (13)

If none of the intervention variances η_{j,k} vanishes, the uniqueness condition is equivalent to demanding that the ratio between the intervention variances for two variables k, l must not stay identical across all environments, that is, there exist j, j' ∈ J such that

η_{j,k} / η_{j,l} ≠ η_{j',k} / η_{j',l},    (14)

which requires that the ratio of the variances of the intervention shifts at two nodes k, l is not identical across all settings. This leads to the following corollary.

Corollary 2. (i) The identifiability condition (13) cannot be satisfied if |J| = 2 since then η_{j,k} = −η_{j',k} for all k and j ≠ j'.
We need at least three different environments for identifiability.

(ii) The identifiability condition (13) is satisfied for all |J| ≥ 3 almost surely if the variances of the interventions c_j are chosen independently (over all variables and environments j ∈ J) from a distribution that is absolutely continuous with respect to the Lebesgue measure.

Condition (ii) can be relaxed, but it shows that we can already achieve full identifiability in a very generic setting with three (or more) different environments.

4 Numerical results

In this section, we present empirical results for both synthetic and real data sets. In addition to estimating the connectivity matrix B, we demonstrate various ways to estimate properties of the interventions. Besides computing the point estimate for BACKSHIFT, we use stability selection [25] to assess the stability of retrieved edges. We attach R-code with which all simulations and analyses can be reproduced².

²An R-package called "backShift" is available from CRAN.

[Figure 1 appears here. Its panels show (a) the true ten-node cyclic network with signed edge weights, (b) the scheme for data generation with observed variables X_1, X_2, X_3, noise terms E_i, interventions I_i and a hidden variable W, and (c) precision/recall of BACKSHIFT and LING across intervention strengths 0, 0.5 and 1, sample sizes 1000 and 10000, with and without hidden variables.]

Figure 1: Simulated data. (a) True network. (b) Scheme for data generation. (c) Performance metrics for the settings considered in Section 4.1.
For BACKSHIFT, precision and recall values for Settings 1 and 2 coincide.

[Figure 2 appears here: estimated graphs of BACKSHIFT (top row) and LING (bottom row) for Settings 1-5. Setting 1: n = 1000, no hidden variables, m_I = 1; Setting 2: n = 10000, no hidden variables, m_I = 1; Setting 3: n = 10000, hidden variables, m_I = 1; Setting 4: n = 10000, no hidden variables, m_I = 0; Setting 5: n = 10000, no hidden variables, m_I = 0.5. Reported SHD values for Settings 1-5 are 0, 0, 2, 12, 5 for BACKSHIFT and 17, 14, 16, 8, 7 for LING.]

Figure 2: Point estimates of BACKSHIFT and LING for synthetic data. We threshold the point estimate of BACKSHIFT at t = ±0.25 to exclude those entries which are close to zero. We then threshold the estimate of LING so that the two estimates have the same number of edges. In Setting 4, we threshold LING at t = ±0.25 as BACKSHIFT returns the empty graph. In Setting 3, it is not possible to achieve the same number of edges as all remaining coefficients in the point estimate of LING are equal to one in absolute value. The transparency of the edges illustrates the relative magnitude of the estimated coefficients. We report the structural Hamming distance (SHD) for each graph. Precision and recall values are shown in Figure 1(c).

4.1 Synthetic data

We compare the point estimate of BACKSHIFT against LING [11], a generalization of LINGAM to the cyclic case for purely observational data. We consider the cyclic graph shown in Figure 1(a) and generate data under different scenarios. The data generating mechanism is sketched in Figure 1(b). Specifically, we generate ten distinct environments with non-Gaussian noise. In each environment, the random intervention variable is generated as (c_j)_k = δ_k^j I_k^j, where δ_1^j, ..., δ_p^j are drawn i.i.d. from Exp(m_I) and I_1^j, ..., I_p^j are independent standard normal random variables. The intervention shift thus acts on all observed random variables. The parameter m_I regulates the strength of the intervention. If hidden variables exist, the noise term (e_j)_k of variable k in environment j is equal to γ_k W^j, where the weights γ_1, ..., γ_p are sampled once from a N(0, 1)-distribution and the random variable W^j has a Laplace(0, 1) distribution. If no hidden variables are present, then (e_j)_k, k = 1, ..., p, is sampled i.i.d. Laplace(0, 1). In this set of experiments, we consider five different settings (described below) in which the sample size n, the intervention strength m_I as well as the existence of hidden variables vary.

We allow for hidden variables in only one out of five settings as LING assumes causal sufficiency and can thus in theory not cope with hidden variables. If no hidden variables are present, the pooled data can be interpreted as coming from a model whose error variables follow a mixture distribution. But if one of the error variables comes from the second mixture component, for example, then the other error variables come from the second mixture component, too. In this sense, the data points are not independent anymore. This poses a challenge for LING, which assumes an i.i.d. sample. We also cover a case (for m_I = 0) in which all assumptions of LING are satisfied (Setting 4).

[Figure 3 appears here; its four panels show networks over the compounds Raf, Mek, Erk, Akt, PKA, PKC, PIP2, PIP3, PLCg, p38 and JNK.]

Figure 3: Flow cytometry data. (a) Union of the consensus network (according to [22]), the reconstruction by [22] and the best acyclic reconstruction by [26]. The edge thickness and intensity reflect in how many of these three sources that particular edge is present. (b) One of the cyclic reconstructions by [26]. The edge thickness and intensity reflect the probability of selecting that particular edge in the stability selection procedure. For more details see [26]. (c) BACKSHIFT point estimate, thresholded at ±0.35. The edge intensity reflects the relative magnitude of the coefficients and the coloring is a comparison to the union of the graphs shown in panels (a) and (b). Blue edges were also found in [26] and [22], purple edges are reversed and green edges were not previously found in (a) or (b). (d) BACKSHIFT stability selection result with parameters E(V) = 2 and π_thr = 0.75. The edge thickness illustrates how often an edge was selected in the stability selection procedure.

Figure 2 shows the estimated connectivity matrices for the five different settings and Figure 1(c) shows the obtained precision and recall values. In Setting 1, n = 1000, m_I = 1 and there are no hidden variables. In Setting 2, n is increased to 10000 while the other parameters do not change.
We observe that BACKSHIFT retrieves the correct adjacency matrix in both cases while LING's estimate is not very accurate. It improves slightly when increasing the sample size. In Setting 3, we do include hidden variables, which violates the causal sufficiency assumption required for LING. Indeed, the estimate is worse than in Setting 2 but somewhat better than in Setting 1. BACKSHIFT retrieves two false positives in this case. Setting 4 is not feasible for BACKSHIFT as the distribution of the variables is identical in all environments (since m_I = 0). In Step 2 of the algorithm, FFDIAG does not converge and therefore the empty graph is returned. So the recall value is zero while precision is not defined. For LING all assumptions are satisfied and the estimate is more accurate than in Settings 1-3. Lastly, Setting 5 shows that when increasing the intervention strength to 0.5, BACKSHIFT returns a few false positives. Its performance is then similar to LING, which returns its most accurate estimate in this scenario. The stability selection results for BACKSHIFT are provided in Figure 5 in Appendix E.

In short, these results suggest that the BACKSHIFT point estimates are close to the true graph if the interventions are sufficiently strong. Hidden variables make the estimation problem more difficult, but the true graph is recovered if the strength of the intervention is increased (when increasing m_I to 1.5 in Setting 3, BACKSHIFT obtains an SHD of zero). In contrast, LING is unable to cope with hidden variables but also has worse accuracy in the absence of hidden variables under these shift interventions.

4.2 Flow cytometry data

The data published in [22] are an instance of a data set where the external interventions differ between the environments in J and might act on several compounds simultaneously [18].
There are nine different experimental conditions, each containing roughly 800 observations which correspond to measurements of the concentration of biochemical agents in single cells. The first setting corresponds to purely observational data.

In addition to the original work by [22], the data set has been described and analyzed in [18] and [26]. We compare against the results of [26], [22] and the "well-established consensus", according to [22], shown in Figures 3(a) and 3(b). Figure 3(c) shows the (thresholded) BACKSHIFT point estimate. Most of the retrieved edges were also found in at least one of the previous studies. Five edges are reversed in our estimate and three edges were not discovered previously. Figure 3(d) shows the corresponding stability selection result with the expected number of falsely selected variables E(V) = 2. This estimate is sparser in comparison to the other ones as it bounds the number of false discoveries. Notably, the feedback loops between PIP2 ↔ PLCg and PKC ↔ JNK were also found in [26].

[Figure 4 appears here; its panels show time series for the three indices DAX, S&P 500 and NASDAQ over 2000-2011: (a) log-prices, (b) daily log-returns, and the estimated intervention variances for (c) BACKSHIFT and (d) LING.]

Figure 4: Financial time series with three stock indices: NASDAQ (blue; technology index), S&P 500 (green; American equities) and DAX (red; German equities). (a) Prices of the three indices between May 2000 and end of 2011 on a logarithmic scale. (b) The scaled log-returns (daily change in log-price) of the three instruments are shown. Three periods of increased volatility are visible, starting with the dot-com bust on the left to the financial crisis in 2008 and the August 2011 downturn. (c) The scaled estimated intervention variance with the estimated BACKSHIFT network. The three downturns are clearly separated as originating in technology, American and European equities. (d) In contrast, the analogous LING estimated intervention variances have a peak in American equities intervention variance during the European debt crisis in 2011.

It is also noteworthy that we can check the model assumption of shift interventions, which is important for these data as the interventions can be thought of as changing the mechanism or activity of a biochemical agent rather than regulating the biomarker directly [26]. If shift interventions are not appropriate, we are in general not able to diagonalize the differences in the covariance matrices. Large off-diagonal elements in the estimate of the r.h.s. of (7) indicate a mechanism change that is not just explained by a shift intervention as in (3). In four of the seven intervention environments with known intervention targets, the largest mechanism violation happens directly at the presumed intervention target; see Appendix C for details. It is worth noting again that the presumed intervention targets had not been used in reconstructing either the network or the mechanism violations.

4.3 Financial time series

Finally, we present an application to financial time series where the environment is clearly changing over time. We consider daily data from the three stock indices NASDAQ, S&P 500 and DAX for a period between 2000-2012 and group the data into 74 overlapping blocks of 61 consecutive days each.
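The time-grouping described in Section 2.1(b) can be sketched as follows (our own Python/NumPy illustration: the data are simulated stand-ins for the daily log-returns, the block length of 61 days is the value quoted above, and the stride of 40 days is an assumption chosen only to reproduce the block count):

```python
import numpy as np

def overlapping_blocks(X, block_len, stride):
    """Group a (T x p) time series into overlapping blocks of `block_len`
    consecutive rows; each block plays the role of one environment j."""
    return [X[s:s + block_len] for s in range(0, X.shape[0] - block_len + 1, stride)]

rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 3))            # simulated stand-in for daily log-returns
blocks = overlapping_blocks(X, 61, 40)    # 61-day blocks; stride 40 is our assumption
covs = [np.cov(b, rowvar=False) for b in blocks]
```

With roughly 3000 trading days and these parameters, this yields 74 blocks; the per-block covariance matrices `covs` are then the inputs to Step 1 of Algorithm 1.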
We take log-returns, as shown in panel (b) of Figure 4, and estimate the connectivity matrix, which is fully connected in this case and perhaps not of much interest in itself. It allows us, however, to estimate the intervention strength at each of the indices according to (12), shown in panel (c). The intervention variances separate the origins of the three major downturns of the markets in this period very well. Technology is correctly estimated by BACKSHIFT to be at the epicenter of the dot-com crash in 2001 (NASDAQ as proxy), American equities during the financial crisis in 2008 (proxy is S&P 500) and European instruments (DAX as best proxy) during the August 2011 downturn.

5 Conclusion

We have shown that cyclic causal networks can be estimated if we obtain covariance matrices of the variables under unknown shift interventions in different environments. BACKSHIFT leverages solutions to the linear assignment problem and joint matrix diagonalization, and the part of the computational cost that depends on the number of variables is at worst cubic. We have shown sufficient and necessary conditions under which the network is fully identifiable, which require observations from at least three different environments. The strength and location of interventions can also be reconstructed.

References

[1] K.A. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, New York, USA, 1989.

[2] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, USA, 2nd edition, 2000.

[3] D.M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.

[4] M.H. Maathuis, M. Kalisch, and P. Bühlmann. Estimating high-dimensional intervention effects from observational data. Annals of Statistics, 37:3133–3164, 2009.

[5] A. Hauser and P. Bühlmann.
Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13:2409–2464, 2012.

[6] P.O. Hoyer, D. Janzing, J.M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21 (NIPS), pages 689–696, 2009.

[7] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P.O. Hoyer, and K. Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12:1225–1248, 2011.

[8] J.M. Mooij, D. Janzing, T. Heskes, and B. Schölkopf. On causal discovery with cyclic additive noise models. In Advances in Neural Information Processing Systems 24 (NIPS), pages 639–647, 2011.

[9] A. Hyttinen, F. Eberhardt, and P.O. Hoyer. Learning linear cyclic causal models with latent variables. Journal of Machine Learning Research, 13:3387–3439, 2012.

[10] S.L. Lauritzen and T.S. Richardson. Chain graph models and their causal interpretations. Journal of the Royal Statistical Society, Series B, 64:321–348, 2002.

[11] G. Lacerda, P. Spirtes, J. Ramsey, and P.O. Hoyer. Discovering cyclic causal models by independent components analysis. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI), pages 366–374, 2008.

[12] R. Scheines, F. Eberhardt, and P.O. Hoyer. Combining experiments to discover linear cyclic models with latent variables. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 185–192, 2010.

[13] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, USA, 2nd edition, 2009.

[14] F. Eberhardt, P.O. Hoyer, and R. Scheines. Combining experiments to discover linear cyclic models with latent variables.
In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 185–192, 2010.

[15] J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B, to appear, 2015.

[16] A.L. Jackson, S.R. Bartz, J. Schelter, S.V. Kobayashi, J. Burchard, M. Mao, B. Li, G. Cavet, and P.S. Linsley. Expression profiling reveals off-target gene regulation by RNAi. Nature Biotechnology, 21:635–637, 2003.

[17] M.M. Kulkarni, M. Booker, S.J. Silver, A. Friedman, P. Hong, N. Perrimon, and B. Mathey-Prevot. Evidence of off-target effects associated with long dsRNAs in Drosophila melanogaster cell-based assays. Nature Methods, 3:833–838, 2006.

[18] D. Eaton and K. Murphy. Exact Bayesian structure learning from uncertain interventions. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 107–114, 2007.

[19] F. Eberhardt and R. Scheines. Interventions and causal inference. Philosophy of Science, 74:981–995, 2007.

[20] K. Korb, L. Hope, A. Nicholson, and K. Axnick. Varieties of causal intervention. In Proceedings of the Pacific Rim Conference on AI, pages 322–331, 2004.

[21] J. Tian and J. Pearl. Causal discovery from changes. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 512–522, 2001.

[22] K. Sachs, O. Perez, D. Pe'er, D. Lauffenburger, and G. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523–529, 2005.

[23] A. Ziehe, P. Laskov, G. Nolte, and K.-R. Müller. A fast algorithm for joint diagonalization with non-orthogonal transformations and its application to blind source separation. Journal of Machine Learning Research, 5:801–818, 2004.

[24] R.E. Burkard.
Quadratic assignment problems. In P.M. Pardalos, D.-Z. Du, and R.L. Graham, editors, Handbook of Combinatorial Optimization, pages 2741–2814. Springer New York, 2nd edition, 2013.

[25] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society, Series B, 72:417–473, 2010.

[26] J.M. Mooij and T. Heskes. Cyclic causal discovery from continuous equilibrium data. In Proceedings of the 29th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 431–439, 2013.