{"title": "Nonlinear causal discovery with additive noise models", "book": "Advances in Neural Information Processing Systems", "page_first": 689, "page_last": 696, "abstract": "The discovery of causal relationships between a set of observed variables is a fundamental problem in science. For continuous-valued data linear acyclic causal models are often used because these models are well understood and there are well-known methods to fit them to data. In reality, of course, many causal relationships are more or less nonlinear, raising some doubts as to the applicability and usefulness of purely linear methods. In this contribution we show that the basic linear framework can be generalized to nonlinear models with additive noise. In this extended framework, nonlinearities in the data-generating process are in fact a blessing rather than a curse, as they typically provide information on the underlying causal system and allow more aspects of the true data-generating mechanisms to be identified. In addition to theoretical results we show simulations and some simple real data experiments illustrating the identification power provided by nonlinearities.", "full_text": "Nonlinear causal discovery with additive noise models\n\nPatrik O. Hoyer\nUniversity of Helsinki\nFinland\n\nDominik Janzing\nMPI for Biological Cybernetics\nT\u00fcbingen, Germany\n\nJoris Mooij\nMPI for Biological Cybernetics\nT\u00fcbingen, Germany\n\nJonas Peters\nMPI for Biological Cybernetics\nT\u00fcbingen, Germany\n\nBernhard Sch\u00f6lkopf\nMPI for Biological Cybernetics\nT\u00fcbingen, Germany\n\nAbstract\n\nThe discovery of causal relationships between a set of observed variables is a fundamental problem in science. For continuous-valued data linear acyclic causal models with additive noise are often used because these models are well understood and there are well-known methods to fit them to data. 
In reality, of course, many causal relationships are more or less nonlinear, raising some doubts as to the applicability and usefulness of purely linear methods. In this contribution we show that the basic linear framework can be generalized to nonlinear models. In this extended framework, nonlinearities in the data-generating process are in fact a blessing rather than a curse, as they typically provide information on the underlying causal system and allow more aspects of the true data-generating mechanisms to be identified. In addition to theoretical results we show simulations and some simple real data experiments illustrating the identification power provided by nonlinearities.\n\n1 Introduction\n\nCausal relationships are fundamental to science because they enable predictions of the consequences of actions [1]. While controlled randomized experiments constitute the primary tool for identifying causal relationships, such experiments are in many cases either unethical, too expensive, or technically impossible. The development of causal discovery methods to infer causal relationships from uncontrolled data constitutes an important current research topic [1, 2, 3, 4, 5, 6, 7, 8]. If the observed data is continuous-valued, methods based on linear causal models (aka structural equation models) are commonly applied [1, 2, 9]. This is not necessarily because the true causal relationships are really believed to be linear, but rather it reflects the fact that linear models are well understood and easy to work with. A standard approach is to estimate a so-called Markov equivalence class of directed acyclic graphs (all representing the same conditional independencies) from the data [1, 2, 3]. For continuous variables, the independence tests often assume linear models with additive Gaussian noise [2]. 
Recently, however, it has been shown that for linear models, non-Gaussianity in the data can actually aid in distinguishing the causal directions and allow one to uniquely identify the generating graph under favourable conditions [7]. Thus the practical case of non-Gaussian data, which long was considered a nuisance, turned out to be helpful in the causal discovery setting.\n\nIn this contribution we show that nonlinearities can play a role quite similar to that of non-Gaussianity: when causal relationships are nonlinear, this typically helps break the symmetry between the observed variables and allows the identification of causal directions. As Friedman and Nachman have pointed out [10], non-invertible functional relationships between the observed variables can provide clues to the generating causal model. However, we show that the phenomenon is much more general; for nonlinear models with additive noise almost any nonlinearities (invertible or not) will typically yield identifiable models. Note that other methods to select among Markov equivalent DAGs [11, 8] have (so far) mainly focused on mixtures of discrete and continuous variables.\n\nIn the next section, we start by defining the family of models under study, and then, in Section 3 we give theoretical results on the identifiability of these models from non-interventional data. 
We describe a practical method for inferring the generating model from a sample of data vectors in Section 4, and show its utility in simulations and on real data (Section 5).\n\n2 Model definition\n\nWe assume that the observed data has been generated in the following way: each observed variable xi is associated with a node i in a directed acyclic graph G, and the value of xi is obtained as a function of its parents in G, plus independent additive noise ni, i.e.\n\nxi := fi(xpa(i)) + ni ,   (1)\n\nwhere fi is an arbitrary function (possibly different for each i), xpa(i) is a vector containing the elements xj such that there is an edge from j to i in the DAG G, the noise variables ni may have arbitrary probability densities pni(ni), and the noise variables are jointly independent, that is, pn(n) = \u220fi pni(ni), where n denotes the vector containing the noise variables ni. Our data then consists of a number of vectors x sampled independently, each using G, the same functions fi, and the ni sampled independently from the same densities pni(ni).\n\nNote that this model includes the special case when all the fi are linear and all the pni are Gaussian, yielding the standard linear\u2013Gaussian model family [2, 3, 9]. When the functions are linear but the densities pni are non-Gaussian we obtain the linear\u2013non-Gaussian models described in [7].\n\nThe goal of causal discovery is, given the data vectors, to infer as much as possible about the generating mechanism; in particular, we seek to infer the generating graph G. In the next section we discuss the prospects of this task in the theoretical case where the joint distribution px(x) of the observed data can be estimated exactly. Later, in Section 4, we experimentally tackle the practical case of a finite-size data sample.\n\n3 Identifiability\n\nOur main theoretical results concern the simplest non-trivial graph: the case of two variables. 
The experimental results will, however, demonstrate that the basic principle works even in the general case of N variables.\n\nFigure 1 illustrates the basic identifiability principle for the two-variable model. Denoting the two variables x and y, we are considering the generative model y := f(x) + n where x and n are both Gaussian and statistically independent.\n\nFigure 1: Identification of causal direction based on constancy of conditionals. See main text for a detailed explanation of (a)\u2013(f). (g) shows an example of a joint density p(x, y) generated by a causal model x \u2192 y with y := f(x) + n where f is nonlinear, the supports of the densities px(x) and pn(n) are compact regions, and the function f is constant on each connected component of the support of px. The support of the joint density is now given by the two gray squares. Note that the input distribution px, the noise distribution pn and f can in fact be chosen such that the joint density is symmetrical with respect to the two variables, i.e. p(x, y) = p(y, x), making it obvious that there will also be a valid backward model.\n\nIn panel (a) we plot the joint density p(x, y) of the observed variables, for the linear case of f(x) = x. As a trivial consequence of the model, the conditional density p(y | x) has identical shape for all values of x and is simply shifted by the function f(x); this is illustrated in panel (b). In general, there is no reason to believe that this relationship would also hold for the conditionals p(x | y) for different values of y but, as is well known, for the linear\u2013Gaussian model this is actually the case, as illustrated in panel (c). Panels (d)\u2013(f) show the corresponding joint and conditional densities for the model with a nonlinear function f(x) = x + x^3. Notice how the conditionals p(x | y) look different for different values of y, indicating that a reverse causal model of the form x := g(y) + \u02dcn (with y and \u02dcn statistically independent) would not be able to fit the joint density. As we will show in this section, this will typically, though not always, be the case.\n\nTo see the latter, we first show that there exist models other than the linear\u2013Gaussian and the independent case which admit both a forward x \u2192 y and a backward x \u2190 y model. Panel (g) of Figure 1 presents a nonlinear functional model with additive non-Gaussian noise and non-Gaussian input distributions that nevertheless admits a backward model. The functions and probability densities can be chosen to be (arbitrarily many times) differentiable. Note that the example of panel (g) in Figure 1 is somewhat artificial: p has compact support, and x, y are independent inside the connected components of the support. 
Roughly speaking, the nonlinearity of f does not matter since it occurs where p is zero \u2014 an artificial situation which we avoid by requiring, from now on, that all probability densities are strictly positive. Moreover, we assume that all functions (including densities) are three times differentiable. In this case, the following theorem shows that for generic choices of f, px(x), and pn(n), there exists no backward model.\n\nTheorem 1 Let the joint probability density of x and y be given by\n\np(x, y) = pn(y \u2212 f(x)) px(x) ,   (2)\n\nwhere pn, px are probability densities on R. If there is a backward model of the same form, i.e.,\n\np(x, y) = p\u02dcn(x \u2212 g(y)) py(y) ,   (3)\n\nthen, denoting \u03bd := log pn and \u03be := log px, the triple (f, px, pn) must satisfy the following differential equation for all x, y with \u03bd''(y \u2212 f(x))f'(x) \u2260 0:\n\n\u03be''' = \u03be'' ( \u2212\u03bd'''f'/\u03bd'' + f''/f' ) \u2212 2\u03bd''f''f' + \u03bd'f''' + \u03bd'\u03bd'''f''f'/\u03bd'' \u2212 \u03bd'(f'')^2/f' ,   (4)\n\nwhere we have skipped the arguments y \u2212 f(x), x, and x for \u03bd, \u03be, and f and their derivatives, respectively. Moreover, if for a fixed pair (f, \u03bd) there exists y \u2208 R such that \u03bd''(y \u2212 f(x))f'(x) \u2260 0 for all but a countable set of points x \u2208 R, the set of all px for which p has a backward model is contained in a 3-dimensional affine space.\n\nLoosely speaking, the statement that the differential equation for \u03be has a 3-dimensional space of solutions (while a priori, the space of all possible log-marginals \u03be is infinite dimensional) amounts to saying that in the generic case, our forward model cannot be inverted.\n\nA simple corollary is that if both the marginal density px(x) and the noise density pn(y \u2212 f(x)) are Gaussian then the existence of a backward model implies linearity of f:\n\nCorollary 1 Assume that \u03bd''' = \u03be''' = 0 everywhere. If a backward model exists, then f is linear.\n\nThe proofs of Theorem 1 and Corollary 1 are provided in the Appendix.\n\nFinally, we note that even when f is linear and pn and px are non-Gaussian, although a linear backward model has previously been ruled out [7], there exist special cases where there is a nonlinear backward model with independent additive noise. One such case is when f(x) = \u2212x and px and pn are Gumbel distributions: px(x) = exp(\u2212x \u2212 exp(\u2212x)) and pn(n) = exp(\u2212n \u2212 exp(\u2212n)). Then taking py(y) = exp(\u2212y \u2212 2 log(1 + exp(\u2212y))), p\u02dcn(\u02dcn) = exp(\u22122\u02dcn \u2212 exp(\u2212\u02dcn)) and g(y) = log(1 + exp(\u2212y)) one obtains p(x, y) = pn(y \u2212 f(x))px(x) = p\u02dcn(x \u2212 g(y))py(y).\n\nAlthough the above results strictly only concern the two-variable case, there are strong reasons to believe that the general argument also holds for larger models. 
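The Gumbel special case above can be checked numerically; the following short sketch (illustrative code, not part of the paper) verifies that the forward and backward factorizations of p(x, y) agree pointwise:

```python
import math

def p_x(x):  return math.exp(-x - math.exp(-x))        # Gumbel cause density
def p_n(n):  return math.exp(-n - math.exp(-n))        # Gumbel noise density
def p_y(y):  return math.exp(-y - 2 * math.log(1 + math.exp(-y)))
def p_nt(m): return math.exp(-2 * m - math.exp(-m))    # backward noise density
def f(x):    return -x                                 # forward function
def g(y):    return math.log(1 + math.exp(-y))         # backward function

# Both factorizations reduce to exp(-2x - y - exp(-x) - exp(-x - y)),
# so the forward and backward additive-noise models fit the same joint density.
for x in (-1.5, 0.0, 0.7, 2.0):
    for y in (-2.0, 0.3, 1.1):
        forward = p_n(y - f(x)) * p_x(x)
        backward = p_nt(x - g(y)) * p_y(y)
        assert abs(forward - backward) < 1e-12
```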
In this brief contribution we do not pursue any further theoretical results; rather, we show empirically that the estimation principle can be extended to networks involving more than two variables.\n\n4 Model estimation\n\nSection 3 established for the two-variable case that given knowledge of the exact densities, the true model is (in the generic case) identifiable. We now consider practical estimation methods which infer the generating graph from sample data.\n\nAgain, we begin by considering the case of two observed scalar variables x and y. Our basic method is straightforward: first, test whether x and y are statistically independent. If they are not, we continue in the following manner: we test whether a model y := f(x) + n is consistent with the data, simply by doing a nonlinear regression of y on x (to get an estimate \u02c6f of f), calculating the corresponding residuals \u02c6n = y \u2212 \u02c6f(x), and testing whether \u02c6n is independent of x. If so, we accept the model y := f(x) + n; if not, we reject it. We then similarly test whether the reverse model x := g(y) + n fits the data.\n\nThe above procedure will result in one of several possible scenarios. First, if x and y are deemed mutually independent we infer that there is no causal relationship between the two, and no further analysis is performed. On the other hand, if they are dependent but both directional models are accepted, we conclude that either model may be correct but we cannot infer it from the data. A more positive result is when we are able to reject one of the directions and (tentatively) accept the other. 
Finally, it may be the case that neither direction is consistent with the data, in which case we conclude that the generating mechanism is more complex and cannot be described using this model.\n\nThis procedure can be generalized to an arbitrary number N of observed variables in the following way: for each DAG Gi over the observed variables, test whether it is consistent with the data by constructing a nonlinear regression of each variable on its parents, and subsequently testing whether the resulting residuals are mutually independent. If any independence test is rejected, Gi is rejected. On the other hand, if none of the independence tests are rejected, Gi is consistent with the data.\n\nThe above procedure is obviously feasible only for very small networks (roughly N \u2264 7) and also suffers from the problem of multiple hypothesis testing; an important future improvement would be to take this properly into account. Furthermore, the above algorithm returns all DAGs consistent with the data, including all those for which consistent subgraphs exist. Our current implementation removes any such unnecessarily complex graphs from the output.\n\nThe selection of the nonlinear regressor and of the particular independence tests is not constrained. Any prior information on the types of functional relationships or the distributions of the noise should optimally be utilized here. In our implementation, we perform the regression using Gaussian Processes [12] and the independence tests using kernel methods [13]. 
Note that one must take care to avoid overfitting, as overfitting may lead one to falsely accept models which should be rejected.\n\n5 Experiments\n\nTo show the ability of our method to find the correct model when all the assumptions hold, we have applied our implementation to a variety of simulated and real data.\n\nFor the regression, we used the GPML code from [14] corresponding to [12], using a Gaussian kernel and independent Gaussian noise, optimizing the hyperparameters for each regression individually.1 In principle, any regression method can be used; we have verified that our results do not depend significantly on the choice of the regression method by comparing with \u03bd-SVR [15] and with thin-plate spline kernel regression [16]. For the independence test, we implemented the HSIC [13] with a Gaussian kernel, where we used the gamma distribution as an approximation for the distribution of the HSIC under the null hypothesis of independence in order to calculate the p-value of the test result.\n\nSimulations. The main results for the two-variable case are shown in Figure 2. We simulated data using the model y = x + b x^3 + n; the random variables x and n were sampled from a Gaussian distribution and their absolute values were raised to the power q while keeping the original sign.\n\n1 The assumption of Gaussian noise is somewhat inconsistent with our general setting where the residuals are allowed to have any distribution (we even prefer the noise to be non-Gaussian); in practice, however, the regression yields acceptable results as long as the noise is sufficiently similar to Gaussian noise. 
In case of\nsigni\ufb01cant outliers, other regression methods may yield better results.\n\n\f(a)\n\n(b)\n\nFigure 2: Results of simulations (see main text for details): (a) The proportion of times the forward\nand the reverse model were accepted, paccept, as a function of the non-Gaussianity parameter q (for\nb = 0), and (b) as a function of the nonlinearity parameter b (for q = 1).\n\nThe parameter b controls the strength of the nonlinearity of the function, b = 0 corresponding to the\nlinear case. The parameter q controls the non-Gaussianity of the noise: q = 1 gives a Gaussian, while\nq > 1 and q < 1 produces super-Gaussian and sub-Gaussian distributions respectively. We used 300\n(x, y) samples for each trial and used a signi\ufb01cance level of 2% for rejecting the null hypothesis of\nindependence of residuals and cause. For each b value (or q value) we repeated the experiment 100\ntimes in order to estimate the acceptance probabilities. Panel (a) shows that our method can solve the\nwell-known linear but non-Gaussian special case [7]. By plotting the acceptance probability of the\ncorrect and the reverse model as a function of non-Gaussianity we can see that when the distributions\nare suf\ufb01ciently non-Gaussian the method is able to infer the correct causal direction. Then, in panel\n(b) we similarly demonstrate that we can identify the correct direction for the Gaussian marginal and\nGaussian noise model when the functional relationship is suf\ufb01ciently nonlinear. Note in particular,\nthat the model is identi\ufb01able also for positive b in which case the function is invertible, indicating\nthat non-invertibility is not a necessary condition for identi\ufb01cation.\nWe also did experiments for 4 variables w, x, y and z with a diamond-like causal\nstructure. 
We took w \u223c U(\u22123, 3), x = w^2 + nx with nx \u223c U(\u22121, 1), y = 4\u221a|w| + ny with ny \u223c U(\u22121, 1), and z = 2 sin x + 2 sin y + nz with nz \u223c U(\u22121, 1). We sampled 500 (w, x, y, z) tuples from the model and applied the algorithm described in Section 4 in order to reconstruct the DAG structure. The simplest DAG that was consistent with the data (with significance level 2% for each test) turned out to be precisely the true causal structure. All five other DAGs for which the true DAG is a subgraph were also consistent with the data.\n\nReal-world data. Of particular empirical interest is how well the proposed method performs on real-world datasets for which the assumptions of our method might only hold approximately. Due to space constraints we only discuss three real-world datasets here.\n\nThe first dataset, the \u201cOld Faithful\u201d dataset [17], contains data about the duration of an eruption and the time interval between subsequent eruptions of the Old Faithful geyser in Yellowstone National Park, USA. Our method obtains a p-value of 0.5 for the (forward) model \u201ccurrent duration causes next interval length\u201d and a p-value of 4.4 \u00d7 10^-9 for the (backward) model \u201cnext interval length causes current duration\u201d. Thus, we accept the model where the time interval between the current and the next eruption is a function of the duration of the current eruption, but reject the reverse model. This is in line with the chronological ordering of these events. Figure 3 illustrates the data, the forward and backward fit, and the residuals for both fits. Note that for the forward model the residuals seem to be independent of the duration, whereas for the backward model the residuals are clearly dependent on the interval length. 
Time-shifting the data by one time step, we obtain for the (forward) model \u201ccurrent interval length causes next duration\u201d a p-value smaller than 10^-15 and for the (backward) model \u201cnext duration causes current interval length\u201d a p-value of 1.8 \u00d7 10^-8. Hence, our simple nonlinear model with independent additive noise is not consistent with the data in either direction.\n\nFigure 3: The Old Faithful Geyser data: (a) forward fit corresponding to \u201ccurrent duration causes next interval length\u201d; (b) residuals for the forward fit; (c) backward fit corresponding to \u201cnext interval length causes current duration\u201d; (d) residuals for the backward fit.\n\nFigure 4: Abalone data: (a) forward fit corresponding to \u201cage (rings) causes length\u201d; (b) residuals for the forward fit; (c) backward fit corresponding to \u201clength causes age (rings)\u201d; (d) residuals for the backward fit.\n\nFigure 5: Altitude\u2013temperature data: (a) forward fit corresponding to \u201caltitude causes temperature\u201d; (b) residuals for the forward fit; (c) backward fit corresponding to \u201ctemperature causes altitude\u201d; (d) residuals for the backward fit.\n\nThe second dataset, the \u201cAbalone\u201d dataset from the UCI ML repository [18], contains measurements of the number of rings in the shell of abalone (a group of shellfish), which indicate their age, and the length of the shell. Figure 4 shows the results for a subsample of 500 datapoints. The correct model \u201cage causes length\u201d leads to a p-value of 0.19, while the reverse model \u201clength causes age\u201d comes with p < 10^-15. This is in accordance with our intuition. 
Note that our method favors the correct direction although the assumption of independent additive noise is only approximately correct here; indeed, the variance of the length is dependent on age.\n\nFinally, we assay the method on a simple example involving two observed variables: the altitude above sea level (in meters) and the local yearly average outdoor temperature in centigrade, for 349 weather stations in Germany, collected over the time period 1961\u20131990 [19]. The correct model \u201caltitude causes temperature\u201d leads to p = 0.017, while \u201ctemperature causes altitude\u201d can clearly be rejected (p = 8 \u00d7 10^-15), in agreement with common understanding of causality in this case. The results are shown in Figure 5.\n\n6 Conclusions\n\nIn this paper, we have shown that the linear\u2013non-Gaussian causal discovery framework can be generalized to admit nonlinear functional dependencies as long as the noise on the variables remains additive. In this approach nonlinear relationships are in fact helpful rather than a hindrance, as they tend to break the symmetry between the variables and allow the correct causal directions to be identified. Although there exist special cases which admit reverse models, we have shown that in the generic case the model is identifiable. 
We have illustrated our method on both simulated and real-world datasets.\n\nAcknowledgments\n\nWe thank Kun Zhang for pointing out an error in the original manuscript. This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. P.O.H. was supported by the Academy of Finland and by University of Helsinki Research Funds.\n\nA Proof of Theorem 1\n\nSet\n\n\u03c0(x, y) := log p(x, y) = \u03bd(y \u2212 f(x)) + \u03be(x) ,   (5)\n\nand \u02dc\u03bd := log p\u02dcn, \u03b7 := log py. If eq. (3) holds, then \u03c0(x, y) = \u02dc\u03bd(x \u2212 g(y)) + \u03b7(y), implying\n\n\u2202^2\u03c0/(\u2202x\u2202y) = \u2212\u02dc\u03bd''(x \u2212 g(y)) g'(y)   and   \u2202^2\u03c0/\u2202x^2 = \u02dc\u03bd''(x \u2212 g(y)) .\n\nWe conclude\n\n(\u2202/\u2202x) [ (\u2202^2\u03c0/\u2202x^2) / (\u2202^2\u03c0/(\u2202x\u2202y)) ] = 0 .   (6)\n\nUsing eq. (5) we obtain\n\n\u2202^2\u03c0/(\u2202x\u2202y) = \u2212\u03bd''(y \u2212 f(x)) f'(x) ,   (7)\n\nand\n\n\u2202^2\u03c0/\u2202x^2 = (\u2202/\u2202x)(\u2212\u03bd'(y \u2212 f(x)) f'(x) + \u03be'(x)) = \u03bd''(f')^2 \u2212 \u03bd'f'' + \u03be'' ,   (8)\n\nwhere we have dropped the arguments for convenience. Combining eqs. (7) and (8) yields\n\n(\u2202/\u2202x) [ (\u2202^2\u03c0/\u2202x^2) / (\u2202^2\u03c0/(\u2202x\u2202y)) ] = \u22122f'' + \u03bd'f'''/(\u03bd''f') \u2212 \u03be'''/(\u03bd''f') + \u03bd'\u03bd'''f''/(\u03bd'')^2 \u2212 \u03bd'(f'')^2/(\u03bd''(f')^2) \u2212 \u03be''\u03bd'''/(\u03bd'')^2 + \u03be''f''/(\u03bd''(f')^2) .\n\nDue to eq. (6) this expression must vanish, and we obtain DE (4) by term reordering. Given f, \u03bd, we obtain for every fixed y a linear inhomogeneous DE for \u03be:\n\n\u03be'''(x) = \u03be''(x)G(x, y) + H(x, y) ,   (9)\n\nwhere G and H are defined by\n\nG := \u2212\u03bd'''f'/\u03bd'' + f''/f'   and   H := \u22122\u03bd''f''f' + \u03bd'f''' + \u03bd'\u03bd'''f''f'/\u03bd'' \u2212 \u03bd'(f'')^2/f' .\n\nSetting z := \u03be'' we have z'(x) = z(x)G(x, y) + H(x, y). Given that such a function z exists, it is given by\n\nz(x) = z(x0) exp(\u222b_{x0}^{x} G(\u02dcx, y) d\u02dcx) + \u222b_{x0}^{x} exp(\u222b_{\u02c6x}^{x} G(\u02dcx, y) d\u02dcx) H(\u02c6x, y) d\u02c6x .   (10)\n\nLet y be fixed such that \u03bd''(y \u2212 f(x))f'(x) \u2260 0 holds for all but countably many x. Then z is determined by z(x0), since we can extend eq. (10) to the remaining points. The set of all functions \u03be satisfying the linear inhomogeneous DE (9) is a 3-dimensional affine space: once we have fixed \u03be(x0), \u03be'(x0), \u03be''(x0) for some arbitrary point x0, \u03be is completely determined. Given fixed f and \u03bd, the set of all \u03be admitting a backward model is contained in this subspace. \u25a1\n\nB Proof of Corollary 1\n\nSimilarly to how (6) was derived, under the assumption of the existence of a reverse model one can derive\n\n(\u2202/\u2202x)(\u2202^2\u03c0/\u2202x^2) \u00b7 \u2202^2\u03c0/(\u2202x\u2202y) = \u2202^2\u03c0/\u2202x^2 \u00b7 (\u2202/\u2202x)(\u2202^2\u03c0/(\u2202x\u2202y)) .\n\nNow using (7) and (8), we obtain\n\n(\u2212\u03bd''f') \u00b7 (\u2202/\u2202x)(\u03bd''(f')^2 \u2212 \u03bd'f'' + \u03be'') = (\u03bd''(f')^2 \u2212 \u03bd'f'' + \u03be'') \u00b7 (\u2202/\u2202x)(\u2212\u03bd''f') ,\n\nwhich reduces to\n\n\u22122(\u03bd''f')^2 f'' + \u03bd''f'\u03bd'f''' \u2212 \u03bd''f'\u03be''' = \u2212\u03bd'f''\u03bd'''(f')^2 + \u03be''\u03bd'''(f')^2 + \u03bd''\u03bd'(f'')^2 \u2212 \u03bd''f''\u03be'' .\n\nSubstituting the assumptions \u03be''' = 0 and \u03bd''' = 0 (and hence \u03bd'' = C everywhere, with C \u2260 0 since otherwise \u03bd cannot be a proper log-density) yields\n\n\u03bd'(y \u2212 f(x)) \u00b7 (f'f''' \u2212 (f'')^2) = 2C(f')^2 f'' \u2212 f''\u03be'' .\n\nSince C \u2260 0 there exists an \u03b1 such that \u03bd'(\u03b1) = 0. Then, restricting ourselves to the submanifold {(x, y) \u2208 R^2 : y \u2212 f(x) = \u03b1} on which \u03bd' = 0, we have\n\n0 = f''(2C(f')^2 \u2212 \u03be'') .\n\nTherefore, for all x in the open set [f'' \u2260 0], we have (f'(x))^2 = \u03be''/(2C), which is a constant, so f'' = 0 on [f'' \u2260 0]: a contradiction. Therefore, f'' = 0 everywhere. \u25a1\n\nReferences\n\n[1] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.\n\n[2] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993. (2nd ed. MIT Press, 2000.)\n\n[3] D. Geiger and D. Heckerman. Learning Gaussian networks. In Proc. of the 10th Annual Conference on Uncertainty in Artificial Intelligence, pages 235\u2013243, 1994.\n\n[4] D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. In C. Glymour and G. F. Cooper, editors, Computation, Causation, and Discovery, pages 141\u2013166. MIT Press, 1999.\n\n[5] T. Richardson and P. Spirtes. Automated discovery of linear feedback models. In C. Glymour and G. F. Cooper, editors, Computation, Causation, and Discovery, pages 253\u2013304. MIT Press, 1999.\n\n[6] R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191\u2013246, 2006.\n\n[7] S. Shimizu, P. O. Hoyer, A. Hyv\u00e4rinen, and A. J. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003\u20132030, 2006.\n\n[8] X. Sun, D. Janzing, and B. Sch\u00f6lkopf. 
Distinguishing between cause and effect via kernel-based complexity measures for conditional probability densities. Neurocomputing, pages 1248\u20131256, 2008.\n\n[9] K. A. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 1989.\n\n[10] N. Friedman and I. Nachman. Gaussian process networks. In Proc. of the 16th Annual Conference on Uncertainty in Artificial Intelligence, pages 211\u2013219, 2000.\n\n[11] X. Sun, D. Janzing, and B. Sch\u00f6lkopf. Causal inference by choosing graphs with most plausible Markov kernels. In Proceedings of the 9th Int. Symp. on Art. Int. and Math., Fort Lauderdale, Florida, 2006.\n\n[12] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[13] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Sch\u00f6lkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075\u20132129, 2005.\n\n[14] GPML code. http://www.gaussianprocess.org/gpml/code.\n\n[15] B. Sch\u00f6lkopf, A. J. Smola, and R. Williamson. Shrinking the tube: A new support vector regression algorithm. In Advances in Neural Information Processing Systems 11 (Proc. NIPS*1998). MIT Press, 1999.\n\n[16] G. Wahba. Spline Models for Observational Data. Series in Applied Math., Vol. 59, SIAM, Philadelphia, 1990.\n\n[17] A. Azzalini and A. W. Bowman. A look at some data on the Old Faithful Geyser. Applied Statistics, 39(3):357\u2013365, 1990.\n\n[18] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.\n\n[19] Climate data collected by the Deutscher Wetter Dienst. 
http://www.dwd.de/.", "award": [], "sourceid": 266, "authors": [{"given_name": "Patrik", "family_name": "Hoyer", "institution": null}, {"given_name": "Dominik", "family_name": "Janzing", "institution": null}, {"given_name": "Joris", "family_name": "Mooij", "institution": null}, {"given_name": "Jonas", "family_name": "Peters", "institution": ""}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}