{"title": "Causal Inference and Mechanism Clustering of A Mixture of Additive Noise Models", "book": "Advances in Neural Information Processing Systems", "page_first": 5206, "page_last": 5216, "abstract": "The inference of the causal relationship between a pair of observed variables is a fundamental problem in science, and most existing approaches are based on one single causal model. In practice, however, observations are often collected from multiple sources with heterogeneous causal models due to certain uncontrollable factors, which renders causal analysis results obtained by a single model skeptical. In this paper, we generalize the Additive Noise Model (ANM) to a mixture model, which consists of a finite number of ANMs, and provide the condition of its causal identifiability. To conduct model estimation, we propose Gaussian Process Partially Observable Model (GPPOM), and incorporate independence enforcement into it to learn latent parameter associated with each observation. Causal inference and clustering according to the underlying generating mechanisms of the mixture model are addressed in this work. Experiments on synthetic and real data demonstrate the effectiveness of our proposed approach.", "full_text": "Causal Inference and Mechanism Clustering of A\n\nMixture of Additive Noise Models\n\nShoubo Hu\u2217, Zhitang Chen\u2020, Vahid Partovi Nia\u2020, Laiwan Chan\u2217, Yanhui Geng\u2021\n\n\u2217The Chinese University of Hong Kong; \u2020Huawei Noah\u2019s Ark Lab;\n\n\u2021Huawei Montr\u00e9al Research Center\n\u2217{sbhu, lwchan}@cse.cuhk.edu.hk\n\n\u2020\u2021{chenzhitang2, vahid.partovinia, geng.yanhui}@huawei.com\n\nAbstract\n\nThe inference of the causal relationship between a pair of observed variables is a\nfundamental problem in science, and most existing approaches are based on one\nsingle causal model. 
In practice, however, observations are often collected from\nmultiple sources with heterogeneous causal models due to certain uncontrollable\nfactors, which renders causal analysis results obtained by a single model skeptical.\nIn this paper, we generalize the Additive Noise Model (ANM) to a mixture model,\nwhich consists of a \ufb01nite number of ANMs, and provide the condition of its causal\nidenti\ufb01ability. To conduct model estimation, we propose Gaussian Process Partially\nObservable Model (GPPOM), and incorporate independence enforcement into it\nto learn latent parameter associated with each observation. Causal inference and\nclustering according to the underlying generating mechanisms of the mixture model\nare addressed in this work. Experiments on synthetic and real data demonstrate the\neffectiveness of our proposed approach.\n\n1\n\nIntroduction\n\nUnderstanding the data-generating mechanism (g.m.) has been a main theme of causal inference. To\ninfer the causal direction between two random variables (r.v.s) X and Y using passive observations,\nmost existing approaches \ufb01rst model the relation between them using a functional model with\ncertain assumptions [18, 6, 21, 8]. Then a certain asymmetric property (usually termed cause-effect\nasymmetry), which only holds in the causal direction, is derived to conduct inference. For example,\nthe additive noise model (ANM) [6] represents the effect as a function of the cause with an additive\nindependent noise: Y = f (X) + \u0001. It is shown in [6] that there is no model of the form X = g(Y ) + \u02dc\u0001\nthat admits an ANM in the anticausal direction for most combinations (f, p(X), p(\u0001)).\nSimilar to ANM, most causal inference approaches based on functional models, such as LiNGAM\n[18], PNL [21], and IGCI [9], assume a single causal model for all observations. 
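The cause-effect asymmetry exploited by ANM-style methods can be illustrated numerically: regress each variable on the other and compare how strongly the residual depends on the regression input. The sketch below is our own illustration, not the paper's procedure; it uses a polynomial fit as a stand-in for nonparametric regression and a biased HSIC estimator with fixed-width RBF kernels, and all function names are ours.

```python
import numpy as np

def rbf_gram(v, gamma=1.0):
    # Gram matrix of the RBF kernel k(a, b) = exp(-gamma * (a - b)^2).
    d = v[:, None] - v[None, :]
    return np.exp(-gamma * d ** 2)

def hsic_b(u, v):
    # Biased empirical HSIC estimator tr(KHLH) / N^2 (cf. eq. (6) in section 3.1).
    n = len(u)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_gram(u) @ H @ rbf_gram(v) @ H) / n ** 2

def std(v):
    return (v - v.mean()) / v.std()

def residual_dependence(inp, out, deg=5):
    # Fit out = f(inp) + residual with a polynomial stand-in for f,
    # then measure the dependence between the input and the residual.
    resid = out - np.polyval(np.polyfit(inp, out, deg), inp)
    return hsic_b(std(inp), std(resid))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 300)
y = x ** 2 + rng.uniform(-0.1, 0.1, 300)   # a single ANM with X -> Y

forward = residual_dependence(x, y)        # residual nearly independent of X
backward = residual_dependence(y, x)       # no backward ANM fits here
```

On data from a single forward ANM, the forward residual dependence is close to the independence baseline while the backward one is not, which is the asymmetry used for inference.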
However, there is no such guarantee in practice, and it is common for observations to be generated by a mixture of causal models owing to different data sources or data collection under different conditions, which renders existing approaches based on a single causal model inapplicable in many problems (e.g., Fig. 1). Recently, an approach was proposed for inferring the causal direction of mixtures of ANMs with discrete variables [12]. However, the inference of such mixture models with continuous variables remains a challenging problem and is not yet well studied.
Another question regarding mixture models addressed in this paper is how one could reveal causal knowledge in clustering tasks. Specifically, we aim at finding clusters consistent with the causal g.m.s of a mixture model, which is often vital in the preliminary phase of research. For example, in the analysis of air data (see section 4.2 for details), discovering knowledge from air data combined from several different regions (i.e., mechanisms from a causal perspective) is much more difficult than

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Example illustrating the failure of ANM on a mixture of ANMs. (a) The distribution of data generated from M1: Y = X^2 + ε (red) and M2: Y = X^5 + ε (blue), where X ∼ U(0, 1) (x-axis) and ε ∼ U(−0.1, 0.1); (b) the conditional p(Y | X = 0.2); (c) the conditional p(Y | X = 0.6). When the data are generated from a mixture of ANMs, the consistency of the conditionals is likely to be violated, which leads to the failure of ANM.

from data of each region separately. Most existing clustering algorithms are ill-suited to this task, as they typically define similarity between observations in the form of distances in some space or manifold.
Most of them neglect the relation among the r.v.s within a feature vector (observation) and only use the feature dimensions to calculate an overall distance metric as the clustering criterion.
In this paper, we focus on analyzing observations generated by a mixture of ANMs of two r.v.s and try to answer two questions: 1) causal inference: how can we infer the causal direction between the two r.v.s? 2) mechanism clustering: how can we cluster together the observations generated from the same g.m.? To answer these questions, first, as the main result of this paper, we show that the causal direction of the mixture of ANMs is identifiable in most cases, and we propose a variant of GP-LVM [10] named the Gaussian Process Partially Observable Model (GPPOM) for model estimation, based on which we further develop the algorithms for causal inference and mechanism clustering.
The rest of the paper is organized as follows: in section 2, we formalize the model, show its identifiability and elaborate mechanism clustering; in section 3, the model estimation method is proposed; we present experiments on synthetic and real-world data in section 4 and conclude in section 5.

2 ANM Mixture Model

2.1 Model definition

Each observation is assumed to be generated from an ANM, and the entire data set is generated by a finite number of related ANMs. This collection is called the ANM Mixture Model (ANM-MM) and is formally defined as:
Definition 1 (ANM Mixture Model). An ANM Mixture Model is a set of causal models of the same causal direction between two continuous r.v.s X and Y.
All causal models share the same form given by the following ANM:

Y = f(X; θ) + ε,  (1)

where X denotes the cause, Y denotes the effect, f is a nonlinear function parameterized by θ, and the noise ε ⊥⊥ X. The differences between the causal models in an ANM-MM stem only from different values of the function parameter θ. In ANM-MM, θ is assumed to be drawn from a discrete distribution on a finite set Θ = {θ_1, ..., θ_C}, i.e. θ ∼ p_θ(θ) = \sum_{c=1}^{C} a_c 1_{θ_c}(·), where a_c > 0, \sum_{c=1}^{C} a_c = 1, and 1_{θ_c}(·) is the indicator function of the single value θ_c.

[Figure 2: ANM Mixture Model (directed graphical model; nodes X, f_n, y_n, θ, ε_n, β, with a plate over n = 1, . . . , N).]

Obviously, in ANM-MM all observations are generated by a set of g.m.s which share the same function form (f) but differ in parameter values (θ). This model is inspired by commonly encountered cases where the data-generating process is slightly different in each independent trial due to the influence of certain external factors that one can hardly control. In addition, these factors are usually believed to be independent of the observed variables. The data-generating process of ANM-MM can be represented by the directed graph in Fig. 2.

2.2 Causal inference: identifiability of ANM-MM

Let X be the cause and Y be the effect (X → Y) without loss of generality. As in most recently proposed causal inference approaches, the following postulate, originally proposed in [1], is adopted in the analysis of ANM-MM.

Postulate 1 (Independence of input and function). If X → Y, the distribution of X and the function f mapping X to Y are independent, since they correspond to independent mechanisms of nature.

From a general perspective, postulate 1 essentially claims independence between the cause (X) and the mechanism mapping the cause to the effect [9].
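Definition 1 is straightforward to simulate. The sketch below is our own illustration (the mechanism family, cause distribution, and noise scale are arbitrary choices, not prescribed by the paper): θ is drawn from a discrete distribution over {θ_1, ..., θ_C} with weights a_c, and Y = f(X; θ) + ε with ε independent of X.

```python
import numpy as np

def sample_anm_mm(n, thetas, weights, f, rng):
    """Draw n observations from an ANM-MM (Definition 1):
    theta ~ sum_c a_c * 1_{theta_c}(.), Y = f(X; theta) + e, e independent of X."""
    thetas = np.asarray(thetas, dtype=float)
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights > 0) and np.isclose(weights.sum(), 1.0)
    x = rng.uniform(0, 1, n)                        # cause
    comp = rng.choice(len(thetas), size=n, p=weights)  # which mechanism fired
    e = rng.normal(0, 0.05, n)                      # additive noise, independent of X
    y = f(x, thetas[comp]) + e
    return x, y, comp

# Example with the mechanism f3(x; theta) = exp(-theta * x) used later in
# section 4, two components around 1.05 and 3.05, mixed half and half.
rng = np.random.default_rng(0)
x, y, comp = sample_anm_mm(500, [1.05, 3.05], [0.5, 0.5],
                           lambda x, t: np.exp(-t * x), rng)
```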
In ANM-MM, we interpret the independence between the cause and the mechanism in an intuitive way: θ, as the function parameter, captures all variability of the mechanisms f, so it should be independent of the cause X according to postulate 1. Based on the independence between X and θ, a cause-effect asymmetry can be derived to infer the causal direction.
Since an ANM-MM consists of a set of ANMs, the identifiability result of ANM-MM can be obtained as a simple corollary of that in [6] when the number of ANMs (C) is equal in both directions and there is a one-to-one correspondence between the mechanisms in the forward and backward ANM-MM. In this case, the condition for the ANM-MM to be unidentifiable is that C ordinary differential equations given in [6] are fulfilled simultaneously, which can hardly happen in a generic case. However, C in the ANM-MM in the two directions need not be equal, and there may also exist a many-to-one correspondence between the ANMs in the two directions. In this case, the identifiability result cannot be derived as a simple corollary of [6]. To analyze the identifiability of ANM-MM, we first derive lemma 1 to find the condition for the existence of a many-to-one correspondence (a generalization of the condition given in [6]), and then conclude the identifiability result of ANM-MM (theorem 1) based on the condition in lemma 1. The condition under which there exists one backward ANM for a forward ANM-MM is:
Lemma 1. Let X → Y and let them follow an ANM-MM.
If there exists a backward ANM in the anti-causal direction, i.e.

X = g(Y) + ε̃,

then the cause distribution (p_X), the noise distribution (p_ε), the nonlinear function (f) and its parameter distribution (p_θ) should jointly fulfill the following ordinary differential equation (ODE)

ξ''' − (G(X, Y)/H(X, Y)) ξ'' = G(X, Y)V(X, Y)/U(X, Y) − H(X, Y),  (2)

where ξ := log p_X, and the definitions of G(X, Y), H(X, Y), V(X, Y) and U(X, Y) are provided in the supplementary due to the page limitation.

Sketch of proof. Since X and Y follow an ANM-MM, their joint density factorizes in the causal direction as p(X, Y) = \sum_{c=1}^{C} p(Y | X, θ_c) p_X(X) p_θ(θ_c) = p_X(X) \sum_{c=1}^{C} a_c p_ε(Y − f(X; θ_c)). If there exists a backward ANM in the anti-causal direction, i.e. X = g(Y) + ε̃, then p(X, Y) = p_ε̃(X − g(Y)) p_Y(Y), and in the backward ANM

∂/∂X [ (∂²π/∂X∂Y) / (∂²π/∂X²) ] = 0

holds, where π = log[p_ε̃(X − g(Y)) p_Y(Y)]. Since p(X, Y) should be the same under both factorizations, substituting p(X, Y) = p_X(X) \sum_{c=1}^{C} a_c p_ε(Y − f(X; θ_c)) into ∂/∂X [(∂²π/∂X∂Y)/(∂²π/∂X²)] = 0 yields the condition shown in (2).

The proof of lemma 1 follows the idea of the identifiability of ANM in [6] and is provided in the supplementary. Since the condition for one backward ANM to exist for a forward ANM-MM (mixture of ANMs) is more restrictive than that for a single forward ANM, which is the identifiability result in [6], lemma 1 indicates that a backward ANM is unlikely to exist in the anticausal direction if 1) X and Y follow an ANM-MM; 2) postulate 1 holds.
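The consequence of lemma 1 can be checked numerically on a toy ANM-MM: within each component the forward residual is (nearly) independent of the input, while no single backward regression on the pooled mixture leaves an independent residual. This is only an illustration under assumed settings (the mechanism f(x; θ) = exp(−θx) from section 4, a polynomial regression stand-in, and a biased HSIC estimator); it is not the paper's estimation procedure.

```python
import numpy as np

def rbf_gram(v, gamma=1.0):
    d = v[:, None] - v[None, :]
    return np.exp(-gamma * d ** 2)

def hsic_b(u, v):
    # Biased HSIC estimator tr(KHLH) / N^2 on standardized inputs.
    n = len(u)
    H = np.eye(n) - np.ones((n, n)) / n
    u = (u - u.mean()) / u.std()
    v = (v - v.mean()) / v.std()
    return np.trace(rbf_gram(u) @ H @ rbf_gram(v) @ H) / n ** 2

def residual_dependence(inp, out, deg=6):
    # Dependence between the regression input and the regression residual.
    resid = out - np.polyval(np.polyfit(inp, out, deg), inp)
    return hsic_b(inp, resid)

rng = np.random.default_rng(0)
n = 150
x = rng.uniform(0, 1, 2 * n)
theta = np.repeat([1.05, 3.05], n)                    # two mechanisms
y = np.exp(-theta * x) + rng.normal(0, 0.05, 2 * n)   # ANM-MM with X -> Y

# Per-component forward fits admit an (almost) independent residual ...
fwd = max(residual_dependence(x[:n], y[:n]),
          residual_dependence(x[n:], y[n:]))
# ... while a single backward model on the pooled mixture does not:
bwd = residual_dependence(y, x)
```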
Based on lemma 1, it is reasonable to hypothesize that a stronger result, justified in theorem 1, is valid: if the g.m. follows an ANM-MM, then it is almost impossible to have a backward ANM-MM in the anticausal direction.

Theorem 1. Let X → Y and let them follow an ANM-MM. If there exists a backward ANM-MM,

X = g(Y; ω) + ε̃,

in the anticausal direction, where ω ∼ p_ω(ω) = \sum_{c̃=1}^{C̃} b_c̃ 1_{ω_c̃}(·), b_c̃ > 0, \sum_{c̃=1}^{C̃} b_c̃ = 1 and ε̃ ⊥⊥ Y, then (p_X, p_ε, f, p_θ) should fulfill C̃ ordinary differential equations similar to (2), i.e.,

ξ''' − (G^(c̃)(X, Y)/H^(c̃)(X, Y)) ξ'' = G^(c̃)(X, Y)V^(c̃)(X, Y)/U^(c̃)(X, Y) − H^(c̃)(X, Y),  c̃ = 1, 2, ..., C̃,  (3)

where ξ := log p_X, and G^(c̃)(X, Y), H^(c̃)(X, Y), U^(c̃)(X, Y) and V^(c̃)(X, Y) are defined similarly to those in lemma 1.

Proof. Assume that there exists an ANM-MM in both directions. Then there exists a non-overlapping partition of the entire data D := {(x_n, y_n)}_{n=1}^{N} = D_1 ∪ ··· ∪ D_C̃ such that in each data block D_c̃ there is an ANM-MM in the causal direction, Y = f(X; θ) + ε, where θ ∼ p_θ^(c̃)(θ) is a discrete distribution on a finite set Θ^(c̃) ⊆ Θ, and an ANM in the anti-causal direction, X = g(Y; ω = ω_c̃) + ε̃. According to lemma 1, for each data block, to ensure the existence of an ANM-MM in the causal direction and an ANM in the anti-causal direction, (p_X, p_ε, f, p_θ) should fulfill an ordinary differential equation of the form (2).
Then the existence of a backward ANM-MM requires C̃ ordinary differential equations to be fulfilled simultaneously, which yields (3).

The causal direction in an ANM-MM can then be inferred by investigating the independence between the hypothetical cause and the corresponding function parameter. According to theorem 1, if they are independent in the causal direction, then it is highly likely that they are dependent in the anticausal direction. Therefore, in practice, the inferred direction is the one that shows more evidence of independence between them.

2.3 Mechanism clustering of ANM-MM

In ANM-MM, θ, which represents the function parameters, can be directly used to identify different g.m.s, since each parameter value corresponds to one mechanism. In other words, observations generated by the same g.m. would have the same θ if the imposed statistical model is identifiable with respect to θ. Denoting the parameter associated with each observation (x_n, y_n) by θ_n, we suppose a more practical inherent clustering structure behind the hidden θ_n. Formally, there is a grouping indicator of integers z ∈ {1, . . . , C}^N that assigns each θ_n to one of the C clusters through the nth element of z, e.g. θ_n belongs to cluster c if [z]_n = c, c ∈ {1, . . . , C}. Following ANM-MM, we may assume each θ_n belongs to one of C components and each component follows N(μ_c, σ²). A likelihood-based clustering scheme suggests minimizing −ℓ jointly with respect to all means and z, where

ℓ(M, z) = log \prod_{n=1}^{N} \prod_{c=1}^{C} { (1/(√(2π) σ)) exp( −(θ_n − μ_c)²/(2σ²) ) }^{1([z]_n = c)},

M = {μ_c}_{c=1}^{C} and 1(·) is the indicator function. To simplify further, let us ignore the known σ² and minimize −ℓ iteratively using coordinate descent:

M̂ | z = arg min_{M} \sum_{c=1}^{C} \sum_{n : [z]_n = c} (θ_n − μ_c)²,  (4)

ẑ | M = arg min_{z} \sum_{c=1}^{C} \sum_{n : [z]_n = c} (θ_n − μ_c)².  (5)

The minimizer of (4) is the mean μ̂_c = (1/n_c) \sum_{n : [z]_n = c} θ_n, where n_c = \sum_{n=1}^{N} 1([z]_n = c) is the size of the cth cluster. The minimizer of (5) is the group assignment by minimum Euclidean distance. Therefore, iterating between (4) and (5) coincides with applying the k-means algorithm to all θ_n, and the goal of finding clusters consistent with the g.m.s for data from an ANM-MM can be achieved by first estimating the parameter associated with each observation and then conducting k-means directly on the parameters.

3 ANM-MM Estimation by GPPOM

We propose the Gaussian process partially observable model (GPPOM) and incorporate Hilbert-Schmidt independence criterion (HSIC) [4] enforcement into it to estimate the model parameter θ. We then summarize the algorithms for causal inference and mechanism clustering of ANM-MM.

3.1 Preliminaries

Dual PPCA. Dual PPCA [11] is a latent variable model in which the maximum likelihood solution for the latent variables is found by marginalizing out the parameters. Given a set of N centered D-dimensional data points Y = [y_1, . . . , y_N]ᵀ, dual PPCA learns the q-dimensional latent representation x_n associated with each observation y_n. The relation between x_n and y_n in dual PPCA is y_n = W x_n + ε_n, where the matrix W specifies the linear relation between y_n and x_n and the noise ε_n ∼ N(0, β⁻¹I).
Then, by placing a standard Gaussian prior on each row of W, one obtains the marginal likelihood of all observations, and the objective function of dual PPCA is the log-likelihood

L = −(DN/2) ln(2π) − (D/2) ln|K| − (1/2) tr(K⁻¹ Y Yᵀ),

where K = XXᵀ + β⁻¹I and X = [x_1, . . . , x_N]ᵀ.

GP-LVM. GP-LVM [10] generalizes dual PPCA to cases of nonlinear relation between y_n and x_n by mapping the latent representations in X to a feature space, i.e. Φ = [φ(x_1), . . . , φ(x_N)]ᵀ, where φ(·) denotes the canonical feature map. Then K = ΦΦᵀ + β⁻¹I, and ΦΦᵀ can be computed using the kernel trick. GP-LVM can also be interpreted as a new class of models consisting of D independent Gaussian processes [19] mapping from a latent space to an observed data space [10].

HSIC. HSIC [4], which is based on reproducing kernel Hilbert space (RKHS) theory, is widely used to measure the dependence between r.v.s. Let D := {(x_n, y_n)}_{n=1}^{N} be a sample of size N drawn independently and identically distributed according to P(X, Y); HSIC answers the query of whether X ⊥⊥ Y. Formally, denoting by F and G RKHSs with universal kernels k, l on the compact domains X and Y, HSIC is the measure defined as HSIC(P(X, Y), F, G) := ‖C_xy‖²_HS, which is essentially the squared Hilbert-Schmidt norm [4] of the cross-covariance operator C_xy from the RKHS G to F [3]. It is proved in [4] that, under the conditions specified in [5], HSIC(P(X, Y), F, G) = 0 if and only if X ⊥⊥ Y. In practice, a biased empirical estimator of HSIC based on the sample D is often adopted:

HSIC_b(D) = (1/N²) tr(KHLH),  (6)

where [K]_ij = k(x_i, x_j), [L]_ij = l(y_i, y_j), H = I − (1/N) 1 1ᵀ, and 1 is an N × 1 vector of ones.

3.2 Gaussian process partially observable model

Partially observable dual PPCA. Dual PPCA is not directly applicable to model ANM-MM since: 1) part of the r.v.
that maps to the effect is visible (i.e. X); 2) the relation (i.e. f) is nonlinear; 3) the r.v.s that contribute to the effect should be independent (X ⊥⊥ θ) in ANM-MM. To tackle 1), a latent r.v. θ is brought into dual PPCA.

Denote the observed effect by Y = [y_1, . . . , y_N]ᵀ, the observed cause by X = [x_1, . . . , x_N]ᵀ, the matrix collecting the function parameters associated with each observation by Θ = [θ_1, . . . , θ_N]ᵀ, and the r.v.s that contribute to the effect by X̃ = [X, θ]. Similar to dual PPCA, the relation between the latent representation and the observation is given by

y_n = W̃ x̃_n + ε_n,  n = 1, . . . , N,

where x̃_n = [x_nᵀ, θ_nᵀ]ᵀ, W̃ is the matrix that specifies the relation between y_n and x̃_n, and ε_n ∼ N(0, β⁻¹I) is the additive noise. Then, by placing a standard Gaussian prior on W̃, i.e. p(W̃) = \prod_{i=1}^{D} N(w̃_{i,:} | 0, I), where w̃_{i,:} is the ith row of the matrix W̃, the log-likelihood of the observations is given by

L(Θ | X, Y, β) = −(DN/2) ln(2π) − (D/2) ln|K̃| − (1/2) tr(K̃⁻¹ Y Yᵀ),  (7)

where K̃ = X̃X̃ᵀ + β⁻¹I = [X, Θ][X, Θ]ᵀ + β⁻¹I = XXᵀ + ΘΘᵀ + β⁻¹I is the covariance matrix after bringing in θ.

General nonlinear cases (GPPOM). Similar to the generalization from dual PPCA to GP-LVM, the dual PPCA with observable X and latent θ can easily be generalized to nonlinear cases. Denote the feature map by φ(·) and Φ = [φ(x̃_1), . . . , φ(x̃_N)]ᵀ; then the covariance matrix is given by K̃ = ΦΦᵀ + β⁻¹I. The entries of ΦΦᵀ can be computed using the kernel trick given a selected kernel k(·, ·). In this paper, we adopt the radial basis function (RBF) kernel, which reads k(x_i, x_j) = exp(−\sum_{d=1}^{D_x} γ_d (x_id − x_jd)²), where γ_d, for d = 1, . . . , D_x, are free parameters and D_x is the dimension of the input. As a result of adopting the RBF kernel, the covariance matrix K̃ in (7) can be computed as

K̃ = ΦΦᵀ + β⁻¹I = K_X ∘ K_θ + β⁻¹I,

where ∘ denotes the Hadamard product, and the entries on the ith row and jth column of K_X and K_θ are given by [K_X]_ij = k(x_i, x_j) and [K_θ]_ij = k(θ_i, θ_j), respectively. After the nonlinear generalization, the relation between Y and X̃ reads Y = f(X̃) + ε = f(X, θ) + ε. This variant of GP-LVM with a partially observable latent space is named GPPOM in this paper.

Algorithm 1: Causal Inference
input: D = {(x_n, y_n)}_{n=1}^{N} - the set of observations of two r.v.s; λ - parameter of independence
output: The causal direction
1 Standardize the observations of each r.v.;
2 Initialize β and the kernel parameters;
3 Optimize (8) in both directions; denote the value of the HSIC term by HSIC_{X→Y} and HSIC_{Y→X}, respectively;
4 if HSIC_{X→Y} < HSIC_{Y→X} then
5   The causal direction is X → Y;
6 else if HSIC_{X→Y} > HSIC_{Y→X} then
7   The causal direction is Y → X;
8 else
9   No decision is made.
10 end
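The Hadamard-product form of the GPPOM covariance is easy to verify numerically: with per-dimension RBF kernels, K_X ∘ K_θ is exactly the RBF Gram matrix on the concatenated input x̃ = [x, θ]. A minimal numpy sketch of this identity (our own illustration; the widths γ and the value of β are arbitrary):

```python
import numpy as np

def rbf_gram(Z, gammas):
    # Gram matrix of k(a, b) = exp(-sum_d gamma_d * (a_d - b_d)^2), Z of shape (N, D).
    diff = Z[:, None, :] - Z[None, :, :]
    return np.exp(-(np.asarray(gammas) * diff ** 2).sum(axis=-1))

def gppom_cov(X, Theta, gx, gt, beta):
    # GPPOM covariance K~ = K_X o K_theta + beta^{-1} I (o = Hadamard product).
    KX = rbf_gram(X, [gx])
    KT = rbf_gram(Theta, [gt])
    return KX * KT + np.eye(len(X)) / beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))        # observed cause
Theta = rng.normal(size=(50, 1))    # latent parameters
K = gppom_cov(X, Theta, gx=0.5, gt=2.0, beta=100.0)

# Sanity check: the factorized form equals one RBF kernel on [x, theta].
K_joint = rbf_gram(np.hstack([X, Theta]), [0.5, 2.0]) + np.eye(50) / 100.0
```

The β⁻¹I jitter also keeps the covariance safely positive definite for the inversion in (7).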
Like GP-LVM, X̃ is mapped to Y by the same set of Gaussian processes in GPPOM, so the differences in the g.m.s are captured by θ_n, the latent representation associated with each observation.

3.3 Model estimation by independence enforcement

Both dual PPCA and GP-LVM find the latent representations through log-likelihood maximization using scaled conjugate gradient [14]. However, θ cannot be found by directly conducting likelihood maximization, since ANM-MM additionally requires independence between X and θ. To this end, we include HSIC [4] in the objective. Incorporating the HSIC term into the negative log-likelihood of GPPOM, the optimization problem reads

arg min_{Θ,Ω} J(Θ) = arg min_{Θ,Ω} [−L(Θ | X, Y, Ω) + λ log HSIC_b(X, Θ)],  (8)

where λ is the parameter that controls the importance of the HSIC term and Ω is the set of all hyperparameters, including β and all kernel parameters γ_d, d = 1, . . . , D_x.

To find Θ, we resort to gradient descent methods. The gradient of the objective J with respect to the latent points in Θ is given by

∂J/∂[Θ]_ij = tr[ (∂J/∂K_θ)ᵀ (∂K_θ/∂[Θ]_ij) ].  (9)

The first part on the right-hand side of (9), the gradient of J with respect to the kernel matrix K_θ, can be computed entrywise as

[∂J/∂K_θ]_ij = −tr[ (K̃⁻¹ Y Yᵀ K̃⁻¹ − D K̃⁻¹)ᵀ (K_X ∘ J^{ij}) ] + λ [ H K_X H / tr(K_X H K_θ H) ]_ij,  (10)

where J^{ij} is the single-entry matrix, 1 at (i, j) and 0 elsewhere, and H = I − (1/N) 1 1ᵀ. Combining this with ∂K_θ/∂[Θ]_ij, whose entry in the mth row and nth column is given by [∂K_θ/∂[Θ]_ij]_mn = ∂k(θ_m, θ_n)/∂[Θ]_ij, through the chain rule, all latent points in Θ can be optimized.
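For concreteness, the objective in (8) can be written down directly in numpy. This sketch (our own, assuming unit-width RBF kernels, a fixed β, and one-dimensional x and θ) only evaluates J(Θ); the paper additionally optimizes Θ and the hyperparameters Ω with gradient-based methods via (9) and (10).

```python
import numpy as np

def rbf(v, gamma=1.0):
    d = v[:, None] - v[None, :]
    return np.exp(-gamma * d ** 2)

def hsic_b(u, t):
    # Biased HSIC estimator (6): tr(K H L H) / N^2.
    N = len(u)
    H = np.eye(N) - np.ones((N, N)) / N
    return np.trace(rbf(u) @ H @ rbf(t) @ H) / N ** 2

def neg_log_lik(Theta, x, Y, beta=100.0):
    # Negative GPPOM log-likelihood from (7):
    #   -L = (DN/2) ln(2 pi) + (D/2) ln|K~| + (1/2) tr(K~^{-1} Y Y^T),
    # with K~ = K_X o K_Theta + beta^{-1} I.
    N, D = Y.shape
    K = rbf(x) * rbf(Theta) + np.eye(N) / beta
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (D * N * np.log(2 * np.pi) + D * logdet
                  + np.trace(np.linalg.solve(K, Y @ Y.T)))

def objective(Theta, x, Y, lam=1.0):
    # J = -L + lambda * log HSIC_b(X, Theta), cf. (8); minimized over Theta
    # (and, in the paper, over the hyperparameters Omega as well).
    return neg_log_lik(Theta, x, Y) + lam * np.log(hsic_b(x, Theta))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 40)
theta = rng.normal(0, 1, 40)
Y = (np.exp(-theta * x) + rng.normal(0, 0.05, 40))[:, None]
J = objective(theta, x, Y)
```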
With Θ, one can conduct causal inference and mechanism clustering of ANM-MM. The detailed steps are given in Algorithms 1 and 2.

4 Experiments

In this section, experimental results on both synthetic and real data are given to show the performance of ANM-MM on causal inference and mechanism clustering tasks. The Python code of ANM-MM is available online at https://github.com/amber0309/ANM-MM.

Algorithm 2: Mechanism clustering
input: D = {(x_n, y_n)}_{n=1}^{N} - the set of observations of two r.v.s; λ - parameter of independence; C - number of clusters
output: The cluster labels
1 Standardize the observations of each r.v.;
2 Initialize β and the kernel parameters;
3 Find Θ by optimizing (8) in the causal direction;
4 Apply k-means to θ_n, n = 1, . . . , N;
5 return the cluster labels.

Figure 3: Accuracy (y-axis) versus sample size (x-axis) on Y = f(X; θ_c) + ε with different mechanisms. (a) f1, (b) f2, (c) f3, (d) f4.

4.1 Synthetic data

In experiments on causal inference, ANM-MM is compared with ANM [6], PNL [21], IGCI [8], ECP [20] and LiNGAM [18]. The results are evaluated using accuracy, which is the percentage of correct causal direction estimations over 50 independent experiments. Note that ANM-MM was applied with different parameters λ ∈ {0.001, 0.01, 0.1, 1, 10} and IGCI was applied with different reference measures and estimators; their highest accuracy is reported.
In experiments on clustering, ANM-MM is compared with the well-known k-means [13] (similarity-based) on both the raw data (k-means) and its PCA components (PCA-km), Gaussian mixture clustering (GMM) [16] (model-based), spectral clustering (SpeClu) [17] (spectral graph theory-based) and DBSCAN [2] (density-based).
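Steps 4-5 of Algorithm 2 are plain k-means on the recovered parameters, matching the coordinate-descent derivation in section 2.3. A minimal one-dimensional sketch (our own illustration; synthetic θ values stand in for the Θ that GPPOM would recover in step 3):

```python
import numpy as np

def kmeans_1d(theta, C, iters=50, seed=0):
    # Lloyd's iterations on the scalar parameters theta_n: alternate the
    # mean update (4) and the nearest-centre assignment (5) of section 2.3.
    rng = np.random.default_rng(seed)
    mu = rng.choice(theta, C, replace=False)
    for _ in range(iters):
        z = np.argmin((theta[:, None] - mu[None, :]) ** 2, axis=1)  # step (5)
        mu = np.array([theta[z == c].mean() if np.any(z == c) else mu[c]
                       for c in range(C)])                          # step (4)
    return z, mu

# Toy latent parameters concentrated around two mechanism values, as an
# identifiable model would produce for a two-component ANM-MM.
rng = np.random.default_rng(0)
theta = np.concatenate([rng.normal(1.05, 0.05, 50), rng.normal(3.05, 0.05, 50)])
labels, centres = kmeans_1d(theta, C=2)
```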
Clustering performance is evaluated using the average adjusted Rand index [7] (avgARI), which is the mean ARI over 100 experiments. A high ARI (∈ [−1, 1]) indicates a good match between the clustering results and the ground truth. The sample size (N) is 100 in all synthetic clustering experiments. Clustering results are visualized in the supplementary¹.
Different g.m.s and sample sizes. We examine the performance on different g.m.s (f) and sample sizes (N). The mechanisms adopted are the following elementary functions: 1) f1 = 1/(1.5 + θ_c X²); 2) f2 = 2 × X^(θ_c − 0.25); 3) f3 = exp(−θ_c X); 4) f4 = tanh(θ_c X). We tested sample sizes N = 50, 100 and 200 for each mechanism. Given f and N, the cause X is sampled from a uniform distribution U(0, 1) and then mapped to the effect by Y = f(X; θ_c) + ε, c ∈ {1, 2}, where θ1 ∼ U(1, 1.1), θ2 ∼ U(3, 3.1) and ε ∼ N(0, 0.05²). Each mechanism generates half of the observations.
Causal Inference. The results are shown in Fig. 3. ANM-MM and ECP outperform the others, which are based on a single causal model; this is consistent with our anticipation. Compared with ECP, ANM-MM shows slight advantages in 3 out of 4 settings. Clustering. The avgARI values are summarized in column (i) of Table 1. ANM-MM significantly outperforms the other approaches in all mechanism settings.
Different numbers of g.m.s.² We examine the performance for different numbers of g.m.s (C in Definition 1). θ1, θ2 and ε are the same as in the previous experiments. In the setting of three mechanisms, θ3 ∼ U(0.5, 0.6). In the setting of four, θ3 ∼ U(0.5, 0.6) and θ4 ∼ U(2, 2.1). Again, the numbers of observations from each mechanism are the same.

¹The results of PCA-km are not visualized since they are similar to, and worse than, those of k-means.
²From this part on, g.m.
is fixed to be f3.

Table 1: avgARI of synthetic clustering experiments

Method   | (i) f: f1 | f2    | f3    | f4    | (ii) C: 3 | 4     | (iii) σ: 0.01 | 0.1   | (iv) a1: 0.25 | 0.75
ANM-MM   | 0.393     | 0.660 | 0.777 | 0.682 | 0.610     | 0.447 | 0.798         | 0.608 | 0.604         | 0.867
k-means  | 0.014     | 0.039 | 0.046 | 0.046 | 0.194     | 0.165 | 0.049         | 0.042 | 0.047         | 0.013
PCA-km   | 0.013     | 0.037 | 0.044 | 0.048 | 0.056     | 0.041 | 0.047         | 0.040 | 0.052         | 0.014
GMM      | 0.015     | 0.340 | 0.073 | 0.208 | 0.237     | 0.202 | 0.191         | 0.025 | 0.048         | 0.381
SpeClu   | 0.003     | 0.129 | 0.295 | 0.192 | 0.285     | 0.175 | 0.595         | 0.048 | 0.044         | -0.008
DBSCAN   | 0.055     | 0.265 | 0.342 | 0.358 | 0.257     | 0.106 | 0.527         | 0.110 | 0.521         | 0.718

(The σ = 0.05 and a1 = 0.5 settings correspond to the (i) f3 column.)

Figure 4: Accuracy (y-axis) versus (a) number of mechanisms; (b) noise standard deviation; (c) mixing proportion; on f3 with N = 100.

Causal Inference. The results are given in Fig. 4a, which shows a decreasing trend for all approaches. However, ANM-MM stays at 100% when the number of mechanisms increases from 2 to 3. Clustering. The avgARI values are given in columns (ii) and (i) f3 of Table 1. The different approaches show different trends, probably due to the clustering principles they are based on. Although ANM-MM is heavily influenced by C, its performance is still much better than that of the others.
Different noise standard deviations. We examine the performance for different noise standard deviations σ. θ1 and θ2 are the same as in the first part of the experiments. Three cases, σ = 0.01, 0.05 and 0.1, are tested.
Causal Inference. The results are given in Fig. 4b. A change in σ within this range does not significantly influence the performance of most causal inference approaches. ANM-MM keeps 100% accuracy for all choices of σ. Clustering. The avgARI values are given in columns (iii) and (i) f3 of Table 1. As we anticipated, the clustering results rely heavily on σ, and all approaches show a decreasing trend in avgARI as σ increases. However, ANM-MM is the most robust against large σ.
Different mixing proportions. We examine the performance for different mixing proportions (a_c in Definition 1). θ1, θ2 and σ are the same as in the first part of the experiments. The cases a1 = 0.25, 0.5 and 0.75 (with corresponding a2 = 0.75, 0.5 and 0.25) are tested.
Causal Inference. The results for different a1 are given in Fig. 4c. Approaches based on a single causal model are sensitive to the change in a1, whereas ECP and ANM-MM are more robust and outperform the others. Clustering. The avgARI values for different a1 are given in columns (iv) and (i) f3 of Table 1. The results of the compared approaches are significantly affected by a1, and ANM-MM shows the best robustness against changes in a1.

Figure 5: Accuracy on real cause-effect pairs.

Figure 6: Ground truth and clustering results of different approaches on BAFU air data. (a) Ground truth; (b) ANM-MM; (c) k-means; (d) GMM; (e) SpeClu; (f) DBSCAN.

4.2 Real data

Causal inference on Tübingen cause-effect pairs. We evaluate the causal inference performance of ANM-MM on the real-world benchmark cause-effect pairs³ [15]. Nine out of 41 data sets are excluded in our experiment because either they consist of multivariate or categorical data (pairs 47, 52, 53, 54, 55, 70, 71, 101 and 105) or the estimated latent representations are extremely close⁴ (pairs 12 and 17). Fifty independent experiments are repeated for each pair, and the percentage of correct inferences of the different approaches is recorded. The average percentage over pairs from the same data set is then computed as the accuracy of the corresponding data set. In each independent experiment, different
In each independent experiment, the different inference approaches are applied to 90 points randomly sampled from the raw data without replacement. The results are summarized in Fig. 5, with the blue solid line indicating median accuracy and the red dashed line indicating mean accuracy. The performance of ANM-MM is satisfactory, with the highest median accuracy of about 82%. IGCI also performs quite well, especially in terms of the median, followed by PNL.

Clustering on BAFU air data. We evaluate the clustering performance of ANM-MM on real air data obtained online5. The data consist of daily mean values of ozone (µg/m3) and temperature (°C) for 2009 from two distinct locations in Switzerland. In our experiment, we regard the data as generated by two mechanisms, each corresponding to a location. The clustering results are visualized in Fig. 6. The ARI value of ANM-MM is 0.503, whereas k-means, GMM, spectral clustering and DBSCAN obtain ARIs of only -0.001, 0.003, 0.078 and 0.003, respectively. ANM-MM is the only approach that reveals the location-related property of the data g.m.

5 Conclusion

In this paper, we extend the ANM to a more general model (ANM-MM) consisting of a finite number of ANMs that share the same functional form and differ only in their parameter values. The condition for the identifiability of ANM-MM is analyzed. To estimate ANM-MM, we adopt the GP-LVM framework and propose a variant of it, called GPPOM, to find the optimized latent representations and further conduct causal inference and mechanism clustering.
Results on both synthetic and real-world data verify the effectiveness of the proposed approach.

3 https://webdav.tuebingen.mpg.de/cause-effect/.
4 Close in the sense that |θi − θj| < 0.001.
5 https://www.bafu.admin.ch/bafu/en/home/topics/air.html

Acknowledgments

This work is partially supported by the Hong Kong Research Grants Council.

References

[1] Daniusis, P., Janzing, D., Mooij, J., Zscheischler, J., Steudel, B., Zhang, K., and Schölkopf, B. (2012). Inferring deterministic causal relations. arXiv preprint arXiv:1203.3475.

[2] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pages 226–231. AAAI Press.

[3] Fukumizu, K., Bach, F. R., and Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99.

[4] Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005a). Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pages 63–77. Springer.

[5] Gretton, A., Smola, A. J., Bousquet, O., Herbrich, R., Belitski, A., Augath, M., Murayama, Y., Pauls, J., Schölkopf, B., and Logothetis, N. K. (2005b). Kernel constrained covariance for dependence measurement. In AISTATS, volume 10, pages 112–119.

[6] Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., and Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pages 689–696.

[7] Hubert, L. and Arabie, P. (1985). Comparing partitions.
Journal of Classification, 2(1):193–218.

[8] Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniušis, P., Steudel, B., and Schölkopf, B. (2012). Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1–31.

[9] Janzing, D. and Schölkopf, B. (2010). Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194.

[10] Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6(Nov):1783–1816.

[11] Lawrence, N. D. (2004). Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems, pages 329–336.

[12] Liu, F. and Chan, L. (2016). Causal discovery on discrete data with extensions to mixture model. ACM Transactions on Intelligent Systems and Technology (TIST), 7(2):21.

[13] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 281–297, Berkeley, Calif. University of California Press.

[14] Møller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4):525–533.

[15] Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. (2016). Distinguishing cause from effect using observational data: methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204.

[16] Rasmussen, C. E. (2000). The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, pages 554–560.

[17] Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.

[18] Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030.

[19] Williams, C. K. (1998). Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In Learning in Graphical Models, pages 599–621. Springer.

[20] Zhang, K., Huang, B., Zhang, J., Schölkopf, B., and Glymour, C. (2015). Discovery and visualization of nonstationary causal models. arXiv preprint arXiv:1509.08056.

[21] Zhang, K. and Hyvärinen, A. (2009). On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 647–655. AUAI Press.