{"title": "Causal Discovery from Discrete Data using Hidden Compact Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 2666, "page_last": 2674, "abstract": "Causal discovery from a set of observations is one of the fundamental problems across several disciplines. For continuous variables, recently a number of causal discovery methods have demonstrated their effectiveness in distinguishing the cause from effect by exploring certain properties of the conditional distribution, but causal discovery on categorical data still remains to be a challenging problem, because it is generally not easy to find a compact description of the causal mechanism for the true causal direction. In this paper we make an attempt to find a way to solve this problem by assuming a two-stage causal process: the first stage maps the cause to a hidden variable of a lower cardinality, and the second stage generates the effect from the hidden representation. In this way, the causal mechanism admits a simple yet compact representation. We show that under this model, the causal direction is identifiable under some weak conditions on the true causal mechanism. We also provide an effective solution to recover the above hidden compact representation within the likelihood framework. 
Empirical studies verify the effectiveness of the proposed approach on both synthetic and real-world data.", "full_text": "Causal Discovery from Discrete Data using Hidden\n\nCompact Representation\n\nRuichu Cai 1, Jie Qiao1, Kun Zhang2, Zhenjie Zhang3, Zhifeng Hao1, 4\n1 School of Computer Science, Guangdong University of Technology, China\n\n2 Department of philosophy, Carnegie Mellon University\n\n3 Singapore R&D, Yitu Technology Ltd.\n\n4 School of Mathematics and Big Data, Foshan University, China\n\ncairuichu@gdut.edu.cn, qiaojie.chn@gmail.com, kunz1@andrew.cmu.edu,\n\nzhenjie.zhang@yitu-inc.com, zfhao@gdut.edu.cn\n\nAbstract\n\nCausal discovery from a set of observations is one of the fundamental problems\nacross several disciplines. For continuous variables, recently a number of causal\ndiscovery methods have demonstrated their effectiveness in distinguishing the cause\nfrom effect by exploring certain properties of the conditional distribution, but causal\ndiscovery on categorical data still remains to be a challenging problem, because it\nis generally not easy to \ufb01nd a compact description of the causal mechanism for the\ntrue causal direction. In this paper we make an attempt to \ufb01nd a way to solve this\nproblem by assuming a two-stage causal process: the \ufb01rst stage maps the cause to\na hidden variable of a lower cardinality, and the second stage generates the effect\nfrom the hidden representation. In this way, the causal mechanism admits a simple\nyet compact representation. We show that under this model, the causal direction is\nidenti\ufb01able under some weak conditions on the true causal mechanism. We also\nprovide an effective solution to recover the above hidden compact representation\nwithin the likelihood framework. 
Empirical studies verify the effectiveness of the\nproposed approach on both synthetic and real-world data.\n\n1\n\nIntroduction\n\nBecause randomized controlled experiments are usually infeasible and generally too expensive,\nobservational data-based causal discovery, has been a focus of recent research in this area [Spirtes et\nal., 2000; Pearl, 2009]. Various observational-based causal discovery methods have been proposed by\nexploring certain properties of the conditional distribution. For example, constraint-based methods\nexploit conditional independence relations between the variables in order to estimate the Markov\nequivalence class of the underlying causal graph [Spirtes et al., 2000; Pearl and Verma, 1995]. On\nlinear non-Gaussian acyclic data, the Linear, Non-Gaussian, Acyclic Model (LiNGAM) [Shimizu et\nal., 2006, 2011] has been used to reconstruct the causal network by maximizing the independence\namong the noises. On nonlinear data, additive noise model [Hoyer et al., 2009] and post-nonlinear\nmodel [Zhang and Chan, 2006; Zhang and Hyv\u00e4rinen, 2009] can be used to distinguish the cause from\neffect by considering the independence between the noise and the cause. Recently, the likelihood\nembedded with various constraints and models among the variables are conducted for score-based\nmethods [Cai et al., 2018].\n\nThough the additive noise model (Y = g(X) + E, X \u0002 E) has been extended to handle discrete data\n\n[Peters et al., 2010], causal discovery on categorical data still remains to be a challenging problem.\nNote that it is usually hard to justify the additive noise assumptions for discrete data, especially for\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Food Poisoning: A Hidden Compact Representation Example in Real World.\n\ncategorical data. 
In fact, the additive noise model assumes that all categories of the variables are\nplaced in the \u201cright\u201d order; furthermore, if X \u2192 Y holds according to the additive noise model,\n\nthen for any observation(x, y), there exists a function g(X) such that the noise E = y \u2212 g(x) is\nconditional distribution P(Y\u08afX = x) always has the same shape for different values of x, after being\nproperly shifted according to g(x). But the values of a categorical variable are mutually exclusive\n\nindependent of X, i.e., it has the same distribution for different values of x. In other words, the\n\n\u2032) provides a compact representation of the\n\ncategories or groups, without a meaningful order of magnitude. Thus, the additive noise model may\nnot be a proper representation of the causal mechanism for categorical variables.\nTherefore, a proper description of the causal mechanism for discrete data that helps in causal discovery\nremains under explored. In this work, we make an attempt to \ufb01nd a way to solve this problem by\nintroducing a new assumption, called Hidden Compact Representation (HCR in short) as shown in\nthe food poisoning example given in Figure 1: the \ufb01rst stage maps cause to a hidden variable of a\nlower cardinality, and the second stage generates the effect from the hidden representation. As shown\nin the example, in the \ufb01rst stage, the food (X) with four different categories is mapped to the binary\n\u2032) with the key information whether it is poisonous; in the second\ncompact hidden representation (Y\nstage, the hidden representation (poisonous or not) determines whether the patient is diagnosed as\n\nhaving food poisoning (Y ). The hidden representation(Y\nX \u0002 Y\u08afY\n\ncauses and captures key information of the causal mechanism, leaving out irrelevant information in\nthe cause. 
This way, the causal mechanism admits a simple, compact representation.\nLet us have a closer look at the hidden compact representation model and see whether it is possible\nto estimate it from data. First, these two stages are separated by the hidden representation, i.e.,\n\u2032 holds. As a result, we can use two conditional probabilities to express the whole causal\nmechanism from X to Y , as shown in the tables in Figure 1. Second, as a compact representation\nof the cause, the \ufb01rst stage is deterministic, all stage transfer is done with probability 1. Third, as a\ncausal mechanism, the second stage can be represented by a probabilistic mapping from the hidden\n\u2032 to the effect Y . Based on the above observations, we provide a practical method to\nvariable Y\nestimate the above HCR model under the likelihood framework. We also theoretically show that the\nmodel is identi\ufb01able under weak conditions on the causal mechanism.\nOur main contributions include 1) proposing a two-stage compact representation of the causal\nmechanism in the discrete case, 2) developing a likelihood-based framework for estimating the HCR\nmodel, and 3) conducting a theoretical analysis of the identi\ufb01ability of the underlying causal direction.\n\n2 Hidden Compact Representation Model\n\nWithout loss of generality, let X be the cause of Y in a discrete cause-effect pair, i.e., X \u2192 Y . Here,\n\u2032 \u2192 Y , to model the causal mechanism\nwe use the hidden compact representation, M \u08bc X \u2192 Y\nbehind the discrete data, with Y\n\n\u2032 as a hidden compact representation of the cause X.\n\n\u2032, cause X is mapped to a low-cardinality hidden variable Y\n\n\u2032 = f(X), where f \u08bc Z \u2192 Z is a noise-free arbitrary function. 
It\n\nIn the \ufb01rst stage X \u2192 Y\nIt can be expressed by using Y\n\n\u2032 deterministically.\n\n2\n\n(cid:28651)(cid:28652)(cid:28704)(cid:28652)(cid:28599)(cid:28629)(cid:28649)(cid:28647)(cid:28633)(cid:28601)(cid:28634)(cid:28634)(cid:28633)(cid:28631)(cid:28648)(cid:28604)(cid:28637)(cid:28632)(cid:28632)(cid:28633)(cid:28642)(cid:28564)(cid:28599)(cid:28643)(cid:28641)(cid:28644)(cid:28629)(cid:28631)(cid:28648)(cid:28564)(cid:28614)(cid:28633)(cid:28644)(cid:28646)(cid:28633)(cid:28647)(cid:28633)(cid:28642)(cid:28648)(cid:28629)(cid:28648)(cid:28637)(cid:28643)(cid:28642)(cid:28643)(cid:28674)(cid:28668)(cid:28678)(cid:28674)(cid:28673)(cid:28674)(cid:28680)(cid:28678)(cid:28595)(cid:28640)(cid:28680)(cid:28678)(cid:28667)(cid:28677)(cid:28674)(cid:28674)(cid:28672)(cid:28640)(cid:28680)(cid:28678)(cid:28667)(cid:28677)(cid:28674)(cid:28674)(cid:28672)(cid:28645)(cid:28668)(cid:28662)(cid:28664)(cid:28643)(cid:28674)(cid:28668)(cid:28678)(cid:28674)(cid:28673)(cid:28674)(cid:28680)(cid:28678)(cid:28641)(cid:28674)(cid:28679)(cid:28595)(cid:28643)(cid:28674)(cid:28668)(cid:28678)(cid:28674)(cid:28673)(cid:28674)(cid:28680)(cid:28678)(cid:28643)(cid:28674)(cid:28668)(cid:28678)(cid:28674)(cid:28673)(cid:28674)(cid:28680)(cid:28678)(cid:28595)(cid:28633)(cid:28668)(cid:28678)(cid:28667)(cid:28612)(cid:28612)(cid:28611)(cid:28611)(cid:28611)(cid:28611)(cid:28612)(cid:28612)(cid:28643)(cid:28674)(cid:28668)(cid:28678)(cid:28674)(cid:28673)(cid:28674)(cid:28680)(cid:28678)(cid:28633)(cid:28674)(cid:28674)(cid:28663)(cid:28595)(cid:28643)(cid:28674)(cid:28668)(cid:28678)(cid:28674)(cid:28673)(cid:28668)(cid:28673)(cid:28666)(cid:28641)(cid:28674)(cid:28677)(cid:28672)(cid:28660)(cid:28671)(cid:28611)(cid:28609)(cid:28619)(cid:28616)(cid:28611)(cid:28609)(cid:28611)(cid:28616)(cid:28641)(cid:28674)(cid:28679)(cid:28595)(cid:28643)(cid:28674)(cid:28668)(cid:28678)(cid:28674)(cid:28673)(cid:28674)(cid:28680)(cid:28678)(cid:28646)(cid:28679)(cid:28674)(cid:28672
)(cid:28660)(cid:28662)(cid:28667)(cid:28595)(cid:28633)(cid:28671)(cid:28680)(cid:28611)(cid:28609)(cid:28612)(cid:28611)(cid:28611)(cid:28609)(cid:28620)(cid:28611)(cid:28611)(cid:28609)(cid:28611)(cid:28614)(cid:28611)(cid:28609)(cid:28611)(cid:28618)(cid:28615)(cid:28648)(cid:28629)(cid:28635)(cid:28633)(cid:28564)(cid:28581)(cid:28590)(cid:28564)(cid:28632)(cid:28633)(cid:28648)(cid:28633)(cid:28646)(cid:28641)(cid:28637)(cid:28642)(cid:28637)(cid:28647)(cid:28648)(cid:28637)(cid:28631)(cid:28564)(cid:28641)(cid:28629)(cid:28644)(cid:28644)(cid:28637)(cid:28642)(cid:28635)(cid:28665)(cid:28677)(cid:28674)(cid:28672)(cid:28595)(cid:28662)(cid:28660)(cid:28680)(cid:28678)(cid:28664)(cid:28595)(cid:28603)(cid:28651)(cid:28604)(cid:28595)(cid:28679)(cid:28674)(cid:28595)(cid:28667)(cid:28668)(cid:28663)(cid:28663)(cid:28664)(cid:28673)(cid:28595)(cid:28677)(cid:28664)(cid:28675)(cid:28677)(cid:28664)(cid:28678)(cid:28664)(cid:28673)(cid:28679)(cid:28660)(cid:28679)(cid:28668)(cid:28674)(cid:28673)(cid:28595)(cid:28603)(cid:28652)(cid:28704)(cid:28604)(cid:28615)(cid:28648)(cid:28629)(cid:28635)(cid:28633)(cid:28564)(cid:28582)(cid:28590)(cid:28564)(cid:28644)(cid:28646)(cid:28643)(cid:28630)(cid:28629)(cid:28630)(cid:28637)(cid:28640)(cid:28637)(cid:28647)(cid:28648)(cid:28637)(cid:28631)(cid:28564)(cid:28641)(cid:28629)(cid:28644)(cid:28644)(cid:28637)(cid:28642)(cid:28635)(cid:28564)(cid:28665)(cid:28677)(cid:28674)(cid:28672)(cid:28595)(cid:28667)(cid:28668)(cid:28663)(cid:28663)(cid:28664)(cid:28673)(cid:28595)(cid:28677)(cid:28664)(cid:28675)(cid:28677)(cid:28664)(cid:28678)(cid:28664)(cid:28673)(cid:28679)(cid:28660)(cid:28679)(cid:28668)(cid:28674)(cid:28673)(cid:28595)(cid:28603)(cid:28652)(cid:28704)(cid:28604)(cid:28595)(cid:28679)(cid:28674)(cid:28595)(cid:28664)(cid:28665)(cid:28665)(cid:28664)(cid:28662)(cid:28679)(cid:28595)(cid:28603)(cid:28652)(cid:28604)(cid:28651)(cid:28652)(cid:28704)(cid:28652)(cid:28704)(cid:28652)\f\u2032. 
This stage extracts the\nimplies that the cause X can be reduced to a hidden low-cardinality space Y\nreal, necessary causal factor behind the various cause states. As shown in Figure 1, there are four\n\u2032 extracts the key information in this causal\ndifferent values of X, and the hidden representation Y\nmechanism, i.e., whether the food is poisonous or not.\n\nIn the second stage, the effect Y is generated from the hidden representation Y\n\n\u2032 = f(x)). For instance, as shown in\n\n\u2032 by the probabilistic\n\nFigure 1, the food poisoning may misdiagnose as stomach \ufb02u with probability 0.1, which is captured\nby the conditional distribution.\nIn this hidden compact representation model, the deterministic mapping stage and probabilistic\n\u2032. Given a\nmapping stage are naturally separable by the hidden representation Y\n\u2032 \u2192 Y is\n\n\u2032, i.e., X \u0002 Y\u08afY\n\ni=1, the log-likelihood of the model M \u08bc X \u2192 Y\n\n\u2032\n\n3\n\nestimated as follows.\n\nmapping with conditional probability distribution P(Y = y\u08afY\ngroup of observations D ={(xi, yi)}m\nP(X = xi, Y\nP(X = xi)P(Y\nP(X = xi)P(Y = yi\u08afY\n\nL(M ;D)\n\n= log\n\n= log\n\n= log\n\nm(cid:53)\n\nm(cid:53)\n\nm(cid:53)\n\n\u2032 = y\n\n(cid:61)\n\n(cid:61)\n\ni=1\n\ni=1\n\n\u2032\ni\n\n\u2032\ni\n\ny\n\ny\n\n\u2032\n\n\u2032\n\n\u2032\n\n\u2032 = y\n\n\u2032 = y\n\ni)\n\ni, Y = yi\u08afM)\ni\u08afX = xi)P(Y = yi\u08afY\n\u2032 = f(xi))\ni\u08afX = xi), denotes how the compact representation\ni\u08afX = xi) = 1 if\ni \u2260 f(xi), where function f denotes the true mapping.\n\n\u2032. The\n\n\u2032 = y\n\n(1)\n\n\u2032\n\n\u2032\n\n\u2032\n\n\u2032\n\ni=1\n\n\u2032 = y\n\n\u2032 = y\n\nlog(m)\n\nEquation (1) decomposes the joint probability into three components according to X \u0002 Y\u08afY\nmiddle term of the second equation, P(Y\n\u2032 is generated from X. 
Since this process is deterministic, we have P(Y\ni\u08afX = xi) = 0 if y\ni = f(xi) and P(Y\n\nY\n\u2032\ny\nDifferent from the previous likelihood framework, the likelihood given in equation (1) contains a\nhidden representation with an unknown cardinality. Thus, the Bayesian Information Criterion (BIC)\n[Schwarz and others, 1978] is introduced to control the complexity of the model, which provides a\ntrade-off between the goodness of \ufb01t and model complexity. The BIC is given in equation (2), which\nis an approximation of the marginal likelihood of the hidden compact representation model M based\non the data D:\n\nL\u2217(M ;D) = L(M ;D) \u2212 d\n\u2032\u08af(\u08afY\u08af \u2212 1) measures the effective number of parameters in the model. In\nwhere d =(\u08afX\u08af \u2212 1) +\u08afY\n\u2032\u08af(\u08afY\u08af \u2212 1) are the numbers of parameters for P(X), and the probabilistic\ndetail,(\u08afX\u08af \u2212 1) and\u08afY\nmapping PY\u08afY \u2032, respectively.\nin M are decomposed into two parts, \u03b8 and f, where \u03b8 includes the parameters of P(X) and\nP(Y\u08afY\n\u2032). Maximization of above objective function, i.e., maxL\u2217 = supf max\u03b8 L\u2217, involves two\nTo recover the causal model, we regard the model with the highest L\u2217 as the best one. The parameters\niterative steps. First, calculate the maximum likelihood estimator (MLE) \u02c6\u03b8 = argmax\u03b8 L(\u03b8;D) while\nbest f to achieve supf L\u2217(f ;D). Such an alternate maximization procedure eventually converges,\nmax\u03b8 L(\u03b8;D) and the MLE \u02c6\u03b8 of the likelihood L can be calculated directly as described in the\nLet \u02c6\u03b8 ={\u02c6a, \u02c6b} where \u02c6ax = \u02c6P(X = x), \u02c6by,y\u2032 = \u02c6P(Y = y\u08afY\n\u2032) denote the MLE of the distribution\nPX,PY\u08afY \u2032 respectively. 
The MLE of those parameters can be written as \u02c6ax = nx\u2211\n, yi = y) are the frequencies of value\ni=1 I(xi = x), and ny\u2032,y = \u2211m\ni=1 I(y\nwhere nx = \u2211m\n\u2032 = y\nX = x and Y\n\nas shown in [Bezdek and Hathaway, 2003].\nIn the \ufb01rst step, more speci\ufb01cally, while \ufb01xing the f, the maximization is equivalent to max\u03b8 L\u2217 =\n\n, Y = y in samples respectively. Such solutions can be derived by maximizing L\n\n\ufb01xing the function f. Second, \ufb01x the parameter values of \u03b8 and \ufb01nd a better model by choosing the\n\n, \u02c6by\u2032,y = ny\n,y\u2211\ny ny\n\nfollowing.\n\ni = y\n\n\u2032 = y\n\n(2)\n\nx nx\n\n2\n\n,y\n\n,\n\n\u2032\n\n\u2032\n\n\u2032\n\n\u2032\n\n\f\u2032. Following likelihood of the model\n\ny by\u2032,y = 1, \u2200y\n\nin equation (1), the solution of max\u03b8 L(\u03b8;D) is given in equation (3).\nwith the constraints conditions \u2211\n\u2032 = f(xi))\nny\u2032,ylog( ny\u2032,y\u2211\n\nx ax = 1, and \u2211\nL(\u03b8;D)\nP(X = xi)P(Y = yi\u08afY\nx (cid:53)\nnx log( nx\u2211\n\u02c6anx\n\nm(cid:53)\n= log\n= log(cid:53)\n=(cid:61)\n\n) + (cid:61)\n\n\u2032\nny\n\u02c6b\n,y\ny\u2032,y\n\n(cid:53)\n\n(cid:61)\n\nmax\n\ni=1\n\ny\u2032\n\nx\n\ny\n\n\u03b8\n\nx\n\nx nx\n\ny\u2032\n\ny\n\ny ny\u2032,y\n\n(3)\n\n)\n\nIn the second and third equalities, we collect each category and perform a MLE to estimate each\nparameter in the distribution. Consequently, the optimum solution \u02c6\u03b8 is given in the closed form,\nwhich will be used in the second step of the optimization procedure.\nIn the second step, we propose a greedy search algorithm to search the best f with supf L\u2217. Firstly,\n\nf(x) is initialized with the y0, where y0 is the mode of{y\u08af < x, y >\u2208 D}. 
Secondly, we perform\ngreedy search on the f(x) by enumerating all possible values for each x.\n\nIn summary, the optimization of maxL\u2217 = supf max\u03b8 L\u2217 is given in the following algorithm.\nAlgorithm 1 Optimization of maxL\u2217 = supf max\u03b8 L\u2217\nInput: Data D\nOutput: L\u2217\n\n\u2032 > do\n\nif \u02c6L\u2217 < L\u2217 then\n\nlog(m)\n\nt = t + 1\nfor each pair < x, y\n\n1: f(0)(x) \u2190 argmaxy\n\u02c6P(X = x, Y = y)\n2: while L\u2217 no longer increases do\nL\u2217 \u2190 max\u03b8 L(f\nx\u2192y\u2032 , \u03b8;D) \u2212 d\n(t\u22121)\n3:\n4:\n\u001b\u02c6x, \u02c6y\n,L\u2217\u001b\n, \u02c6L\u2217\u001b \u2190\u001bx, y\n5:\n6:\nSet f(\u02c6x) to be \u02c6y\n7:\n8:\n9:\n10:\n11: end while\n12: return L\u2217\nx\u2192y\u2032 denotes that the change of the value f(t\u22121)(x) to value y\n(t\u22121)\n(t\u22121)\nf(\u02c6x) to \u02c6y\n\u2032 in order to achieve the highest score increase. Finally, set f(\u02c6x) \u2190 \u02c6y\n\n\u2032 and let L\u2217 = \u02c6L\u2217\n\nend for\n\nend if\n\n2\n\n\u2032\n\n\u2032\n\n\u2032 > by traversing the value of y\n\nwhere f\nthe best gain pair < \u02c6x, \u02c6y\nuntil L\u2217 no longer increases.\nBased on the above proposed hidden compact representation and its BIC score, we can simply get the\nfollowing practical method for causal inference.\n\n\u2032. In each iteration, we search\nx\u2192y\u2032 . In other words, change the value\n\u2032 and update the score\n\n\u2032 in f\n\n1. Estimate the model M \u08bc X \u2192 Y\n\nL\u2217( \u02dcM ;D) respectively;\n2. 
If L\u2217(M ;D) > L\u2217( \u02dcM ;D), infer \u201cX \u2192 Y \u201d,\nIf L\u2217(M ;D) < L\u2217( \u02dcM ;D), infer \u201cY \u2192 X\u201d,\nIf L\u2217(M ;D) = L\u2217( \u02dcM ;D), infer \u201cnon-identi\ufb01able\u201d.\n\n\u2032 \u2192 Y , \u02dcM \u08bc Y \u2192 X\n\n\u2032 \u2192 X by maximizing L\u2217(M ;D) ,\n\nThe asymptotic correctness of this practical methods is implied by the identi\ufb01ability of the model,\nwhich is theoretically analyzed in the following section.\n\n4\n\n\f3\n\nIdenti\ufb01ability\n\nfunctions of X.)\n\nrandom in the sense that\n\nThen asymptotically, in the reverse direction there does not exist X\n\nWe shall show that under the hidden compact representation model, the causal direction is asymptoti-\ncally identi\ufb01able in the general case (under some technical conditions).\nWe \ufb01rst show the following property for the reverse direction under certain conditions on the\n\nconditional distribution P(Y\u08afX).\nTheorem 1. Assume that for the correct causal direction, the conditional distribution P(Y\u08afX) is\nA1. there does not exist values y1 \u2260 y2 such that P(Y = y1\u08af X) equals P(Y = y2\u08af X) times a\nconstant for all possible X values. (Note that both P(Y = y1\u08af X) and P(Y = y2\u08af X) are\n\u2032\u08af <\u08afY\u08af such\n\u2032 = \u02c6f(Y) with\u08afX\nthat P(X\u08afY) = P(X\u08afX\n\u2032) for all possible X and Y values, i.e., the reverse direction does not admit\na low-cardinality hidden representation \u02c6f(Y).\nProof. We have P(X, Y) = P(X)P(Y\u08afX) for the correct direction. Assume that there exists such\n\u2032 = \u02c6f(Y) to satisfy P(X\u08afY) = P(X\u08afX\n\u2032). Hence,\n\u2032). We then have P(X, Y) = P(Y)P(X\u08afX\n\u2032) = P(X)P(Y\u08afX)\nP(X\u08afX\nP(Y)\n\u2032\u08af <\u08afY\u08af, there must exist two values y1 \u2260 y2 such that \u02c6f(y1) = \u02c6f(y2), which implies\nBecause\u08afX\nP(X\u08af \u02c6f(y1)) = P(X\u08af \u02c6f(y2)). 
According to Equation (4), we have\n= P(X)P(Y = y2\u08afX)\nP(X)P(Y = y1\u08afX)\nP(Y = y2)\nP(Y = y1)\nP(Y = y1\u08afX) = P(Y = y2\u08afX) \u22c5 P(Y = y1)\nP(Y = y2) ,\n\na X\n\nor\n\n.\n\n(4)\n\n,\n\nwhich contradicts Assumption A1. Therefore, the reverse direction does not admit a low-cardinality\nhidden representation.\n\nNote that assumption A1 may be violated, but the chance for it to be violated should be low. Roughly\nspeaking, this assumption states that X and Y are not \u201clocally\u201d independent. Suppose assumption\n\nA1 does not hold; then there must exist y1,y2 satisfying P(Y = y1\u08afX) = cP(Y = y2\u08afX) for all\npossible values of X. One can derive that P(X\u08afY = y1) = P(X\u08afY = y2). This means if we ignore\n\nall the other possible values of Y other than y1 and y2, X and Y become independent. Generally\nspeaking, this will not hold when X and Y are dependent, especially when the cardinality of X is\nnot small. The experimental results also illustrate the plausibility of this assumption.\nAs an immediate result of Theorem 1, we have the identi\ufb01ability of the causal direction under the\nhidden compact representation model, as given in Theorem 2.\nTheorem 2. Assume that in the causal direction there exists the transformation Y\n\n\u2032 = f(X) such\n\u2032\u08af <\u08afX\u08af, and and assumption A1 holds. Then to produce the\nthat P(Y\u08afX) = P(Y\u08afY\nsame distribution P(X, Y), the reverse direction must involve more effective number of parameters\n\n\u2032), where\u08afY\n\nin the model than the causal direction.\n\nGoing one step further, Theorem 3 shows the BIC of the causal direction is asymptotically higher\nthan that of the reverse one. The proof of this theorem is provided in the supplementary material.\nTheorem 3. 
If the reverse direction involves more parameters than the causal direction to produce\n\nthe same distribution P(X, Y), the BIC of the causal direction is asymptotically higher than that of\n\nthe reverse one.\n\n5\n\n\f4 Experiments\n\nTo investigate the effectiveness of the proposed method based on the hidden compact representation\nmodel, we compare it with baseline algorithms on both synthetic data and the real world data. On\nsynthetic data, we simulate the data according to the hidden compact representation model. In all the\nexperiments, we generate 1000 different causal pairs and 2000 samples for each pair. On real-world\ndata, we run the algorithm on Pittsburgh Bridges dataset and Abalone dataset. The implementation of\nHCR can be found on CRAN 1.\nThe following \ufb01ve algorithms are taken as the baseline: ANM [Peters et al., 2010], SA [Liu and\nChan, 2016a], DC [Liu and Chan, 2016b], IGCI [Janzing et al., 2012] and CISC [Budhathoki and\nVreeken, 2017]. The parameter settings of all the algorithms are based on their origin work.\nTo make a fair comparison, the decision rate is used as the metric to evaluate the models\u2019 performance,\nsame as that in IGCI [Janzing et al., 2012] and CISC [Budhathoki and Vreeken, 2017].\n\n4.1 Synthetic Data with Hidden Compact Representation Model\n\nIn this set of experiments, the samples are generated according to the following two-stage proce-\ndure. Firstly, generate X from a multinomial distribution and its cardinality is randomly chosen\n\nfrom{3, 4, ..., 15}. Secondly, map each X to a value that uniformly samples from the interval\n{1, 2, ...,\u08afX\u08af}. 
Finally, randomly generate a conditional probability distribution P(Y\u08afY\n\u2032) and\n\u2032\u08af, ..., 15}.\n\nsample Y according to Y\n\n\u2032 and P(Y\u08afY\n\n\u2032), and\u08afY\u08af is generated from the interval{\u08afY\n\n(a) Sensitivity to Decision Rate.\n\n(b) Sensitivity to Sample Size.\n\nFigure 2: Results on the Hidden Compact Representation Model.\n\nFigure 2(a) shows the accuracy with difference decision rate. As shown in the \ufb01gure 2(a), HCR\noutperforms the baseline methods across all the decision rates. HCR achieves acceptable results\neven when the decision rate is 1, which shows HCR can reliably infer the causal direction for all the\ncause-effect pairs. In this set of experiment, ANM fails to work because its additive noise assumption\nmay not hold for the current causal mechanism.\nFigure 2(b) shows the performance of the algorithms with the sample size varying from 250 to 3000.\nThe decision rate is 1 in this set of experiments. As shown in the \ufb01gure, the performance of HCR\ngrows much faster than the baseline methods and converge to 1 when the sample size reaches 3000.\nThis shows that the hidden compact representation explores the information behind the data more\nef\ufb01ciently, compared with the other algorithms.\n\n4.2 Real-World Data\n\nTo further assess the performance of our method for the discrete casual inference, we test the\nalgorithms on two real-world datasets, Pittsburgh Bridges dataset and Abalone dataset. Both of them\nare wildly used in previous research and can be downloaded from UCI Machine Learning Repository\n[Lichman, 2013].\n\n1\n\nhttps://cran.r-project.org/package=HCR\n\n6\n\n0.000.250.500.751.000.000.250.500.751.00Decision RateAccuracyHCRANMSADCIGCICISC0.000.250.500.751.00100020003000Sample sizeAccuracyHCRANMSADCIGCICISC\fPittsburgh Bridges dataset: There are 108 bridges in this dataset. The following 4 cause-effect\npairs are known as ground truth in this experiment. 
They are 1) Erected (Crafts, Emerging, Mature,\nModern) \u2192 Span (Long, Medium, Short), 2) Material (Steel, Iron, Wood) \u2192 Span (Long, Medium,\nShort); 3) Material (Steel, Iron, Wood) \u2192 Lanes (1, 2, 4, 6); 4) Purpose (Walk, Aqueduct, RR,\nHighway) \u2192 type (Wood, Suspen, Simple-T, Arch, Cantilev, CONT-T).\n\nTable 1: Hidden Compact Representation on Pittsburgh Bridges Data Set.\n\nGround truth\n\nErected\u2192Span\n\nMaterial\u2192Span\n\nMaterial\u2192Lanes\n\nPurpose\u2192Type\n\n\u2032\n\nf(X) \u2192 Y\nf({Craf ts}) \u2192 1\nf({Emerging, M ature, M odern}) \u2192 2\nf({Steel}) \u2192 1\nf({Iron, W ood}) \u2192 2\nf({Steel}) \u2192 1\nf({Iron, W ood}) \u2192 2\nf({Aqueduct, Highway, W alk}) \u2192 1\nf({RR}) \u2192 2\n\nP(Y\u08afY\n\n\u2032)\n\nMedium: 0.5, Short:0.5\nLong: 0.37, Medium:0.59, Short:0.04\nLong: 0.42, Medium:0.58\nMedium:0.55, Short:0.45\n2 Lane:0.6, 4 Lane:0.33, 6 Lane:0.06\n1 Lane:0.15, 2 Lane:0.8, 4 Lane:0.04\nArch:0.18, Cantilev:0.12, CONT-T:0.12,\nSimple-T:0.24, Suspen: 0.15, Wood:0.19\nCantilev:0.06, CONT-T:0.03,\nSimple-T:0.81, NIL:0.3,wood:0.06\n\n\u2032 = 2, while Crafts is mapped to Y\n\nGenerally speaking, HCR can identify all 4 cause-effect pairs correctly. To gain an insight into the\nhidden compact representation, we give the reconstructed model in Table 1. In detail, the result on\n\u201cErected \u2192 Span\u201d shows that {Emerging, Mature, and Modern} of erected are mapped into a hidden\n\u2032 = 1. This hidden representation re\ufb02ects\ncompact representation Y\nthat Crafts is the main cause of the medium and short bridge, which is compatible with common\nsense. Similarly, from the results on \u201cMaterial \u2192 Span\u201d and \u201cMaterial \u2192 Lanes\u201d, we can see that\nthe steel belongs to modern material with high strength, while iron and wood are classic materials\nwith lower strength. This hidden property of the material causes the different span and lanes. 
Similar\nresults can be found in \u201cpurpose \u2192 type\u201d. All of those results on the four cause-effect pairs re\ufb02ect\nthat HCR is a proper representation of the causal mechanism for discrete data.\nFigure 3(a) shows the results of the algorithms on Pittsburgh Bridges dataset with different sample\nsizes. Because of the space limitation, only the average result of the four pairs are reported. As shown\nin the \ufb01gure, HCR outperforms the baseline methods and shows competitive performance even when\nthe sample size is smaller than 100, while the other four baselines are all failed to \ufb01nd the right causal\ndirection with such a small sample size. This re\ufb02ects HCR might be a suitable representation of the\ncausal mechanism in this real-world scenario.\n\n(a) Results on Pittsburgh Bridges Data Set.\n\n(b) Result on Abalone Data Set.\n\nFigure 3: Sensitivity to Sample Size.\n\nAbalone Data Set: This dataset contains 4177 samples and each sample has 4 different properties.\nThe ground truth contains three cause-effect pairs, Sex \u2192 {Length, Diameter, Height}. The property\nsex has three values, male, female and infant. The length, diameter, and height are measured in mm\nand treated as discrete values, similar to [Peters et al., 2010].\n\n7\n\n0.000.250.500.751.00255075100Sample SizeAccuracyHCRANMSADCIGCICISC0.000.250.500.751.0001000200030004000Sample SizeAccuracyHCRANMSADCIGCICISC\fTable 2: Hidden Compact Representation on Abalone Data Set.\nGround truth\n\n\u2032\n\nP(Y\u08afY\n\n\u2032)\n\nf(X) \u2192 Y\nf({Inf ant}) \u2192 1\nf({F emale, M ale}) \u2192 2\nf({Inf ant}) \u2192 1\nf({F emale, M ale}) \u2192 2\nf({Inf ant}) \u2192 1\nf({F emale, M ale}) \u2192 2\n\n0.43 \u00b1 0.1\n0.57 \u00b1 0.96\n0.33 \u00b1 0.088\n0.45 \u00b1 0.079\n0.11 \u00b1 0.032\n0.15 \u00b1 0.037\n\nSex\u2192Length\n\nSex\u2192Diameter\n\nSex\u2192Height\n\n\u2032 = 2. 
Here Y\n\nIn this dataset, HCR successfully determines the causal direction for all the three pairs, and the details\nof the model are given in Table 2. Because the properties have many discrete states, column Y shows\nits mean and standard variance. Although this dataset closely relates to the additive noise model,\nTable 2 demonstrates that HCR still successfully identi\ufb01es the causal direction and provides a fruitful\n\u2032 = 1, and maps {Female,\ninsight of the causal mechanism. In detail, the model maps Infant to Y\n\u2032 indicates that categorizing the sex of abalones into male and female is\nMale} to Y\nredundant relative to the considered effect, which is Length, Diameter, or Height. In the second stage,\nthe mapping shows the maturity causes the sizes.\nWe also compare our results with the baseline methods with different sample sizes. As shown in\nFigure 3(b), although this dataset follows the assumptions of the discrete additive noise model, HCR\nstill outperforms ANM. Note that, ICGI and CISC achieve the same performance as HCR and their\ncurves are covered by that of HCR. Moreover, SA and DC fail to give the correct direction on this\ndata set, perhaps because they are designed for the discrete data with a small number of cardinalities\nwhile the cardinality of the variables is very large in this dataset. These results also indicate that HCR\nmay provide a suitable representation of the causal mechanism in various scenarios.\nAs a summary, HCR stably outperforms all the baseline methods on these two real-world discrete\ndatasets and, furthermore, shows a meaningful hidden compact representation of the causal mecha-\nnism.\n\n5 Conclusion\n\nFinding causal direction between discrete variables is an important but challenging problem. In this\npaper, we make an attempt to solve this problem by developing a low-cardinality hidden representation\nmodel for the causal mechanism, which decomposes the mechanism into two stages. 
With this model estimated by the Bayesian Information Criterion (BIC), we develop an effective causal discovery method for discrete variables. Theoretical analysis also shows that the model is generally identifiable; it fails to be identifiable only when some weak technical conditions on the causal mechanism are violated. Experimental results on both synthetic and real data verify our theoretical results and support the validity of the proposed model, at least in a number of real situations. In future work, we plan to extend the proposed method to discrete data with confounding factors.

Acknowledgments

This research was supported in part by NSFC-Guangdong Joint Fund (U1501254), Natural Science Foundation of China (61876043, 61472089), NSF of Guangdong (2014A030306004, 2014A030308008), Science and Technology Planning Project of Guangdong (2015B010108006, 2015B010131015), Guangdong High-level Personnel of Special Support Program (2015TQ01X140), and Pearl River S&T Nova Program of Guangzhou (201610010101). This material is partially based upon work supported by the United States Air Force under Contract No. FA8650-17-C-7715, by the National Science Foundation under EAGER Grant No. IIS-1829681, by the National Institutes of Health under Contract Nos. NIH-1R01EB022858-01, FAIN-R01EB022858, NIH-1R01LM012087, NIH-5U54HG008540-02, and FAIN-U54HG008540, and by work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Air Force, the National Institutes of Health, or the National Science Foundation.
We appreciate the comments from anonymous reviewers, which greatly helped to improve the paper.

References

James C Bezdek and Richard J Hathaway. Convergence of alternating optimization. Neural, Parallel & Scientific Computations, 11(4):351-368, 2003.

Kailash Budhathoki and Jilles Vreeken. MDL for causal inference on discrete data. In ICDM, pages 751-756, 2017.

Ruichu Cai, Jie Qiao, Zhenjie Zhang, and Zhifeng Hao. SELF: Structural equational embedded likelihood framework for causal discovery. In AAAI, 2018.

Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In NIPS, pages 689-696, 2009.

Dominik Janzing, Joris Mooij, Kun Zhang, Jan Lemeire, Jakob Zscheischler, Povilas Daniušis, Bastian Steudel, and Bernhard Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1-31, 2012.

M. Lichman. UCI machine learning repository, 2013.

Furui Liu and Laiwan Chan. Causal discovery on discrete data with extensions to mixture model. ACM Transactions on Intelligent Systems and Technology (TIST), 7(2):21, 2016.

Furui Liu and Laiwan Chan. Causal inference on discrete data via estimating distance correlations. Neural Computation, 2016.

Judea Pearl and Thomas S Verma. A theory of inferred causation. Studies in Logic and the Foundations of Mathematics, 134:789-811, 1995.

Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Identifying cause and effect on discrete data using additive noise models. In AISTATS, pages 597-604, 2010.

Gideon Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.

Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003-2030, 2006.

Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu Kawahara, Takashi Washio, Patrik O Hoyer, and Kenneth Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225-1248, 2011.

Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2000.

Kun Zhang and Laiwan Chan. Extensions of ICA for causality discovery in the Hong Kong stock market. In Proc. 13th International Conference on Neural Information Processing (ICONIP 2006), 2006.

Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. In UAI, pages 647-655, 2009.