Review for NeurIPS paper: Applications of Common Entropy for Causal Inference

NeurIPS 2020

Meta Review

While the reviewers took issue with the similarity to prior work, the lack of theoretical evidence for the small-sample improvement claim, the lack of experimental comparison with related work, and the clarity of the presentation, I believe the authors answered these concerns sufficiently in their rebuttal, and that the work is sufficiently novel, both methodologically and theoretically, to warrant acceptance.

First, the reviewers had the following main concerns:

- Concern #1 (R1): The idea is too similar to [23].
> Author response: In [23] the authors make an assumption which allows them to rule out the existence of an unobserved confounder Z that makes cause X and effect Y independent. They do so by assuming Z is 'low-complexity', which means either discrete or compact (in their experiments they only consider discrete Z). This is different from our approach.
-> My take: I agree with the authors. The assumption here (Assumption 1 in the paper) is that any unobserved confounder Z has low entropy, with a known upper bound. Given this, they can distinguish between a spurious correlation between cause X and effect Y and X directly causing Y. This is a different take on 'low-complexity', using entropy. While the intuition is the same (the confounder is simple in some respect), the assumptions on the confounder are totally different.

- Concern #2 (R1): The authors claim their method is beneficial in the small-sample regime, but there is no theoretical evidence.
> Author response: Finite-sample theory for entropy is very difficult; good analytical bounds on common entropy are currently unknown, and would be necessary.
-> My take: The authors already show experimentally that their method is more stable than the standard PC algorithm as the sample size is reduced (from 80,000 to 1,000), and that it improves upon it everywhere.
This makes sense, as one is making stronger assumptions than the standard PC algorithm (i.e., a known upper bound on the entropy of all unobserved confounders), but it is very interesting to see what those additional assumptions buy you.

- Concern #3 (R2): The motivation of the method is to avoid a high-dimensional X and Y. Further, the evaluation on Adult is poor because they assume a causal ground truth when there isn't one.
> Author response: This is not our motivation; the motivation is to understand whether making an assumption on the maximum allowed entropy of a confounder can lead to improved causal discovery. Further, we use a common way to evaluate causal discovery methods, by using data to fit the parameters of a fixed causal graph.
-> My take: I completely agree with the authors here.

- Concern #4 (R3): The authors do not compare to related work such as extensions of the PC algorithm.
> Author response: Our goal was to isolate improvements given by our method from those given by extensions to PC; we will include this additional comparison.
-> My take: Extensions to the PC algorithm are completely orthogonal to this work. In fact, the choice of the PC algorithm is arbitrary; I believe the approach in this work could have been used to modify other causal discovery methods such as FCI. The suggested comparison is interesting, as there may be assumptions in PC extensions that have an overlapping effect with the current approach, but this approach can be applied to any causal discovery method, potentially yielding added benefits.

- Concern #5 (R5): There are a number of presentation issues:
(a) The introduction of the paper seems to mislead readers by focussing on distinguishing correlation between X and Y from an actual causal effect between X and Y, whereas the rest of the paper considers a mediation scenario where the causal effect between X and Y is mediated by Z. The significance of the mediation graph needs to be better discussed.
(b) The direct graph (X->Y) is often forgotten in discussions.
(c) The authors seem to switch arbitrarily between different types of entropies without explanation.
(d) Better discussion of tuning parameters and limitations is needed.
> Author response: They will make (b), (c), (d) clearer.
-> My take: Based on the content of the paper and the rebuttal, I believe they will be able to improve paper clarity. I urge the authors to also fix point (a).

So I think the authors adequately addressed all of the reviewers' concerns. Beyond these concerns, I believe the paper warrants acceptance for the following reasons:

- The idea of using an assumption on the maximum entropy of unobserved confounders to improve cause-effect identifiability is novel to my knowledge. While this isn't the first work to make some assumption on unobserved confounders to improve identifiability, the type of assumption is new.
- They introduce a novel relaxation for computing the Renyi entropy of two random variables, and an algorithm for discovering the joint distribution q(x,y,z). I encourage the authors to include a proof of Theorem 1 either in the main paper or the supplement.
- They have insightful experiments on: the performance of LatentSearch (its error for different dimensionalities of X, Y, Z), a test of a conjecture about how Renyi-1 entropy is bounded by the minimum of the entropies of H and Y, procedures on how to set \alpha based on a cause-effect dataset, and an evaluation of Algorithms 2 and 3. I am impressed by the amount of analysis here.

I think the biggest issue with the paper is presentation. There are many details to cover here. I would suggest that the authors describe Sections 1, 3, 4, and 5 more carefully and comprehensively, and potentially move all of Section 2 (except the definition of Renyi entropy and other details that make the following sections clear) to the supplement.
While Algorithm 1 and the relaxation are novel, I think they detract from the core of the paper: using an assumption on the entropy of the latent confounder to improve identifiability. Ultimately, how Renyi entropy is calculated seems somewhat orthogonal to the rest of the paper, so long as there is discussion of, and experiments on, the behaviour of the method and how that behaviour impacts the identifiability results (I think the experiments already do this well). Also, I urge the authors to improve the presentation of the paper based on the clarity suggestions of the reviewers.
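
As a small illustration of the quantities discussed above, the following sketch computes the Renyi entropy of a discrete distribution and a trivial upper bound on common entropy. It is not the authors' LatentSearch algorithm or their relaxation. It assumes common entropy is the smallest entropy of a variable Z that makes X and Y conditionally independent, so choosing Z = X or Z = Y gives the bound min{H(X), H(Y)}. The function names and the example joint distribution are hypothetical.

```python
import math


def renyi_entropy(p, alpha):
    """Renyi entropy (in bits) of a discrete distribution p for order alpha >= 0.

    alpha = 1 is handled as the Shannon-entropy limit.
    """
    probs = [q for q in p if q > 0]
    if abs(alpha - 1.0) < 1e-12:
        return -sum(q * math.log2(q) for q in probs)
    return math.log2(sum(q ** alpha for q in probs)) / (1.0 - alpha)


def marginals(joint):
    """Row and column marginals of a joint table p(x, y) given as a list of rows."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py


def common_entropy_upper_bound(joint):
    """Trivial bound min{H(X), H(Y)}: Z = X (or Z = Y) always separates X and Y."""
    px, py = marginals(joint)
    return min(renyi_entropy(px, 1.0), renyi_entropy(py, 1.0))


# Hypothetical 2x2 joint distribution with correlated X and Y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
bound = common_entropy_upper_bound(joint)
print(f"min(H(X), H(Y)) = {bound:.3f} bits")
```

In this symmetric example both marginals are uniform, so the bound is 1 bit. The bound only brackets the common entropy from above; estimating the actual value requires a search over latent variables, which this sketch does not attempt.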