NeurIPS 2020

Counterfactual Data Augmentation using Locally Factored Dynamics

Meta Review

Reviewers were positive and excited about the paper, and I agree with the general sentiment that the work is a significant step in the right direction. My recommendation is "accept." That said, there are some issues I would like to see fixed so that the final version is easier to read, sound, consistent, and well-positioned with respect to the broader literature. To that end, first read the reviews carefully and incorporate their feedback as much as you can. Below I list some critical issues, mostly in addition to those raised by the reviewers.

The definition of minimality is not consistent and may lead to problems in other parts of the paper (as discussed in the reviews). Please redefine the causal model to account for the bipartite structure mentioned in the rebuttal; that is a strong constraint over the space of SCMs but appears to be enough for the paper's purposes. This is a serious issue and should not be overlooked.

The discussion of model-based versus model-free is a common source of confusion. The distinction is about whether or not one has the specific parameterization of the underlying model (e.g., P(S' | S, A), P(R | S, A)). This paper assumes much more than a parametric model of the dynamics, namely detailed knowledge of the causal structure itself. While this is okay, claiming that the proposed procedure is "model-free" is somewhat far-fetched.

The idea of mixing and matching independent parts of the data, coming from distinct mechanisms, seems quite nice. The notion of context-specific independence (CSI; see Boutilier, Koller, et al.) seems to be the key behind the current approach. Contribution #3, on using "attention-based" methods and "disentangle state space," seems a bit like a distraction (even though a nice one). Instead of just "augmenting data," why not have an algorithm that considers the corresponding CSIs?
Data augmentation is a less desirable mode of learning (e.g., people rotate or scale images and feed them into convolutional NNs because CNNs cannot handle rotation in general). For instance, in the robot example with left and right arms, one can further exploit the symmetry relationship, feeding mirrored "left arm" data in as "right arm" data. Data augmentation is all about the data-generating process and prior knowledge, not counterfactuals.

Data based on CoDA might be selection-biased or causally invalid. Samples to be mixed satisfy some kind of "independence criteria," and conditioning on such criteria will result in selection bias. The authors seem to acknowledge, very briefly, that such a selection bias is possible in Remark 3.3. Selection bias can adversely affect the agent's performance, but the paper does not perform any experiments demonstrating robustness against this bias. Furthermore, imbalanced data that over-represents augmented samples without "interaction" may harm the agent's performance in cases where subprocesses interact. Readers may appreciate it if you add some acknowledgment of this phenomenon from an SCM-graphical perspective; for example, see the discussion in Bareinboim and Pearl, 2012, 2016 for more details on the semantics and the conditions needed for recoverability of data from selection bias.

Further, consider a scenario with two initial conditions, C1 and C2, which result in (transition1A, transition1B) and (transition2A, transition2B), respectively. Assume that they can be safely mixed according to CoDA, so that we get the new samples (transition1A, transition2B) and (transition2A, transition1B). However, it might be the case that, under condition C1, (transition1A, transition2B) is impossible. Imagine the billiard setting, and let "transition1A" be one ball moving along the top side and "transition2B" be another ball moving along the bottom side (in parallel). In a friction-free environment, this seems impossible and not "causally-valid."
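To make the mixing scenario concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Transition` container, the `interacts` distance test, the `mix` helper, and the example coordinates are all hypothetical. It shows only one narrow failure mode, where two transitions each pass an independence check but the mixed sample does not. Whether a mixed sample is reachable at all under a given initial condition, which is the concern raised above, is a separate question that this check does not answer.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

State = Tuple[float, float]  # hypothetical (x, y) position of one ball


@dataclass
class Transition:
    # Per-component states before and after one step, keyed by component name.
    before: Dict[str, State]
    after: Dict[str, State]


def interacts(states: Dict[str, State], radius: float = 1.0) -> bool:
    """Hypothetical locality test: two balls closer than `radius` interact."""
    items = list(states.values())
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            dx = items[i][0] - items[j][0]
            dy = items[i][1] - items[j][1]
            if (dx * dx + dy * dy) ** 0.5 < radius:
                return True
    return False


def mix(t1: Transition, t2: Transition, component: str) -> Optional[Transition]:
    """Take `component` from t2 and every other component from t1.

    Returns None if either source transition, or the mixed sample itself,
    fails the (assumed) independence check.
    """
    if interacts(t1.before) or interacts(t2.before):
        return None
    before = {**t1.before, component: t2.before[component]}
    after = {**t1.after, component: t2.after[component]}
    if interacts(before):
        return None  # independent in each source, but not once mixed
    return Transition(before, after)


# Ball A moves along the top, ball B along the bottom.
t1 = Transition({"A": (0.0, 5.0), "B": (0.0, 0.0)},
                {"A": (1.0, 5.0), "B": (1.0, 0.0)})
# Independent on its own, but its ball B sits right next to t1's ball A.
t2 = Transition({"A": (10.0, 0.0), "B": (0.5, 5.0)},
                {"A": (11.0, 0.0), "B": (1.5, 5.0)})
# Its ball B is far from t1's ball A, so mixing passes the check.
t3 = Transition({"A": (3.0, 3.0), "B": (0.0, -5.0)},
                {"A": (4.0, 3.0), "B": (1.0, -5.0)})

rejected = mix(t1, t2, "B")
accepted = mix(t1, t3, "B")
```

In this sketch `rejected` is `None`, while `accepted` pairs ball A from `t1` with ball B from `t3`. Note that filtering on such a check is itself a form of conditioning, which is exactly where the selection-bias concern enters.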
There must be a serious discussion of the meaning of "causally-valid" data with respect to plausibility and selection bias.

The notion of a counterfactual is about events that cannot be realized in the real world (see Pearl, Ch. 7), where no data is available. The use of this expression seems inconsistent with the literature, since data is factual, about one specific, realized world. Since some causal readers may find this usage somewhat off-putting, I would recommend adding a footnote explaining the difference from the interpretation employed in the paper.