Reviews: Consistent Estimation of Functions of Data Missing Non-Monotonically and Not at Random

NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona

Paper ID:	1571
Title:	Consistent Estimation of Functions of Data Missing Non-Monotonically and Not at Random

Reviewer 1

Summary

The authors propose consistent estimators for data missing not at random. These make no assumptions on the underlying data-generating system, but only assume a model of missingness conditional on the data. This model is represented by a chain graph in which there is no direct influence from a variable to its own missingness indicator. This is an interesting step toward handling missing data situations using chain graphs. There are some presentational issues, and several points could be explained better.

Qualitative Assessment

Technical quality:
- Lemma 1 needs a proper proof. The separation property of chain graphs from which it follows should be stated.
- In my understanding the lemma has an error. Consider the two-variable model L1->L2, L1->R2, L2->R1, R1-R2. The path L2->R1-R2 has a non-collider section R1-R2 with no nodes conditioned on, so L2 _||_ R2 | L1 does not follow from the separation property. Please correct me if this is wrong. Should Lemma 1 state "R_i _||_ L_i | L/L_i, R/R_i" instead of the current "R_i _||_ L_i | L/L_i"? The first is the independence used in the proof of Lemma 2, and it also seems correct according to the separation property.
- The proof of Lemma 2 needs to be much better. The independencies used in the first equation should be discussed. From what does each of the equalities follow?
- Section 6 makes claims about pseudolikelihood and IPW estimators; these should be supported by citations.
- The simulations are simple but sufficient.

Novelty:
- Generalizing the results of [8] to cover dependencies between missingness indicators is a step forward.

Impact:
- This is an incremental step in covering missing data scenarios. It could generate more results.

Presentation:
- Explain what monotonicity and non-monotonicity of missingness mean right at the beginning. Monotonicity may mean several things to a non-expert reader (for example, that higher values of a certain variable are more likely to be missing).
- When emphasizing the generality in the beginning, it is also necessary to mention the limiting assumption used: variables cannot drive their own missingness status. This essential assumption should already appear in the abstract. BTW, doesn't the example on line 049 contradict this?
- Line 053: natural ordering of what? Missingness indicators? Variables?
- Sec 2, 1st par: L* is not defined yet.
- Sec 2, 4th par: It is a bit unclear whether "in this approach" refers to the previous work explained in the previous paragraph, the approach in the current paper, or both.
- Sec 3, last par: "On the other hand" appears twice in a row.
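The Lemma 1 counterexample raised under technical quality can be checked mechanically. The following is a minimal sketch, assuming the LWF moralization criterion for chain graphs (link all parents of each chain component, drop directions, then test ordinary graph separation). The function names are hypothetical, and the step of restricting to the smallest anterior set is skipped because, for this four-node graph, that set is the whole graph for both queries.

```python
from collections import deque
from itertools import combinations

# The reviewer's example: L1->L2, L1->R2, L2->R1, and the undirected edge R1-R2.
NODES = {"L1", "L2", "R1", "R2"}
DIRECTED = [("L1", "L2"), ("L1", "R2"), ("L2", "R1")]
UNDIRECTED = [("R1", "R2")]


def chain_components(nodes, undirected):
    """Connected components of the graph restricted to undirected edges."""
    adj = {n: set() for n in nodes}
    for a, b in undirected:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v] - comp:
                comp.add(w)
                queue.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def moral_graph(nodes, directed, undirected):
    """Undirected moral graph: all edges, plus links between parents of each chain component."""
    adj = {n: set() for n in nodes}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for a, b in directed + undirected:
        link(a, b)
    for comp in chain_components(nodes, undirected):
        parents = {a for a, b in directed if b in comp}
        for a, b in combinations(sorted(parents), 2):
            link(a, b)
    return adj


def separated(adj, a, b, given):
    """True if every path from a to b in adj passes through a node in `given`."""
    seen, queue = {a}, deque([a])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w in given or w in seen:
                continue
            if w == b:
                return False
            seen.add(w)
            queue.append(w)
    return True


ADJ = moral_graph(NODES, DIRECTED, UNDIRECTED)
print("L2 _||_ R2 | L1     implied:", separated(ADJ, "L2", "R2", {"L1"}))
print("L2 _||_ R2 | L1, R1 implied:", separated(ADJ, "L2", "R2", {"L1", "R1"}))
```

Under these assumptions, only the statement that also conditions on R1 is implied by the graph, which matches the correction suggested for Lemma 1.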

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

Reviewer 2

Summary

The paper describes a method to infer unbiased probability estimates from data sets with data missing not at random (MNAR), given a missingness model that satisfies certain properties (the ‘no self-censoring model’). The missingness model is specified using a chain graph, consisting of a complete DAG model between the observed variables L, with arcs into an undirected clique between the missingness variables R. The authors show the probability distribution over this model is nonparametrically identifiable from the observed data, and provide a log-linear parameterization that allows estimation through a (standard) algorithm that iteratively maximizes the pseudo-likelihood function. Feasibility of the approach is evaluated in a simple artificial data experiment, and some suggestions for further improvements are provided. The paper is very well written, with clear examples, solid maths, and a good use of and embedding in existing theoretical approaches. The ‘no self-censoring model’ seems a bit (unnecessarily) restrictive, though the final extensions suggest this can be relaxed a bit. Results of the numerical evaluation of the estimation model are somewhat disappointing, but the theory behind it is sound, and is likely to inspire new approaches to this challenging but long neglected problem. In summary: solid paper, challenging problem, interesting solution, no definitive answer but likely to inspire new ideas. Clearly belongs in the conference.
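For readers unfamiliar with the estimation step mentioned in this summary, a generic pseudo-likelihood objective can be written as below. This is only a sketch of the standard definition, not the paper's exact objective: the symbols N (sample size), k (number of variables), r_{-i} (all missingness indicators except the i-th), l_{-i} (all variables except the i-th, reflecting no self-censoring), and θ (log-linear parameters) are notation assumed here, and the way the paper handles unobserved entries of L is not reproduced.

```latex
\mathrm{PL}(\theta) \;=\; \prod_{n=1}^{N} \prod_{i=1}^{k}
  p\bigl(r_i^{(n)} \,\big|\, r_{-i}^{(n)},\, l_{-i}^{(n)};\, \theta\bigr)
```

Each factor is a univariate conditional, so maximizing PL avoids computing the normalizing constant of the joint distribution over the clique of missingness indicators.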

Qualitative Assessment

The problem tackled in this paper is interesting, challenging, and encountered frequently in practice. Technical quality is high, with a good balance between theoretical details and instructive examples. Proofs follow naturally from known results. The approach in the form of the chain graph missingness model is new, with an elegant connection to log-linear models for parameterization and subsequent estimation. Experimental evaluation is limited, but not key to the paper: its main contribution lies in developing a new approach to the problem.

Main drawbacks:
- need to specify the full missingness model (which precludes this approach from helping to learn the model over L under MNAR data; this is usually one of the main problems in real-world experiments)
- 'no self-censoring' excludes an important missingness mechanism in practice (but perhaps necessarily so)
- poor performance: the accuracy obtained is disappointing, but it seems that this is mainly due to the relatively low sample size used in the study

Still, I expect these results will be of interest to a wider audience, and recommend acceptance.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

Reviewer 3

Summary

The authors propose a graphical missingness model for MNAR and non-monotonic data. The main contribution of the paper is to present the identification conditions for the proposed model. The authors also derive an IPW estimator for the resulting model under a specific parameterization. The performance of the estimator is tested on synthetic data generated from a simple model.
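As background for the IPW estimator mentioned in this summary, the following is a minimal sketch of generic inverse probability weighting, not the estimator derived in the paper. It assumes a toy setting in which the observation probability is known and depends only on a fully observed binary covariate (so the data are missing at random, which is simpler than the MNAR model under review). The helper `ipw_mean` and all variable names are hypothetical.

```python
import random


def ipw_mean(values, observed, propensity):
    """Normalized IPW mean: weight each observed value by 1 / P(observed)."""
    num = sum(v / p for v, r, p in zip(values, observed, propensity) if r)
    den = sum(1.0 / p for r, p in zip(observed, propensity) if r)
    return num / den


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20000

# Binary covariate x, always observed; outcome y has mean 2 when x is True, else 0.
x = [rng.random() < 0.5 for _ in range(n)]
y = [rng.gauss(2.0 if xi else 0.0, 1.0) for xi in x]

# Assumed known observation probabilities: y is seen more often when x is True.
prop = [0.9 if xi else 0.3 for xi in x]
obs = [rng.random() < p for p in prop]

full = sum(y) / n                                        # mean of the complete data
cc = sum(v for v, r in zip(y, obs) if r) / sum(obs)      # unweighted complete-case mean
ipw = ipw_mean(y, obs, prop)                             # reweighted complete-case mean

print(f"full-data mean:     {full:.3f}")
print(f"complete-case mean: {cc:.3f}")
print(f"IPW mean:           {ipw:.3f}")
```

In this toy setting the unweighted complete-case mean over-represents rows with x true, while the weighted mean lands close to the full-data mean. In the paper's setting the weights would instead come from the fitted missingness model.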

Qualitative Assessment

From a theoretical perspective, this is impressive work. The proposed missingness model is general and the identification conditions are easy to check. One concern I have is the validity of the 'no self-censoring' condition in practice. For instance, the example given in the introduction (lines 47-50) hints at a model that is not in accordance with the 'no self-censoring' condition. The authors may want to revise that specific example to be more consistent throughout the paper. A brief discussion on how to relax that assumption may also be useful. For instance, in the parametric treatment of the missing data pattern (in Rubin's framework), the low rank assumption seems to be effective (see Elad Hazan, Roi Livni, and Yishay Mansour. Classification with low rank and missing data. arXiv preprint arXiv:1501.03273, 2015).

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

Reviewer 4

Summary

Qualitative Assessment

The paper proposes to use a chain graph model for data with missing observations. Under this model, the paper proves that the full data law is non-parametrically identified from the observed data (with a mild assumption). The technical idea is presented clearly in the paper. The result is much stronger than the previous result under Bayesian networks, where, under a DAG model for the data variables and missingness variables, the full data law is non-parametrically identified. The paper generalizes that assumption by extending the DAG modelling to chain graph models. Under the chain graph model, the paper establishes an IPW estimator for the full data. The simulation case study is over-simplified. It is not clear how the method behaves in a slightly more complicated scenario, e.g. if the observation function is nonlinear. Another limitation is that the dependency structure of L and R must be known a priori; it is unclear what happens if part of the dependencies are missing as well. Readers may also expect more experimental evaluation.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

Reviewer 5

Summary

The paper attempts to generalize results (including estimation results) on the identification of the full data law in the missing not at random (MNAR) setting, from the case where the full data law is Markov to a DAG to the class of LWF chain graphs. It also includes a related simulation study and extensions of the proposed model.

Qualitative Assessment

My assessment is that the paper borders on having a fatal flaw. Although the authors claim that they deal with the case where the full data law is Markov to chain graphs, in reality they only work with one graph (for every size) which is a chain graph, and not any collection of chain graphs. Even worse, the proposed chain graph is in fact Markov equivalent (i.e. induces the same independence structure) to a simple undirected graph, as there is no unshielded collider in the graph. Therefore, one can simply use a factorization for undirected graphs for identification purposes. The fact that the data law is not identified in the example of Figure 1(e) is simply because directing the edges of the R block (i.e. ordering the R_i) produces an unshielded collider V-configuration (R_1, R_3, L_1), and hence the graph is no longer Markov equivalent to an undirected graph.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

Reviewer 6

Summary

This paper is concerned with the problem of deriving consistent estimators for datasets which have fields that are missing non-monotonically and not at random (MNAR). The authors frame this problem in the language of graphical models by introducing a chain graph depiction of the problem, propose a solution using inverse probability weighting, and suggest an estimation technique based on pseudo-likelihood estimation. A small set of experiments on a simple synthetic dataset is carried out to evaluate the proposed technique empirically.

Qualitative Assessment

I found the paper to be well written and fairly easy to follow overall. However, I have two concerns regarding this work:

1. The lack of discussion around related work makes it difficult to fully judge the contributions of the paper to the causal inference community. For example, the non-monotone MNAR setting has been studied, and a solution has been proposed using inverse propensity weighting, by Robins et al. (1999) and Vansteelandt et al. (2007). A nice summary of these approaches is given by Li et al. (2011). A simple discussion relating the estimation technique proposed in section 5 to the previously mentioned work would be very helpful. The proposed work seems unique in that the authors explicitly frame the problem within the context of graphical models and propose a pseudo-likelihood approach for estimation. However, it is difficult to judge the novelty of the proposed contributions without additional context.

2. The authors use a very simple synthetic domain for the experimental evaluation. I would have liked to see a larger array of functional forms and dependence structures. It is also not clear from the results that the proposed method achieves a substantial improvement over taking no action at all. In many cases, the adjusted values provide a worse estimate than performing no adjustment at all. It may be the case that more substantial improvement is seen with larger sample sizes, but this should be shown within the main text of the paper.

There is a small typo in footnote one, where the independence symbol is erroneously defined as conditional independence. The clarification is also not likely to be necessary given the intended audience.

Overall, this paper presents an interesting idea for an important problem, but the lack of framing and of a thorough evaluation make it difficult for me to suggest acceptance.

* Robins JM, Rotnitzky A, and Scharfstein D. "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models." Statistical models in epidemiology: the environment and clinical trials. 1999
* Vansteelandt S, Rotnitzky A, and Robins J. "Estimation of regression models for the mean of repeated outcomes under nonignorable nonmonotone nonresponse." Biometrika. 2007
* Li, Lingling, et al. "On weighting approaches for missing data." Statistical methods in medical research. 2011

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)