__ Summary and Contributions__:
The paper addresses the linear regression problem in the presence of missing data that may be Missing Not At Random (MNAR). It first derives the analytical form of the Bayes predictor under Missing At Random (MAR) and Gaussian self-masking missing data mechanisms, then proposes a neural network architecture to approximate the Bayes predictor. The proposed learning method is empirically shown to perform well.

__ Strengths__:
The paper addresses a relevant problem, the proposed method is novel, and the results should be useful in practice.

__ Weaknesses__:
The proposed method is limited to learning linear models.

__ Correctness__:
The claims appear to be correct.

__ Clarity__:
The paper is well written.

__ Relation to Prior Work__:
The relation to prior work is clearly discussed.

__ Reproducibility__: Yes

__ Additional Feedback__:
What is v in Equation (8)?
===After author rebuttal ===
My opinion has not changed. I think the paper should be accepted.

__ Summary and Contributions__: The authors derive the optimal linear predictor under various missing data mechanisms, including missing at random (MAR) and self-masking missing not at random (MNAR). Specifically, a new architecture, called Neumann networks, is presented based on a Neumann series approximation of the Bayes predictors. Many methods for handling missing data have been proposed so far, but the proposed method is significant in that it can handle missing not at random mechanisms. Moreover, the proposed method scales well, unlike conventional methods for MNAR models. In the experiments, the proposed method is shown to be more robust than conventional methods, especially in the MNAR case.

__ Strengths__: The derivation of the expression of the Bayes predictor under the various missing value mechanisms is novel and theoretically valid. Furthermore, an approximation error bound for the Neumann network is also derived theoretically. I think this work is an important achievement in missing data research.

__ Weaknesses__: In the derivation, the data X are assumed to be Gaussian and the predictor is assumed to be a linear model. These assumptions appear to limit the range of applications.

__ Correctness__: The authors argue that the proposed method is more robust than other methods for all missing data mechanisms, but in fact, as shown in Fig. 4, the conventional method performs better under MCAR when the number of samples is small. Smaller sample sizes should also be compared in the MNAR case.

__ Clarity__: The introduction is great. The paper is well written and formal.

__ Relation to Prior Work__: The authors review the conventional methods for various missing data mechanisms.
Moreover, the difference between the proposed method and the others is clear.

__ Reproducibility__: Yes

__ Additional Feedback__: Typo:
The minus sign is missing in the R2-Bayes rate (d=50 case) for MCAR in Fig. 4.

__ Summary and Contributions__: The paper derives analytical expressions of optimal predictors in the presence of Missing Completely At Random (MCAR), Missing At Random (MAR) and self-masking missingness in the linear Gaussian case. Then, the paper proposes the Neumann Network for learning the optimal predictor in the MAR case and shows insights into, and a connection to, neural networks with ReLU activations. There are two challenges in learning the optimal predictor from data containing missing values:
1) computing the inverse of covariance matrices in the MAR optimal predictor;
2) learning the optimal predictor requires 2^d predictors, one per missingness pattern, where d is the number of features/covariates.
For the first one, the paper provides a theoretical analysis: the inverse is approximated recursively, with convergence and upper-bound guarantees.
For the second one, the Neumann Network shares weights across the optimal predictors for different missingness patterns, which turns out empirically to be more data efficient and robust in self-masking missingness cases.
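
To make the first point concrete, the sketch below shows a generic truncated Neumann series for the inverse of a symmetric positive definite matrix, computed recursively. It is only an illustration of the idea under stated assumptions, not the paper's architecture: the function name `neumann_inverse`, the choice of step size (one over the largest eigenvalue), and the example matrix `cov` are hypothetical.

```python
import numpy as np


def neumann_inverse(a, depth):
    """Approximate inv(a) by a Neumann series truncated after `depth` steps.

    Assumes `a` is symmetric positive definite (an assumption of this sketch).
    """
    identity = np.eye(a.shape[0])
    # Scale by 1 / (largest eigenvalue) so that the eigenvalues of
    # (I - step * a) lie in [0, 1) and the series converges.
    step = 1.0 / np.linalg.norm(a, 2)
    residual = identity - step * a
    approx = identity.copy()
    for _ in range(depth):
        # Recursion S_l = (I - step * a) S_{l-1} + I, so that after `depth`
        # steps approx = sum_{k=0}^{depth} (I - step * a)^k.
        approx = residual @ approx + identity
    # inv(a) = step * inv(step * a) = step * sum_k (I - step * a)^k
    return step * approx


# Hypothetical 2x2 covariance matrix used only for illustration.
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
exact = np.linalg.inv(cov)
error_shallow = np.linalg.norm(neumann_inverse(cov, 5) - exact)
error_deep = np.linalg.norm(neumann_inverse(cov, 200) - exact)
```

In this sketch the truncation depth plays the role of the number of recursive steps: a deeper recursion gives a smaller approximation error, at the cost of more matrix products.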

__ Strengths__: The analytical expression and discussion of the optimal predictor under self-masking missingness in Section 2 is novel and significant for the study of missing data problems. This could be helpful for understanding and dealing with self-masking missingness, which is still a challenge in many fields.
The theory-guided neural network design is an interesting and neat way to solve the learning problem. The network is guided by an approximation method that comes with an analysis of convergence and an upper bound. Moreover, the theoretical and empirical analysis of the neural network, which shows the connection to the ReLU network and the required number of samples, is much appreciated.

__ Weaknesses__: The approximation and the Neumann network are based on the expression of the optimal predictor under MAR, which is a less interesting problem, even though the paper shows empirically that the proposed method performs well in self-masking missingness cases. Lines 190-192 say that there could be another similar network for self-masking missingness, which may be less satisfying. I am not asking for a solution/implementation for self-masking missingness cases here, but I would like to see a discussion or proper justification of why and when the proposed method can be used for self-masking missingness (or even for MNAR, the larger class of missingness mechanisms), and under which kind of guarantee. The contribution to the self-masking missingness (or even MNAR) problem would then be much more helpful for the community.
===Update ===
My opinion has not changed. Equation (2) in the author feedback is based on the approximation that the paper itself describes as "poor".

__ Correctness__: The following comments concern the experiments.
The paper claims that the method is robust in the MNAR case, which may not be an accurate claim. Note that MNAR is a large class of missingness mechanisms, much larger than M(C)AR together with self-masking missingness. In fact, “MNAR” as used in the paper (as in lines 77, 246, and 289-290) would be more accurate if replaced with self-masking missingness, because the analysis, discussion and experiments are all based on self-masking missingness.
I would like to know why the experiments did not test the MAR scenario, which is quite different from MCAR in many settings.
Moreover, I am curious how the method would perform in the case where MAR and self-masking missingness happen at the same time.
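
For readers less familiar with the distinction drawn above, the sketch below simulates three missingness mechanisms on synthetic Gaussian data. It is a minimal illustration under assumptions of my own, not the paper's experimental setup: the function names, the logistic form used for MAR, and the Gaussian-shaped self-masking probability with parameters `center` and `width` are all hypothetical choices.

```python
import numpy as np


def mcar_mask(x, rate, rng):
    # MCAR: every entry is missing with the same probability,
    # independently of the data.
    return rng.random(x.shape) < rate


def mar_mask(x, rng):
    # MAR: column 0 is always observed; the other columns go missing with a
    # probability that depends only on the observed value in column 0.
    prob = 1.0 / (1.0 + np.exp(-x[:, [0]]))
    mask = rng.random(x.shape) < prob
    mask[:, 0] = False
    return mask


def gaussian_self_mask(x, center, width, rng):
    # Self-masking (one special case of MNAR): entry j goes missing with a
    # probability that depends only on its own value x_j.
    prob = np.exp(-0.5 * ((x - center) / width) ** 2)
    return rng.random(x.shape) < prob


rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))
masks = {
    "mcar": mcar_mask(x, 0.3, rng),
    "mar": mar_mask(x, rng),
    "self_masking": gaussian_self_mask(x, 0.0, 1.0, rng),
}
```

A combined mechanism, as asked about above, could be simulated by taking the elementwise OR of the `mar` and `self_masking` masks; whether the method remains reliable in that setting is exactly the open question.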

__ Clarity__: Yes, it is well written.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: Some tiny things:
> I am not sure that I understand lines 46-48.
> Line 106, maybe “= P(M = m)”.
===Update ===
My opinion has not changed. Although the assumptions may make the work limited, I think the paper should be accepted because of the theoretical analysis and the neural network design.

__ Summary and Contributions__: The authors of this paper propose a method for supervised learning with missing covariates. They establish the Bayes predictor for a variety of specific missing data mechanisms, including MNAR mechanisms. They suggest a particular neural network architecture for learning the Bayes predictor and demonstrate its performance empirically.

__ Strengths__: I thought this was a really nice paper. They take a different approach / perspective to missing data than I commonly encounter. I think this is a useful contribution to a very important practical problem that is often not given rigorous treatment.
There are some nice insights about the Bayes predictor in the simple linear setting, which motivate a particular network architecture. Specifically, I think this is a nice first step toward theory and best practice for dealing with missing data in neural network models.
Although I didn't follow every detail, I thought the section comparing their network to the MLP fit on the concatenated data was particularly interesting, as it links common practice to some of the theory the authors are developing.
Update: I really enjoyed this paper and think it should be accepted. The authors' responses don't change my view in this regard. I'm keeping my score.

__ Weaknesses__: The required assumptions for the key propositions are quite strong (Gaussian data, Gaussian self-masking, etc.). How robust are the networks to deviations from Gaussian data? It would be nice to see some robustness checks on non-Gaussian data.

__ Correctness__: To the best of my knowledge, the method appears correct.

__ Clarity__: The paper is very well written and quite clear.
I couldn't really make sense of the "Differences between MNAR and M(C)AR predictors" section. It seems like the authors have some useful insights to share, but, perhaps because of the length limit, I wasn't quite able to identify the main point of this section.

__ Relation to Prior Work__: The prior work is clearly discussed. That said, one difficulty is that this paper references and builds on very recent work (papers on arXiv from 2019/2020) with which I was personally not familiar. Given the recency of the works they reference, it may help to add a few more sentences about the main contributions of the previous work and the extensions in this work.

__ Reproducibility__: Yes

__ Additional Feedback__: