__ Summary and Contributions__: The authors propose a neural network approach to solving Kolmogorov PDEs with parametrically varying coefficients via the Feynman-Kac formulation.

__ Strengths__: The authors provide both empirical and theoretical support for their approach.

__ Weaknesses__: The authors claim to circumvent the curse of dimensionality.

__ Correctness__: Appears so.

__ Clarity__: Sure.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: Some comments:
1) In the Basket put option example, it is said that the SDE is simulated using Euler-Maruyama with 25 equidistant steps. However, since the time interval over which these 25 steps are taken is not reported, it is impossible to judge how accurate the approximate samples are.
2) D is used to denote the domain of the \gamma parameter, but it also denotes some sort of operator (second equation in Thm. 2).
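To make point (1) concrete: with 25 equidistant steps the step size is h = T/25, so the discretization bias of Euler-Maruyama scales with the unreported horizon T. A minimal sketch of the scheme, where the drift, volatility, and horizon are hypothetical placeholders and not taken from the paper:

```python
import numpy as np

def euler_maruyama(x0, mu, sigma, T, n_steps, rng):
    """One Euler-Maruyama path of dX = mu(X) dt + sigma(X) dW on [0, T]."""
    h = T / n_steps  # step size depends on the horizon T, not just on n_steps
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(h), size=x.shape)
        x = x + mu(x) * h + sigma(x) * dw
    return x

# Hypothetical geometric-Brownian-motion coefficients (not from the paper)
rng = np.random.default_rng(0)
mu = lambda x: 0.05 * x
sigma = lambda x: 0.2 * x
xT = euler_maruyama(np.ones(3), mu, sigma, T=1.0, n_steps=25, rng=rng)
```

Doubling T while keeping 25 steps doubles h, and with it the weak error of the scheme, which is why the horizon matters for judging sample accuracy.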

__ Summary and Contributions__: The paper proposes a new neural network based methodology for producing a global solution to a class of parameterised parabolic PDEs, as a function of space, time and the parameters. This is accomplished by producing training data from an SDE related to the PDE through the Feynman-Kac formula, and training a neural network with a multilevel architecture designed to be amenable to the problem. The new approach escapes the curse of dimensionality, in that the computational cost of the algorithm is sub-exponential in the parameter dimension.
The authors test their new algorithm on a set of PDEs with affine linear parameters of increasing dimension to empirically demonstrate this sub-exponential dependence, and provide a set of theoretical guarantees on the accuracy of the global neural network approximation for a particular PDE.
EDIT: I am pleased with the authors' responses and happy with my original assessment.

__ Strengths__: The paper is very strong; it is well written and presents an approach that addresses an existing problem in numerical analysis using state-of-the-art techniques from machine learning. As such, the contribution is clearly significant and novel, and relevant to the NeurIPS community.
The theoretical results provided are sensible if narrow. The empirical evaluation covers a challenging set of test problems that are both realistic, in the sense that both option pricing models remain highly relevant, and challenging in the sense that the parameter dimension is large, so that evaluating prices and derivatives for a large set of parameters would be computationally demanding.

__ Weaknesses__: The authors restrict to affine-linear parameters, but this restriction is not a requirement of the methodology and it is unclear from the text why the restriction was put in place. Is the purpose solely to narrow the scope of the analysis, or is there some performance loss for more general coefficients? It would be useful for the authors to comment in the paper on why this restriction is imposed.
Similarly, the theoretical results proven apply only to the example from Section 3.3 with a specific initial condition. A more general result would be preferable, particularly since the highest-dimensional application the authors have provided is actually for a different initial condition (though, of course, the input dimension for the paraboloid initial condition is still rather high). However, a more general result may not be straightforward to derive.
In the numerical experiments, Figure 3 is rather unclear: the lines are all so close to each other that one can only make out a single line, and while this certainly suggests that the method is accurate, it is impossible to determine the level of accuracy from such a plot. Plotting the error scaled by the standard deviation might be more informative. In addition, the number of input parameters is varied, but not in a systematic way. It would be interesting to see plots of the number of steps taken to achieve a given accuracy vs. the dimension of the parameter space, to more explicitly demonstrate the non-exponential cost of the method as a function of parameter dimension.

__ Correctness__: All of the claims and methodology appear to be correct.

__ Clarity__: The paper is excellently written, very clear and easy to follow.

__ Relation to Prior Work__: The differences from other neural-network based methods for solving these problems, as well as classical numerical methods, are thoroughly discussed.

__ Reproducibility__: Yes

__ Additional Feedback__: I am unsure what the notation after line 134 means (is it supposed to concatenate vectors into a matrix?). Perhaps the authors could clarify?
Perhaps I have missed this, but the scalar n in equation 7 is not explicitly defined; it would be good if this could be clarified.

__ Summary and Contributions__: The submission proposes a new architecture to learn approximate solutions to a parametric family of Kolmogorov PDEs. Theoretical results regarding the ability of ReLU nets to approximate the solution and generalization bounds for the ERM are presented.

__ Strengths__: The paper is clearly written. The theoretical derivations are clean.

__ Weaknesses__: There are two main weaknesses of this work in my opinion. The first is regarding the generation of the ground-truth data, where Feynman–Kac is invoked and an SDE is simulated. While for sections 3.1 and 3.3, the associated SDEs have analytical solutions and can thus be solved precisely, it is unclear how the quality of the data should affect the learned solution in section 3.2. I think the discretization step size and number of samples used to approximate the F-K formula's expectation should both matter for the learning algorithm. It would be useful to conduct ablation studies to see the sensitivity of the algorithm to the data quality.
Additionally, the theoretical results seem somewhat disconnected from the empirically driven algorithm, since the theorems ignore the question of whether gradient-descent-like algorithms such as Adam and SGD can actually find the ERM. Understandably, this is a hard problem. That said, I believe it is imprecise to say that "... and rigorously show that our algorithm does not suffer from the curse of dimensionality" as on line 38. The curse of dimensionality in the PDE context usually refers to the exponential growth in the number of computational operations, and this is not directly comparable to the statement that ERM neural nets have a number of parameters polylog in d, since it is definitely not obvious that the ERM can actually be found with time complexity polylog in d.
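To illustrate the first concern: when labels are generated via the Feynman-Kac formula, each label u(0, x0) ≈ E[phi(X_T)] is itself a Monte Carlo estimate with two distinct error sources, the discretization bias (controlled by the step count) and the statistical error (controlled by the sample count). A minimal sketch, where the dynamics and payoff are hypothetical placeholders rather than those of the paper:

```python
import numpy as np

def fk_label(x0, phi, mu, sigma, T, n_steps, n_samples, rng):
    """Monte Carlo estimate of u(0, x0) = E[phi(X_T)] via Euler-Maruyama paths.

    Label quality depends on BOTH the discretization bias (n_steps)
    and the statistical error (n_samples)."""
    h = T / n_steps
    x = np.full((n_samples, len(x0)), x0, dtype=float)
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(h), size=x.shape)
        x += mu(x) * h + sigma(x) * dw
    return phi(x).mean()

# Hypothetical GBM dynamics and a basket-put-like payoff (illustrative only)
rng = np.random.default_rng(0)
mu = lambda x: 0.05 * x
sigma = lambda x: 0.2 * x
phi = lambda x: np.maximum(1.0 - x.mean(axis=1), 0.0)
u_hat = fk_label(np.full(3, 1.0), phi, mu, sigma,
                 T=1.0, n_steps=25, n_samples=10_000, rng=rng)
```

An ablation over n_steps and n_samples would quantify how each of these two error sources propagates into the learned solution.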

__ Correctness__: The methodology and theoretical proofs are correct, to the best of my knowledge.

__ Clarity__: The paper is well-written in my opinion.

__ Relation to Prior Work__: The relation is discussed adequately in my opinion. Though, it is possible that I am unaware of some works that use NNs to learn PDEs, since I don't actively work in this area.

__ Reproducibility__: Yes

__ Additional Feedback__: Minor
- line 171: "an European" -> "a European"
- appendix 464: "in order to proof" -> "in order to prove"
Post-rebuttal:
I appreciate the authors' efforts in addressing my concerns, and the response has mitigated my concern regarding the curse of dimensionality. I am therefore raising my score to 7.
With regard to data quality vs. accuracy of the PDE solution, my original intention was not to demand a theorem proving convergence (without explicit constants), although it is not bad that we have this guarantee now. In some sense I view this paper as a numerical-methodology paper, and towards justifying that this is a useful method for practitioners, it would be useful to showcase some experimental results (on at least toy examples) that study the dependence of the solution quality on some of the parameters (e.g. time horizon, volume in Euclidean space, ...). These dependences are typically hard to obtain by deriving bounds. Moreover, the obtained bounds are hard to make tight (or to prove tight via lower-bound arguments).
Overall, I think while the paper's scope is somewhat narrow, it's doing a great job in making statements precise and clean.