NeurIPS 2020

### Review 1

Summary and Contributions: I have read the authors' feedback. I really appreciate the time you took to answer us reviewers. Rest assured that I have devoted a substantial amount of time to looking at your paper and response. Unfortunately, the authors' explanations do not add much to what I had already understood, and do not fully settle my concerns. Examples follow. On a few "subjective" matters: I disagree with the authors' response on "technicality" and "graphs", and I strongly suggest they take my advice into consideration in a future revision. In particular, regarding the use of graphs: even if the authors use graph-based software, the explanations do not need them. Examples: I appreciate the motivating example of a PDE with stochastic boundaries. However, the authors should also have reported the training time, and the time to generate the training data. In a nutshell, a few more numbers would have made all the difference. Also, the authors should have addressed the obvious possibility of using methods that can start from the solution for a given $a$ to quickly find the solution for $a'$ close to $a$. I think that being a bit more self-critical would probably have helped the paper as well. I appreciate the promise of the two new examples. The authors should have done such experiments; it would have made all the difference to give a few numbers about their outcome.

======================================

The authors consider the problem of solving a PDE of the kind $L_a u(x) = f(x)$, where $L_a$ is a differential operator that depends on a parameter $a$ (itself a function) and $f$ is an input function. The authors consider this problem in the scenario where $f$ is fixed but we want to solve the PDE very fast over varying "values" of the parameter $a$. The authors also assume that we have access to a training set of pairs $(a(x), u(x))$ that are examples of solutions to the PDE.
To tackle this problem, the authors first express the solution of the PDE as the fixed point of an integral recurrence, and then approximate the integral operator in the recurrence with a neural network, thus expressing the successive applications of the integral recurrence as the composition of several NNs. These NNs are built such that the computation time of the overall map remains tractable. In particular, they express the integral operator as the composition and sum of simpler NNs, each representing operators at a given "scale", i.e. level of detail. The authors compare their method with existing NN-based methods for the same problem, and argue that their method should deliver better performance.
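To make the summary concrete, here is a minimal sketch of the kind of layered update described above, v ← σ(W v + K v). Everything here (the names K, W, sigma, the layer count, and the sizes) is my own illustrative notation, not the authors' code:

```python
import numpy as np

# Hypothetical sketch: the solution is written as a fixed point of an
# integral recurrence; on a discretized domain the integral operator
# becomes a kernel matrix K, composed with a pointwise linear map W
# and a nonlinearity, repeated over several layers.
rng = np.random.default_rng(0)
n, d = 64, 8                     # grid points, channel width

def sigma(x):                    # activation between layers
    return np.maximum(x, 0.0)

# one (learned, here random) kernel matrix and pointwise weight per layer
Ks = [rng.normal(scale=1.0 / n, size=(n, n)) for _ in range(3)]
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)]

v = rng.normal(size=(n, d))      # lifted input a(x) on the grid
for K, W in zip(Ks, Ws):
    # v <- sigma(W v + K v): local term plus (dense) integral term
    v = sigma(v @ W.T + K @ v)

print(v.shape)                   # (64, 8)
```

Note that with a dense K each layer costs O(n^2), which is precisely what the multiscale decomposition is meant to avoid.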

Strengths: The authors' work is a combination of several existing ideas, as they point out. The general idea of replacing integral operators with NNs is interesting. The general idea of expressing this operator as a sum and composition of operators that are sparse/dense and low-rank/high-rank is also interesting.

Weaknesses: There are several important points that need to be addressed in the paper.

First, there is a non-uniform level of detail and technicality throughout the paper. The authors start by trying to be very formal, specifying that functions come from "separable Banach spaces", but quickly drop this rigor and start being vague, mentioning "appropriate function-spaces" and "well-posed"; in the end, the actual method they propose does not need these details, and is often explained using concepts, like "graphs", that are never formally defined. I suggest using the following heuristic: if a concept is not needed to explain how you get to eq. (12) and (13), then do not mention it at all. This will save you space that you should then use to explain what actually matters in more detail.

Second, although I appreciate the comparison with other NN-based methods, it would be important for the authors to illustrate with an example how traditional methods perform. I do understand that traditional methods require recalculations for each new parameter $a$. Nonetheless, the numerical examples studied are not too complex, and I suspect that standard PDE solvers could solve them really fast. It would be good to report these time-to-solution numbers, as well as the accuracy of these classical solvers. This is also relevant for the numerical section, for one to understand how exactly the training set was generated. I assume you used some classical solver to find the training examples, yes? Also, what is the training time? Sorry, I didn't check the Appendix (as a reviewer I don't have to); maybe it is there?! In any case, a few of these numbers should be in the main text. Are you familiar with the Matlab package Chebfun? It can use Chebyshev interpolation (instead of grid-type discretizations like you do at each level) to solve PDEs very accurately, and it is really easy to use. I would be very interested in seeing how well you do in comparison with these types of tools.
They also allow the pre-computation of inverse operators, so, with some tricks, it could even be that as you change $a$, you do not need to recompute much. Third, some of the explanations need to be clarified. I mention a few now.

1) Fig. 1 is pretty but it is not very informative. It is too generic, and it is incomprehensible at the point where it is first mentioned. I suggest that you move it later, to where you explain your V-cycle, Section 3.2, and then also add $\hat{v}$ and $\check{v}$ to the different parts of the diagram. Also, I recommend not pointing the arrows inwards as we get deeper, since it is misleading. For example, in Fig. 1, in the last layer $K_{23}$, $K_{33}$, and $K_{32}$ are in the same graph, while in the upper levels they are not. The whole diagram should just be a stacking of "squares", like below (hope it doesn't get distorted, since it took a while to draw).

    O----->---------O
    |               |
    V               ^
    |               |
    O----->---------O
    |               |
    V               ^
    |               |
    O----->---------O
    |               |
    V               ^
    |               |
    O----->---------O

2) The authors mention graphs several times in the paper but never formally define what they mean. I am convinced that the whole paper can be written without referring to graphs at all, just sums and compositions of sparse operators. However, if the authors do want to mention graphs, they should formally relate the adjacency matrix of these graphs to the sparsity patterns of the kernel matrices. The same goes for the mention of GNNs. I believe the authors can avoid mentioning them at all, and just present the architecture of their NN as is. However, if you do want to mention GNNs, you need to give full details of how your method translates into a GNN, not just say that it does. E.g. from (12) and (13), it is not clear which graph you're using in the GNN. It could either be the metagraph in Fig. 1, which I re-expressed above, or a fully detailed graph where, at each level, you also express K as a graph operation.
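To illustrate the adjacency/sparsity relation I have in mind (the construction below is entirely my own, hypothetical example, not the paper's code): a kernel truncated to the ball $B(x, r)$ yields a kernel matrix whose sparsity pattern is exactly the adjacency matrix of a radius graph on the grid points.

```python
import numpy as np

# Hypothetical sketch: truncating the kernel integral to B(x, r)
# makes the kernel matrix sparse, with the same nonzero pattern as
# the adjacency matrix of a radius graph on the grid points.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(size=16))        # 1D point cloud / grid
r = 0.2                                  # truncation radius

dist = np.abs(x[:, None] - x[None, :])   # pairwise distances
adjacency = dist < r                     # radius-graph adjacency (incl. self-loops)

kernel = np.exp(-dist) * adjacency       # truncated kernel matrix K

# the nonzeros of K coincide with the edges of the graph
assert np.array_equal(kernel > 0, adjacency)
```

Stating this correspondence once would let the paper use "graph" and "sparse kernel matrix" interchangeably without ambiguity.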
Please use a LaTeX Definition environment.

3) The authors need to clarify what they mean by "complexity" in different parts of the paper. Is it the complexity to compute K? The complexity to compute K v? The complexity to compute the full map? Also, the authors need to give more detail on how these complexities are computed when more, or fewer, "tricks" are used. For example, at some point the authors mention that the domain of integration in eq. (3) is truncated to a ball. This corresponds to sparsifying K before computing K v. Does this truncation happen at every level of the downward and upward pass? I.e., do we use this truncation in all matrix multiplications in eq. (12) and eq. (13)? Also, in addition to this truncation, you use a Nyström approximation, which corresponds to writing K as a product. How exactly does this affect computation time? When explaining these things, always compare what would happen if, e.g., I computed eq. (8) without any "tricks" versus with the sparsifying "tricks". Be specific, be rigorous. Here it matters; when mentioning Banach spaces it doesn't.

4) At some point there is a disconnect in the flow of ideas guiding the development of the heuristic. You start by writing u(x) as the solution of eq. (2), fine. Then you say that the r.h.s. of eq. (2) can be approximated by the action of a kernel as in eq. (3). Here I already see a problem: if u(x) = 0 on the r.h.s. of eq. (2) I still get something on the l.h.s. of eq. (2), but if u(x) = 0 on the r.h.s. of eq. (3) I also get zero on the l.h.s. of eq. (3). Then you seem to compensate for the difference between eq. (3) and eq. (2) by introducing a matrix W. However, I still do not see how W can capture the term with f(x) in eq. (2). I also do not understand why the effect of W cannot be captured by $\kappa_a$. Then you extend the dimension of the problem and introduce activation functions.
At this point, since you're learning $\kappa_a$ and W, you are no longer using any knowledge of the fact that you started from eq. (2). You could have started with eq. (4) directly, and things would still make sense. What's the point of the journey then? Then you explain in eq. (5) how the discretization of the domain affects the calculation of eq. (4). Then you introduce the decomposition in eq. (8) as a matrix decomposition, and then clarify that these are actually tensors. Finally, you do not use eq. (8) at all as it is, but stick an activation function between every matrix multiplication/sum and offer eq. (12) and (13) as your actual algorithm. Here again I see no reason why we need W, as it can be absorbed into $K_{ll}$. While this story is going on, you also mention here and there some truncations via B(x,r), graphs, domain discretization, and Nyström sampling, without ever being very formal about these things.

5) Many solvers are able to guarantee that some nice mathematical properties are preserved, for example that we do not lose mass, or charge, when solving physics-related continuous PDEs via methods that are inherently discrete; they use symplectic integrators, etc. How does learning $F^\dagger$ behave in this regard? Is it possible to do the training such that some nice conservation properties are kept? Can you illustrate, at least numerically, how well, or how poorly, we conserve certain properties in, e.g., Hamiltonian systems?
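As a concrete illustration of the complexity question in point 3 (the sizes and the low-rank factorization below are my own hypothetical choices, standing in for a Nyström-style approximation; nothing here is taken from the paper):

```python
import numpy as np

# Illustrative comparison of the cost of applying K to v, with and
# without "tricks", for n grid points:
#   dense:           K v costs n^2 multiplications
#   low-rank, rank m (Nystrom-style K ~ U V^T):
#                    U (V^T v) costs 2 n m multiplications
n, m = 512, 16
rng = np.random.default_rng(2)
v = rng.normal(size=n)

K = rng.normal(size=(n, n))        # dense kernel matrix
U = rng.normal(size=(n, m))        # hypothetical low-rank factors
V = rng.normal(size=(n, m))

dense = K @ v                      # n^2 = 262144 multiplications
low_rank = U @ (V.T @ v)           # 2*n*m = 16384 multiplications
# note: applied in this order, U V^T is never formed explicitly

assert dense.shape == low_rank.shape == (n,)
print(n * n, 2 * n * m)
```

Spelling out exactly this kind of count, for each of eq. (8), (12), and (13), with and without the truncation and the Nyström factorization, is what I am asking for.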

Correctness: The authors propose a heuristic to solve PDEs. It is based on interesting assumptions, but, as a heuristic, it cannot be said to be correct or incorrect. The authors do show, numerically, that it seems to produce good-quality solutions.

Clarity: The language is clear. The level of detail with which different concepts are explained seems somewhat arbitrary: sometimes very formal about things that are irrelevant, other times omitting details about what actually matters. See above.

Relation to Prior Work: They do mention other NN-based methods to solve PDEs and ODEs. It would be good to expand a bit more on current, more "classical" methods. Even if it turns out that current methods are actually much faster than what the authors are introducing, the idea of replacing parts of numerical packages with non-linear functions learned from data is interesting.

Reproducibility: Yes

### Review 2

Summary and Contributions: The authors propose a generalization of graph neural networks that mimics the hierarchical construction of multipole approximations. The idea is sound, but the technique is only demonstrated on some simple 1D PDEs.

Strengths: The idea is sound, although I suspect limited novelty. There are insufficient benchmarks to demonstrate the value of the approach; I discuss this in the following section.

Weaknesses: This looks like many existing works which have more carefully demonstrated their usefulness, e.g. MgNet from Jinchao Xu's group; superficially, this isn't so different from graph convolutions with max-pooling to coarsen hierarchically. I'm curious whether this could simply be implemented as such a network with skip connections to get an equivalent architecture. In any event, the only metrics shown to gauge the strengths of the method are for solving 1D PDEs, and the results are remarkably poor (>1% error for all methods considered, including this one). Several ML methods perform better than this (PINNs, the Deep Ritz method, etc.), and if one were to use a traditional FMM-based PDE solver one could get orders-of-magnitude better answers. Of course, this differentiable generalization may be applicable in ML contexts; however, the authors structured the paper claiming to impact the solution of PDEs, and the results were in my opinion not sufficiently remarkable to merit publication. I will note that I think the research direction pursued here is a solid one and the authors should continue to develop this idea, but hopefully generate some more compelling results for the next publication.
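To sketch what I mean by the pooling-with-skip-connections alternative (a toy construction of mine, using average pooling for simplicity; nothing here is from MgNet or the submission):

```python
import numpy as np

# Hypothetical sketch: hierarchical coarsening by pooling, refinement
# by unpooling, and a skip connection back to the fine grid. Real
# versions would interleave learned (graph) convolutions at each level.

def pool2(v):                    # coarsen: average adjacent pairs
    return 0.5 * (v[0::2] + v[1::2])

def unpool2(v):                  # refine: repeat each entry
    return np.repeat(v, 2)

v = np.arange(8, dtype=float)    # signal on the fine grid
coarse = pool2(pool2(v))         # two levels of coarsening
up = unpool2(unpool2(coarse))    # back to the fine grid
out = v + up                     # skip connection at the fine level

print(out.shape)                 # (8,)
```

My question is whether the proposed V-cycle, once written out, reduces to exactly this pattern with particular choices of the per-level operators.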

Correctness: As far as I can tell yes.

Clarity: Relatively clear. There are some technical details in the first half of the paper where multipole approximations aren't well represented. E.g., "Based on the insight that long-range interaction are smooth, FMM decomposes the kernel matrix into different ranges and hierarchically imposes low rank structures to the long-range components" is not accurate: long-range interactions are generally *not* smooth, and the original FMM was developed for N-body problems specifically, where one can exploit regularity. While this is nitpicky, it has consequences later on, where the method is used to solve Burgers' equation, a hyperbolic problem with shocks where long-range correlations are not smooth.

Relation to Prior Work: The authors mention some similar works but miss the reference I mentioned previously, which does a much better job demonstrating the usefulness/value of these types of hierarchical approaches.

Reproducibility: Yes

### Review 3

Summary and Contributions: This paper proposes a new graph neural network model based on the multipole graph neural operator. This model allows a multiscale analysis of the graph and thus tackles one of the main drawbacks of graph neural networks. Moreover, in order to have a distance for edges, they use kernels, thus allowing further extensions.

Strengths: The main contribution is the multipole operator, which permits a multiscale view of the graph. The V-cycle model then shows some similarities with the classical message-passing scheme used in GCNs, and in fact extends this common model. The multiscale property of the multipole operator tackles one strong issue of GCNs, as it permits involving both local and global parts of the graph when learning.
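For reference, the classical message-passing/GCN update I am comparing against can be sketched as follows (my own minimal notation, not the paper's; the graph and weights are random placeholders):

```python
import numpy as np

# Minimal GCN-style layer: each node aggregates its neighbors'
# features through a degree-normalized adjacency, then applies a
# shared linear map and a nonlinearity.
rng = np.random.default_rng(3)
n, d = 6, 4
A = (rng.uniform(size=(n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)                       # make it undirected
A = A + np.eye(n)                            # add self-loops
D_inv = np.diag(1.0 / A.sum(axis=1))         # degree normalization

H = rng.normal(size=(n, d))                  # node features
W = rng.normal(size=(d, d))                  # shared weight matrix

H_next = np.maximum(D_inv @ A @ H @ W, 0.0)  # one message-passing layer
print(H_next.shape)                          # (6, 4)
```

Such an update only mixes information between immediate neighbors per layer, which is exactly the locality limitation the multiscale construction addresses.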

Weaknesses: The behavior of the new model on large, complex graphs is unclear. The experiments are not run on classical datasets like Cora and others.

Correctness: The method seems correct.

Clarity: The paper is mostly easy to read. However, there is a little confusion about what is really learned in the model, especially regarding the kernel part.

Relation to Prior Work: The relation to previous work is sufficient.

Reproducibility: Yes

Additional Feedback: There is a little confusion about the learned part. The authors talk about the Nyström method to deal with kernels, but in the proposed code there is no kernel computation. Please be more precise on this part. Secondly, perhaps the factorized equations would give some insight into the method compared to classical graph neural networks.

POST REBUTTAL: Since this work especially focuses on PDEs, it is not ready yet for the GNN community (it needs comparisons against other GNNs). Furthermore, since PDEs are the main problem to solve, we need comparisons against PDE solvers.