NeurIPS 2020

Hypersolvers: Toward Fast Continuous-Depth Models

Review 1

Summary and Contributions: In this paper, the authors propose hypersolvers. Given a solution to a neural ODE computed by a solver with low tolerance, another neural network is trained to approximate the residual, thereby reducing the truncation error. The proposed idea seems interesting and, to the best of my knowledge, novel. The authors empirically demonstrate an improvement in the Pareto frontier of accuracy versus computational efficiency over existing solvers. **post-rebuttal**: I have read the authors' rebuttal and would like to keep my rating.

Strengths: I believe this work will be of interest to the community, and the idea seems interesting and novel.

Weaknesses: The experimental study does not seem strong enough. I am curious why the authors only conduct experiments on FFJORD sampling, but not on likelihood evaluation. Furthermore, the different aspects discussed in Section 6 are interesting, but the paper would be stronger if the authors explored those ideas further.

Correctness: The claims and method look correct to me.

Clarity: The paper is well organized, clearly written and easy to read.

Relation to Prior Work: The authors sufficiently discussed the related work and how this work differs from them.

Reproducibility: Yes

Additional Feedback: I was wondering how deep the proposed method can go, e.g., a hyper-hypersolver that uses another network to fit the residual of the hypersolver. Might this lead to progressive training of an ensemble of networks? Typo: line 228 on page 8, Fig 6 should be Fig 7?

Review 2

Summary and Contributions: This paper proposes to use a neural net to learn the high-order terms in the local Taylor expansion of a variable governed by an ODE, and utilizes this learned information to speed up neural ODE inference.
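The idea summarized above can be illustrated with a minimal sketch. It assumes a toy linear vector field and an update of the form z_{k+1} = z_k + eps * f + eps^2 * g for an Euler base solver; the function `g_standin` is a hypothetical placeholder that returns the leading Taylor term of the Euler residual in closed form, standing in for the trained network. It is not the authors' implementation.

```python
import math

LAM = -1.0  # toy linear vector field f(s, z) = LAM * z (assumption for illustration)


def f(s, z):
    return LAM * z


def g_standin(eps, s, z):
    # Hypothetical stand-in for the trained correction network:
    # the leading term of the Euler residual, z''(s) / 2 = LAM**2 * z / 2.
    return 0.5 * LAM ** 2 * z


def euler(z, eps, n_steps):
    s = 0.0
    for _ in range(n_steps):
        z = z + eps * f(s, z)
        s += eps
    return z


def hyper_euler(z, eps, n_steps):
    s = 0.0
    for _ in range(n_steps):
        # Base Euler update plus the eps^2-scaled correction term.
        z = z + eps * f(s, z) + eps ** 2 * g_standin(eps, s, z)
        s += eps
    return z


n_steps = 10
eps = 1.0 / n_steps
exact = math.exp(LAM * 1.0)  # z(1) for z(0) = 1

err_euler = abs(euler(1.0, eps, n_steps) - exact)
err_hyper = abs(hyper_euler(1.0, eps, n_steps) - exact)
print(f"Euler error:            {err_euler:.2e}")
print(f"Corrected Euler error:  {err_hyper:.2e}")
```

In this toy setting the corrected step uses one vector-field evaluation per step, like plain Euler, while the error shrinks because the second-order term is supplied by the correction.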

Strengths: The paper proposes a simple idea that seems to be effective. The problem of neural ODE inference is important but rather less studied by the community.

Weaknesses: Certain aspects of the method are not showcased as extensively as they should be. I detail these in the Additional Feedback section of this review.

Correctness: Claims and methods are correct to the best of my knowledge.

Clarity: The paper is well-written to the best of my knowledge.

Relation to Prior Work: Relation is clear to the best of my knowledge.

Reproducibility: Yes

Additional Feedback:
- An important aspect of inference is the actual time and memory cost. While this obviously depends on the computer architecture, hardware, and software implementation, the overall message of inference speedup would be more convincing if at least some statistics/plots were given on this (e.g. on a computer w/ 12 CPU cores and an Nvidia V100 with PyTorch version 1.2.0).
- Implicitly, there is a hyper-parameter to be tuned, namely the tolerance level used to obtain training data for the high-order term neural net. How sensitive is the method w.r.t. this parameter?
- In the top right plot of Figure 5, the xticks and xlabel seem to overlap, making the plot hard to read.
Post-rebuttal: I am satisfied with the additional experiments demonstrating the speed advantage of this method in practice, and am therefore raising my score to a 7.

Review 3

Summary and Contributions: The paper presents a method to increase solver accuracy in Neural ODE models by fitting a residual neural network for a given numerical method. Thus, the online computational cost of the numerical solver can be shifted to an offline neural network training phase. The quality of the achieved numerical fit is evaluated.

Strengths: The paper highlights an important avenue for further research in Neural ODEs.

Weaknesses: There are two major weaknesses with this submission: 1.) The presented methodology boils down to the following idea: pre-compute the necessary numerical methods offline, feed the results into a fast-to-evaluate interpolator, and use the interpolator in deployment for faster prediction times. Stated this broadly, the idea is not novel and has many names in different communities, e.g., surrogate models, white-box emulation and similar. Thus, the paper would not need to convince me that this is generally a good idea, but rather why /this particular interpolation of the paper/ is better suited than the many alternatives, e.g., simply precomputing a mesh of possible IVPs with high accuracy and training g : x |-> z(s_K) directly. The discussion in Sect. 6 tends towards this rationale, but is too short and superficial to be of substantial benefit. 2.) In particular, the authors present a rationale for supposedly faster evaluation times using Hypersolvers, but there are no experiments whatsoever showing what sort of hypersolver-training time/accuracy tradeoff is required.

Correctness: The theoretical claims seem to be correct, but I have only checked them superficially. The paper's discussion and experiments fail to address why this particular form of numerical pre-computation and interpolation is beneficial. The experiments that are presented are reasonable, but do not support the main story of the paper.

Clarity: The paper is well written.

Relation to Prior Work: The connections to other work in Neural ODEs are adequately addressed. The connection to surrogate models is entirely absent. Some examples include:

Reproducibility: Yes

Additional Feedback: Post-rebuttal update: Unfortunately, I have to report that the authors' comments do not change my evaluation, as I have the impression that they miss my point. I fully agree with the authors that Sect. 6 contains interesting hypotheses that I would like to see published at NeurIPS, if the hypotheses are proven to be correct and the authors can get the methods to work. So far, they are merely thought-provoking speculations and thus do not by themselves support publication at this point. Secondly, I feel that the points "Scope" and "Alternative Approaches" cannot be separated in this case. I also agree with the authors that they do not claim to solve a more general problem, and that a clever idea for a tailored recipe in the context of Neural ODEs could be interesting. However, I remain unconvinced that *this particular solution* is specifically beneficial in *this particular context*. Especially so, as I cannot make a comparison with other methods (e.g., directly predicting the solution at the final time) from the experiments. Doubly so, as fixed-step methods are not ODE solvers, i.e., there are no checks whatsoever within the solver for any sort of numerical accuracy. If I have understood the paper correctly, the training data for the Hypersolver was generated with the same solver and configuration that was later used for testing. This is not a problem per se, but it is not obvious why the solution from this solver-configuration pair should be the gold standard in the application context. I have the impression that the metric punishes the Midpoint and other methods for not being Dopri5 at RelTol/AbsTol 1e-5. Furthermore, I want to highlight something else about the proposed method: it does *not necessarily* learn *the next higher-order* error term. It really learns *the difference* between any two numerical methods.
If the authors had trained the Hypersolver on Dopri5 at RelTol/AbsTol 1e-10, they probably would have gotten a numerical method that behaves like Dopri5 at RelTol/AbsTol 1e-10. The fact that Euler's method is still applied only acts as a deterministic pre-computation per step that the Hypersolver can correct for, given a large enough model class (and, as I understand the experiments, it also does). This is the particular reason why I would like to see experiments that compare with a) simply predicting the next step (without additional Euler computations) or b) simply predicting the final solution. I agree with the authors that there are probably interesting accuracy/complexity trade-offs, but from the current set of experiments, I am unable to judge them. I hope the following more concrete suggestions help the authors in their future submissions: 1) The authors need to show that the effect of the Hypersolver is not due to overfitting to solver-configuration pairs. In particular, for testing, either a different solver with the same tolerance settings or the same solver with much lower tolerance settings should be used. 2) In this context, also test the effect of combining a hypersolver with *different* numerical schemes of the *same* order (e.g., Ralston's method and Heun's method). If the positive effect of the Hypersolver really is due to predicting the next leading-order term, this should work reasonably well. 2b) But also try to simply regress z(s_{k+1}) from z(s_k) and 2c) z(s_K) from z(s_0). All of these different approaches should work in general, and the interesting question is whether any of them shows a particularly beneficial accuracy/complexity trade-off. 3) Or, the authors could focus on detailing the ideas presented in Sect. 6 (possibly a paper each). I personally believe that these would have a much higher impact on the community.
I hope this clarification helped in understanding my criticism and also helps the authors to improve their manuscript for future submissions.
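To make suggestion 2) concrete, the following is a minimal sketch of Heun's and Ralston's methods written as two members of the same two-stage explicit Runge-Kutta family. The test problem z' = -z^2 with z(0) = 1 and the helper names `rk2_step` and `integrate` are assumptions chosen for illustration; nothing here is taken from the paper's experiments.

```python
def rk2_step(f, s, z, eps, alpha):
    """One step of a second-order, two-stage explicit Runge-Kutta scheme.

    alpha = 1 gives Heun's method, alpha = 2/3 gives Ralston's method.
    """
    k1 = f(s, z)
    k2 = f(s + alpha * eps, z + alpha * eps * k1)
    w2 = 1.0 / (2.0 * alpha)  # weight of the second stage
    return z + eps * ((1.0 - w2) * k1 + w2 * k2)


def integrate(f, z0, t_end, n_steps, alpha):
    eps = t_end / n_steps
    s, z = 0.0, z0
    for _ in range(n_steps):
        z = rk2_step(f, s, z, eps, alpha)
        s += eps
    return z


def f(s, z):
    # Toy nonlinear vector field (assumption); exact solution z(s) = 1 / (1 + s).
    return -z * z


EXACT = 0.5  # z(1)
SCHEMES = {"heun": 1.0, "ralston": 2.0 / 3.0}

# Global error at s = 1 for 10 and 20 steps, per scheme.
errors = {
    name: [abs(integrate(f, 1.0, 1.0, n, alpha) - EXACT) for n in (10, 20)]
    for name, alpha in SCHEMES.items()
}
ratios = {name: e[0] / e[1] for name, e in errors.items()}

for name in SCHEMES:
    print(name, errors[name], "error ratio when halving eps:", ratios[name])
```

Both schemes reduce the error by roughly a factor of four when the step size is halved, as expected for second-order methods, yet their errors differ on a nonlinear problem. A correction trained against one scheme is therefore not guaranteed to match the residual of the other, which is what the suggested cross-test would probe.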

Review 4

Summary and Contributions: The authors propose to refine classic time stepping schemes for solving ODEs by modelling the local error with a network.

Strengths: It is empirically shown that the proposed method indeed improves the accuracy of Euler's and Heun's methods.

Weaknesses: It is not clear how the authors define the number of function evaluations (NFE), which is the metric used to claim solver speedups. Thus, it is not clear how much is gained in absolute terms. I elaborate on this issue below.

Correctness: Seems so.

Clarity: It is fairly well written.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Some comments: 1) In footnote 2: single-step and explicit are not synonymous, e.g. implicit Euler is a single-step method but not explicit. I guess "single-step or explicit" should be "explicit single-step". 2) Examining the data fitting criterion \ell, it is quite clear that the minimizer is given by g(e,x,s_k,z(s_k)) = - 1/e^{p+1}*( z(s_{k+1}) - z(s_k) - e*\psi ). Consequently, the hypersolver in (4) would reduce to z_{k+1} = z(s_{k+1}). So why not just train a NN to map (x,s) to z(s)? This would also reduce the input dimension. 3) \ell_{local} is not defined anywhere. 4) The number of function evaluations (NFE) is not defined. Do you mean the number of evaluations of the vector field f? In that case, this might be misleading, since these hypersolvers, in addition to using function evaluations for the underlying classic ODE solver, also have to evaluate the NN-correction. That is, this would only be a viable metric of computational complexity if the NN-correction is very cheap to compute in comparison to the vector field. Is this the case? 5) It is claimed that the hypersolver is able to generalise to different step-sizes. It is not obvious where the support for this claim can be found. Nonetheless, it makes sense in view of my comment 2).
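The accounting concern in comment 4) can be sketched as follows. The wrapper `Counted`, the placeholder correction `g`, and the relative-cost parameter `rel_cost` are hypothetical names introduced for illustration; the sketch only shows how counting vector-field evaluations alone can hide the cost of the correction, and makes no claim about the actual costs in the paper.

```python
class Counted:
    """Wrap a callable and count how often it is invoked."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def heun(f, z, s, eps, n_steps):
    # Two vector-field evaluations per step.
    for _ in range(n_steps):
        k1 = f(s, z)
        k2 = f(s + eps, z + eps * k1)
        z = z + 0.5 * eps * (k1 + k2)
        s += eps
    return z


def hyper_euler(f, g, z, s, eps, n_steps):
    # One vector-field evaluation plus one correction evaluation per step.
    for _ in range(n_steps):
        z = z + eps * f(s, z) + eps ** 2 * g(eps, s, z)
        s += eps
    return z


def f(s, z):
    return -z  # toy vector field (assumption)


def g(eps, s, z):
    return 0.5 * z  # hypothetical stand-in for the trained correction network


def total_cost(f_calls, g_calls, rel_cost):
    """Cost in units of one f evaluation; rel_cost = cost(g) / cost(f)."""
    return f_calls + rel_cost * g_calls


N_STEPS, EPS = 10, 0.1

f_heun = Counted(f)
heun(f_heun, 1.0, 0.0, EPS, N_STEPS)

f_hyper, g_hyper = Counted(f), Counted(g)
hyper_euler(f_hyper, g_hyper, 1.0, 0.0, EPS, N_STEPS)

print("Heun NFE (f only):", f_heun.calls)
print("Hyper-Euler NFE (f only):", f_hyper.calls)
for rel_cost in (0.1, 1.0):
    cost = total_cost(f_hyper.calls, g_hyper.calls, rel_cost)
    print(f"Hyper-Euler total cost at rel_cost={rel_cost}: {cost}")
```

Counting only f, the corrected Euler scheme looks half as expensive as Heun's method. If the correction costs as much as the vector field, the two are on par, so the NFE metric is informative only when the relative cost of the correction is reported alongside it.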