Review for NeurIPS paper: Learning Differential Equations that are Easy to Solve

NeurIPS 2020

Learning Differential Equations that are Easy to Solve

Review 1

Summary and Contributions: This paper presents a regularization scheme for neural ordinary differential equations (ODE) that favors easier-to-solve dynamics. Given a k'th order ODE solver, the authors propose to augment the loss term with an integral over k'th-order state derivatives (or its squared norm). The model is tested on classification, time-series modeling, and continuous-time normalizing flow tasks, where the number of function evaluations is significantly decreased while retaining a comparable test performance. The authors also describe how to efficiently compute higher-order derivatives with Taylor mode auto differentiation (AD), and provide an open-source implementation.

Strengths: As claimed, the proposed regularization scheme leads to simpler vector fields so that integration with adaptive step ODE solvers takes a smaller amount of time. The integral needed to compute the regularization term is computed in parallel with the ODE system, which is pretty convenient. The experiments are quite extensive, all applications of neural ODEs are covered.

Weaknesses: - The contributions in this paper are twofold: (i) The new regularization term, and (ii) an efficient implementation of it. Despite being different from eq. (1), augmenting the loss with an extra penalty term is already proposed in Finlay et al (2020). Similarly, Taylor mode AD was described in [1]. As such, I doubt the paper would be a significant contribution to neural ODE literature. - As pointed out by the authors, presented method does not improve the training time, which is usually the bottleneck in generative models. In fact, looking at Table 2, second to the last RNODE row balances the train time/NFE and log likelihood the best. An additional ~19 hours of training to reduce the test NFE is not a sensible trade-off to me. Finally, not surprisingly, both RNODE and TayNODE do a better job at minimizing respective objectives (R_2 and B/K). - The results on time series modeling are not so impressive. Vector fields in Figure 4 look rather similar compared to, e.g., those in Figure 1. Also, 8% performance drop to save compute time *only* in test mode is not so preferable. [1] Bettencourt, Jesse, Matthew J. Johnson, and David Duvenaud. "Taylor-Mode Automatic Differentiation for Higher-Order Derivatives in JAX." (2019).

Correctness: The claims and the experiments go hand in hand. I did not find any flaw in the experiment setup. My only comment is that looking at Figure 6, it's indeed that setting K=2 when the integrator is second order gives the best tradeoff. But otherwise, there is no apparent tie between the order of regularization and solver, which contradicts with lines 99-100.

Clarity: The paper is well-written and organized.

Relation to Prior Work: Overall, the related work section is well-written. As mentioned before, a citation to [1] would be nice.

Reproducibility: Yes

Additional Feedback: I decided to increase my score after reading the rebuttal and discussing with other reviewers. - The authors and I are in agreement with the drawbacks of the increased training time. It's indeed that test execution time drops if this methodology is applied. Although, personally, I'm not aware of any real-time test deployment of neural ODEs where the execution time is crucial, I give credit to the authors in that. - I now better see the distinctions between this work and Finley et al. 2020. Although the starting point of Finley et al. 2020 is different from this paper (optimal transport perspective), the introduced recipe leads to simpler (possibly non-stiff) dynamics, very much like in this paper. What's more, the presented results still does not improve much upon Finley et al 2020; I'm not sure which method I'd choose to penalize a NODE vector field. I'd definitely like to see some mesaure of error uncertainty since the losses achieved by vanilla NODE, RNODE and TayNODE are very close (particularly in the tables in the appendix). - I guess I wasn't clear in what I said. Essentially, my message is that the regularized vector field in Fig.4 looks as complicated as the unregularized one in Fig.1, which demonstrates the benefits of the proposed method in time-series models is limited.

Review 2

Summary and Contributions: The paper proposes a new regularization technique in Neural ODE models that make the black-box dynamics easier to solve. A thorough empirical evaluation is presented to assess the benefits and pitfalls of this new regularizer. A heads-up: I wasn't able to open the Supplementary PDF on my computer (a Windows machine). Post-rebuttal update: I'm glad that the authors appreciate my suggestions. I'm looking forward to read the revised version. My score for now remains unchanged high.

Strengths: The paper addresses an important need when training Neural ODEs models. The paper is very critical about its own methodology and thoroughly investigates the benefits and pitfalls.

Weaknesses: Some parts of the presentation are kept short borderline of being too short to be complete. However, I hope to provide concrete constructive suggestions below.

Correctness: The theoretical contributions are presented as heuristics, but the rationale is made clear. The empirical methodology seems exhaustive and appropriate.

Clarity: Clarity is the reason for my low score. Presentation is typically highly subjective, so I do understand if the authors or the other reviewers disagree with some or all points and I do not consider any of them to be sufficient reasons for rejection. This list will be sorted chronologically, not according to severity. l. 47: throughout the intro, the authors often use qualifiers for their claims except here. Often these qualifiers are not needed (e.g., line 30), but here I think a qualifier is necessary, as error bounds and actual error can be quite different things. Paragraph at l. 64: The paragraph does not mention that the higher-order is achieved by matching the numerical and the analytical derivatives of the solution at t+h. For uninitiated readers, this might be a crucial omission. l. 77: Here, advantages of Neural ODEs are mentioned whereas throughout the rest of the text, the authors are on the defense for their general applicability. I propose to move the citations of Sect. 8 here, delete Sect. 8 and worry less about whether people question the usage of Neural ODEs in general. (Also, in the broader impact section, please remove the word "better".) l. 130: the Taylor mode AD makes or breaks the regularization technique in practice at the moment, yet only half a paragraph and a book chapter as citation are given. Me and probably many readers would like to learn more about Taylor mode AD here. Also, I probably should have worked out some higher-order derivatives in my head to understand Table 1, but in the limited time, the examples didn't really improve my understanding. Fig. 4: as you stated earlier, there are degrees of freedom to be expected in the vector field, so plotting two examples next to each other without any performance plots is leaving much to the imagination. For example, in (b), the sharp turn in the area z_1 approx 0, z_0 < 0 looks like it might induce stiffness issues. Sect. 5 and later: the submission seems to awkwardly justify the higher computation time, in particular since NFEs are so strongly emphasized in the beginning. I would suggest to be more explicit once in the beginning and then not to worry about it for the rest of the manuscript too much. Sect. 6.4: I think I was able to understand this bit plus Fig. 8 (a), but this is probably a bit too dense. How are the y-values computed in Fig. 8 (a)?

Relation to Prior Work: I think a connection to https://openreview.net/forum?id=B1e9Y2NYvS should be made, otherwise it looks okay.

Reproducibility: Yes

Additional Feedback: Do you think that your heuristic might also affect the robustness of Neural ODEs? Cf. https://openreview.net/forum?id=B1e9Y2NYvS and https://arxiv.org/abs/2006.05749 Fig. 5 (b): it looks like there is a large stochastic effect based on the scatter. Maybe the authors should comment how sensitive all the experiments are on stochasticity overall (from the rest of the plots I get the impression that effects are mostly robust) Fig. 6 (d): did you make an analysis what the distribution of chosen orders have been in the adaptive order method?

Review 3

Summary and Contributions: This paper proposes a new training method for neural ODEs to improve the test performance by reducing the number of function evaluations (NFE) by adding the norm of higher order derivative as a penalty term to the original objective function.

Strengths: The idea of training a neural ODE so that a solver can easily solve it is novel and interesting. The goal of reducing the computational cost without losing solution quality is very practical.

Weaknesses: The proposed method does not seem mature enough. The merits of the proposed regularization are not clear in the experiments, and not convincing enough.

Correctness: Maybe yes.

Clarity: This paper could be written better. There are unnecessarily many subsections.

Relation to Prior Work: Yes. The difference from other works is relatively clear.

Reproducibility: Yes

Additional Feedback: The idea of learning neural ODE that can be solved easily is interesting. While this paper focuses on reducing the test time, I wonder if there are some other advantages of suppressing higher-order derivatives for some applications.

Review 4

Summary and Contributions: This paper considers the problem of fitting an ODE to data when the vector field is parametrised by a neural network. The main contribution is to regularise the data fitting criterion in a manner which biases the estimate towards a vector field that makes the underlying ODE well-behaved.

Strengths: The authors discuss the trade-off between many important quantities such as the number vector field evaluations, degree of regularisation, unregularised loss, and computation time at training versus test time.

Weaknesses: The relation between the present approach and prior literature on function approximation via Gaussian processes / Kernel interpolation is left completely explored. I elaborate on this in the comments below.

Correctness: Appears so.

Clarity: Well enough.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Some comments: 1) The regulariser R_K is precisely the square RKHS norm for (K-1)-times integrated Wiener processes on [t_0,t_1 starting at zero [Sec 10, 1]. From this perspective the regulariser enforces the resulting ODE solution to have a small norm in this RKHS. However, pinning the Wiener process is known to give poor approximation properties of smooth functions wherefore a "released version" is often preferred [Sec 10, 1]. I think the paper would benefit from a discussion on whether this is applicable to the present approach as well. 2) R_K is also the Onsager--Machlup functional = "prior density" functional associated with (K-1)-times integrated Wiener processes starting at zero [2]. Consequently it seems like the regularised objective in Eq. (2) could have an interpretation of some kind of constrained maximum a posteriori estimate. Is this the case? 3) In Sec 6.3 it is mentioned that while evaluation at test time is generally faster with the present approach, the training time is in general slower. I think this could have been explored in more detail. That is, how many tests per train do I need to perform for the present approach to offer an advantage? [1] A. W. van der Vaart and J. H. van Zanten "Reproducing kernel Hilbert spaces Gaussian priors" (2008). [2] D. A. Dutra, B. O. S. Teixeira, and L. A. Aguirre "Maximum a posteriori state path estimation: Discretization limits and their interpretation".