Review for NeurIPS paper: Interpolation Technique to Speed Up Gradients Propagation in Neural ODEs

NeurIPS 2020

Interpolation Technique to Speed Up Gradients Propagation in Neural ODEs

Review 1

Summary and Contributions: In this paper, the authors propose an alternative to the adjoint method that uses the barycentric Lagrange interpolation to approximate intermediate activations. The authors also give a theoretical analysis of the gradient error bound. The proposed method is applied to density estimation and classification tasks.

Strengths: The slowness has been a major road blocker for wider applications of neural ODE. I think this paper is relevant to the community. The application of the barycentric Lagrange interpolation seems novel. The theoretical analysis also gives a guarantee of the error in the gradients.

Weaknesses: The experimental study seems weak. The datasets considered for the density estimation task are only synthetic datasets. If would be helpful to see the performance on at least one real data, such as MNIST.

Correctness: The theorectical conclusion seems correct. The the empirical methodology is correct.

Clarity: The paper is well motivated and clearly written.

Relation to Prior Work: The authors presented and discussed related methods in detail in Figure 1.

Reproducibility: Yes

Additional Feedback: Besides faster model training, are there other advantages of the proposed model compared to the adjoint method? My understanding is that by storing intermediate activations, the stability of backpropagation should also be improved. Are there any ways to demonstrate this, especially when the time span is long?

Review 2

Summary and Contributions: The paper discusses different computation strategies to compute the gradients in Neural ODE blocks and presents a method to better trade-off memory, speed and accuracy in the adjoint computation using Barycentric Lagrange Interpolation. A theorem regarding error bounds is presented and empirical analysis is included. As a heads-up to all involved parties: I have seen a similar submission during the ICML 2020 review process which I assume is an earlier version of this work.

Strengths: The problem is well motivated and clearly illustrated besides English style and grammar issues. I particularly commend Sect. 2 (with minor caveats detailed below) and Fig. 1. I agree with the authors that it is a very important problem in Neural ODEs which would be a strong argument for me to potentially accept the submission.

Weaknesses: I take issues with two aspects of this submission that lead me to recommend rejection at this point. 1.) The submission points out that the evaluations of z(t) at the Chebyshev grid points can be obtained without additional cost, e.g., at line 108. While this is true in some sense in general, there are many numerical theory aspects to this claim that are ignored here, both in the text as well as in the code. Runge-Kutta methods only guarantee high-order approximations at their own grid points. If high-order approximations are sought at pre-defined grid points, there are two solutions: a) the solvers are forced to include the pre-defined grid points as part of the otherwise adaptive mesh or b) a particular choice has to be made to find a smooth-interpolant Runge-Kutta formula. Neither solution is mentioned in the submission, nor applied in the code (I have checked), but smooth-interpolants exist for the DOPRI5 pair, e.g., https://www.sciencedirect.com/science/article/pii/0377042790901989 This in itself would not be a cruical omission. However, since the authors claim that the BLI in particular stands out as interpolation technique, this claim would need to be contrasted with other interpolation techniques to achieve similar results. 2.) There is little practical context for the bound of Sect. 3. All statements and proofs seem to be correct in Sect. 3 on my superficial reading, so this is not a problem. The problem is a) the proof compares to ground truth activations z(t) which are not available. At best we have the numerical solution of DOPRI5. If I understood the code correctly, the authors furthermore interpolate the DOPRI5 output to obtain the values at the Chebyshev grid which is then once more interpolated when needed during the adaptive step-size backward pass. All these error propagations are not considered in the presented theory. b) Even if this was presented, the user of Neural ODEs needs to know how the error compares to the reverse dynamic mode which is never directly evaluated in the manuscript, only indirectly through training time, #function evaluations and accuracy, which further includes the effects of stochastic optimization in the mix. Thus, the paper reserves two pages of theory that are fun to read, but not meaningful to the contributions of the paper.

Correctness: Claims seem to be correct. Empirical methodology is borderline, but would be acceptable. Hints to improve experiments: - comparison with ANODE could (should?) be included - Personally, I would like to see an evaluation of gradient accuracy with respect to grid points and tolerance for the RDM. To this end, I would like to see how gradient estimates converge when using lower and lower tolerances for the RDM and more and more grid points for the BLI. These convergence speeds could be contrasted with overall gradient variance through mini-batching. (If necessary, this could be done on synthetic data sets.) - An experiment should be conducted how much more function evaluations/wall clock time is needed to force DOPRI5 to include the Chebyshev grid vs. how the gradients differ if 'simple interpolation' is used vs. a RK smooth-interpolant. These experiments could take up the space of Sect. 3 and Sect. 3 could move in the appendix.

Clarity: Please have your submission proof-read for English style and grammar issues.

Relation to Prior Work: This is adequate.

Reproducibility: Yes

Additional Feedback: Since this is such an important problem and this really is standard numerical practice in other fields, I would be willing to be convinced that the paper should be accepted. To this end, I am looking for convincing answers to these questions in the rebuttal: a) Will the code be released upon acceptance? b) How do the authors plan to improve the manuscript with respect to weakness 1.)? Post-rebuttal update: I thank the authors for their feedback. I still maintain my position that the theorem does not support the flow of arguments of the main contribution. While the authors have suggested that based on these insights, regularizers could be formulated, this line of rationale is nowhere present in the rest of the manuscript. I think the clarity could further improved if this intention is clearly mentioned. I can follow their argument of the ANODE solver. I am looking forward to see the promised additional experiments with respect to solver tolerances, number of grid-points and piecewise-smooth interpolants. I am also looking forward to have a look at the code upon release. As a consequence of the rebuttal, I'll raise my score, but I'll lower my confidence, as I will only be able to see the promised additions conditioned on the acceptance decision.

Review 3

Summary and Contributions: This works tackles the problem of reversing a dynamical system during the adjoint method (or reverse dynamic method) for computing gradients of Neural ODEs. Instead of solving the state z(t) backwards in time, they propose replacing it with a barycentric Lagrange interpolation based on N points. These N points are computed during the forward pass for essentially free using an adaptive ODE solver and stored. The approach allows the user to change the memory required from O(1) to O(N) while increasing the accuracy of the state reconstruction. The authors also bound the error of this interpolation method, which grows wrt the gradient norms (Lipschitz constants) of f. Empirically, the proposed approach seems to be able to compute gradient marginally faster. -- Update -- I thank the authors for including an experiment on reversing a difficult system. I rescind my comment on the marginal time improvement, though it'd be best if the authors could include error bars.

Strengths: The method is simple and the code is provided.

Weaknesses: Though one motivation was being able to use this approach for ODE systems that are numerically unstable for the reverse dynamics method, this scenario was not tested. Notably, one aspect this paper has ignored is that the actual gradient vector (which is usually much larger compared to the state z(t)) still needs to be solved backwards in time using the reverse dynamic method, and it seems the computed gradients using both the reverse dynamic method and the integrated one are mostly the same, judging from the loss curves. So while the experiments show that there are fewer evaluations due to not having solve z(t) backwards, the paper would be stronger if it is actually applied to a system that the reverse dynamic method cannot solve. (For instance, an ODE that collapses to one point is easy to solve in the forward direction but difficult in the reverse direction.)

Correctness: I did not spot any mistakes.

Clarity: The structure and explanations in the paper are very well-written. The paper does have some small grammatical mistakes throughout. I would suggest the authors pass the text through a grammar checker.

Relation to Prior Work: A related work is SNODE: Spectral Discretization of Neural ODEs for System Identification [1] at ICLR 2020, where they use Legendre polynomials to approximate the ODE solution. [1] https://openreview.net/forum?id=Sye0XkBKvS

Reproducibility: Yes

Additional Feedback: The interpolation acts as an approximation for z(T). Is it possible to directly backpropagate through this approximated z(T) to obtain gradients with O(N) memory? If the interpolation sufficiently models the solution, this can forego solving the gradient vector.

Review 4

Summary and Contributions: This paper proposes an interpolation technique to speed up the approximation of gradients in ODEs via the adjoint method

Strengths: The authors present a bound on the error in their gradient approximation.

Weaknesses: The bound on the gradient error appears to not be particularly useful and rely on boundedness conditions on derivatives of the vector field, which are VERY strong.

Correctness: The overall methodology appears to be sound.

Clarity: The paper is mostly clearly written however at some places the word "the" appears to be missing, there"s some typos, abrupt changes in notation and unannounced notation overload.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Some comments: 1) The notation for the interpolant of z starts out being \hat{z} but then changes to \tilde{z}. 2) In the bound of E in Eq. (10) the second term does not have the approximation errors \Delta z or \Delta a as factors consequently, it can not be reduced by improving the approximation quality. This does not appear to be useful. 3) J starts being defined as df/d\theta but ends up being df/dz? ( Eq. (16)). 4) The boundedness condition on df/d\theta and d^2 f/d\theta dz is very strong. Can this even be verified in the considered applications?