NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
This paper proposes a neural network architecture that allows fast computation of dimension-wise derivatives, regardless of dimensionality. The proposed modification of the computation graph yields diagonal Jacobians (as well as higher-order derivatives) that can be computed in a single pass. The model also generalizes a number of recently proposed flow models, which makes it a significant contribution to this line of research. The authors show that their idea can be combined with implicit ODE methods, resulting in faster convergence on stiff dynamics. Continuous-time normalizing flows are also improved by computing the exact trace within the proposed framework. Finally, DiffOpNet proves useful for the inference of SDEs, which is known to be a difficult problem.

The paper is very well written and easy to follow; I really enjoyed reading it. The proposed idea seems fairly straightforward, yet it is well thought out and developed throughout the paper. Beyond the idea itself, the paper is also novel in that it brings together recent advances in automatic differentiation with a neural network construction. I believe we will be seeing more papers taking advantage of AD graphs in the near future. Hence, this paper, being one of the first, should definitely be accepted. To check my own understanding of the single-pass claim, I include a small sketch at the end of this review.

Comments on the results:
- The results in Figure 1 are intuitive and really impressive, but I do wonder how you identified the stiffness of the ODE systems.
- I am particularly impressed by the improvement upon CNF in Table 1.
- Did you explore SDE systems with dimensionality d > 2?
- One restriction of the SDE inference is that the diffusion is diagonal. Could your framework still be used if it weren't?
- Could you comment on the execution times?
- Could you give a reference for your claim that setting k=5000 and computing (13) is the standard way of computing NLL?

Replies to the above questions could also be included in the paper.
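To make the efficiency claim concrete for myself, here is a minimal sketch of how I understand the single-pass diagonal computation, contrasted with the naive per-dimension backward passes. This is my own illustration, not the authors' implementation; `tau` and `conditioner` are placeholder names for the dimension-wise transformation and the hidden-state network described in the paper.

    import torch

    def jacobian_diagonal_naive(f, x):
        # Generic network f: R^D -> R^D; one backward pass per dimension, O(D) cost.
        x = x.detach().requires_grad_(True)
        y = f(x)
        diag = []
        for i in range(x.shape[-1]):
            (grad_i,) = torch.autograd.grad(y[..., i].sum(), x, retain_graph=True)
            diag.append(grad_i[..., i])
        return torch.stack(diag, dim=-1)

    def jacobian_diagonal_single_pass(tau, conditioner, x):
        # Assumes f_i = tau_i(x_i, h_i) with h_i depending only on x_{\i}.
        # Detaching h blocks every cross-dimension path, so the gradient of
        # sum_i f_i with respect to x is exactly the Jacobian diagonal,
        # obtained in a single backward pass.
        x = x.detach().requires_grad_(True)
        h = conditioner(x).detach()
        y = tau(x, h)
        (diag,) = torch.autograd.grad(y.sum(), x, create_graph=True)
        return diag

If I understand correctly, repeating the same step on `diag` (hence `create_graph=True`) would give the dimension-wise second derivatives in one more pass, which would match the higher-order claim as well.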
Reviewer 2
UPDATE: Many thanks to the authors for their response. I understand that training a neural ODE with an unconstrained network and exact trace evaluation is expensive in high dimensions. I remain positive about the paper, and I'm happy to keep my score of 8. I wish the authors the best of luck and I look forward to seeing what comes next!

Summary: The paper presents a new architecture for neural networks with D inputs and D outputs, called DiffOpNet. DiffOpNets have the property that differential operators such as the divergence or the Jacobian diagonal can be computed efficiently on them (whereas in general they would need O(D) backprop passes). The paper presents three applications of DiffOpNets in solving, evaluating and learning differential equations; thanks to the use of DiffOpNets, the necessary differential operators can be computed efficiently.

Originality: The idea of DiffOpNets is original and clever. DiffOpNets have some similarities to autoregressive transformations, which the paper clearly describes. The three applications of DiffOpNets (Jacobi-Newton iterations, exact log density of neural ODEs, learning SDEs) are original and exciting. Overall, the originality of the paper is high.

Quality: The paper is of high technical quality, and includes sufficient mathematical detail to demonstrate the correctness of the proposed methods. I think a weakness of the paper is that it doesn't sufficiently evaluate the extent to which the architectural constraints of DiffOpNets impact their performance. In particular, it would have been nice if section 4 contained an experiment that compared the performance (e.g. on density estimation) of a neural ODE that uses a DiffOpNet with a neural ODE that uses an unconstrained network of comparable size but computes the trace exactly (I sketch the baselines I have in mind at the end of this review). That way, we could directly measure the impact of the architectural constraints on performance, as measured by e.g. test log-likelihood.

Lines 61-63 compare the expressiveness of DiffOpNets to autoregressive and coupling transformations. As I understand it, the implied argument is that DiffOpNets impose fewer architectural constraints than autoregressive and coupling transforms, and since flows based on these transforms are expressive enough in practice, DiffOpNets should be too. However, I think this is a flawed argument: autoregressive and coupling transforms are composable, and even though each individual transform is weak, their composition can be expressive. On the other hand, it doesn't seem to me that DiffOpNets are composable; if we stack two DiffOpNets, the resulting model wouldn't have the same properties as a DiffOpNet.

Section 5 evaluates Fokker-Planck matching on a rather toy and very low-dimensional experiment (two dimensions). As I understand it, the main claim of that section is that Fokker-Planck matching with DiffOpNets scales better with dimensionality than other methods. Therefore, I think a higher-dimensional experiment would strengthen this claim.

Clarity: The paper is very well written, and I enjoyed reading it. Having the main method in section 2 and three stand-alone applications in sections 3, 4 & 5 works really well as a structure.

Significance: I think DiffOpNets are a significant contribution, with various important applications in differential equations, as was clearly demonstrated. I think the significance of the paper would have been higher if it included a clearer evaluation of the impact of the architectural constraints on network performance.
In any case I think that the paper deserves to be communicated to the rest of the NeurIPS community and I happily recommend acceptance.
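For concreteness, these are the unconstrained-network baselines I have in mind for the comparison suggested under Quality: the exact trace via one vector-Jacobian product per dimension, and the stochastic trace estimator commonly used by continuous normalizing flows. This is only a sketch of my own; the function names are mine, not from the paper.

    import torch

    def divergence_exact(f, x):
        # Exact tr(df/dx) for an unconstrained f: R^D -> R^D, using D vjp passes.
        x = x.detach().requires_grad_(True)
        y = f(x)
        div = 0.0
        for i in range(x.shape[-1]):
            (grad_i,) = torch.autograd.grad(y[..., i].sum(), x, retain_graph=True)
            div = div + grad_i[..., i]
        return div

    def divergence_hutchinson(f, x, n_samples=1):
        # Unbiased estimate tr(J) = E_v[v^T J v] with Rademacher v, one vjp per sample.
        x = x.detach().requires_grad_(True)
        y = f(x)
        est = 0.0
        for _ in range(n_samples):
            v = torch.randint_like(y, low=0, high=2) * 2 - 1
            (vjp,) = torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True)
            est = est + (vjp * v).sum(dim=-1)
        return est / n_samples

Comparing a DiffOpNet-based CNF against an unconstrained CNF trained with the exact divergence (where affordable) would isolate the cost of the architectural constraint.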
Reviewer 3
The paper studies the problem of efficient derivative computation. The advantage is that, for the given structure, the Jacobian is a diagonal matrix and the vector-Jacobian multiplication reduces to a vector inner product (see the worked identity after this review). The major improvement of the given method is computational efficiency. In the experiments, why is there no wall-clock time result to demonstrate this efficiency? And why should a more efficient gradient calculation lead to a better convergence point of the optimization objective?

The paper is slightly outside the reviewer's area, but the reviewer cannot get a clear understanding from the paper alone, probably due to some ambiguous notation:
i) On page 2, it says that setting h_i = x_\i recovers a standard neural net, but doing so gives f_i = \tau_i(x), whereas a standard neural net should be (f_1, …, f_d) = \tau(x); here there appear to be d separate networks?
ii) What is g(x, h) on page 2? Is it \tau(x, h)?

In general, the claimed improvements are not clearly reflected in the experiments, and improved writing is desired.

######
I have read the authors' rebuttal and have changed my rating accordingly.
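For reference, the identity the reviewer has in mind (the reviewer's own restatement, assuming the Jacobian \partial f / \partial x is indeed diagonal as described above) is:

    v^\top \frac{\partial f}{\partial x}
      = \left( v_1 \frac{\partial f_1}{\partial x_1}, \dots, v_d \frac{\partial f_d}{\partial x_d} \right),
    \qquad
    \operatorname{tr}\!\left( \frac{\partial f}{\partial x} \right)
      = \sum_{i=1}^{d} \frac{\partial f_i}{\partial x_i}.

So a single reverse-mode pass with v = (1, …, 1) recovers all dimension-wise derivatives at once, rather than requiring one pass per dimension.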