Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
[Originality] This paper is among the first, if not the very first, to introduce a coupled ODE framework that takes a principled, neural-ODE-based approach to dynamically controlling the neural network parameters. There are previous papers on dynamic parameters in neural networks, but they are not closely related to neural ODEs and, in particular, do not examine the discretization issues in neural ODEs. Moreover, with only one additional coupled ODE, the presented framework can perform inference efficiently. [Quality] The theoretical part of this paper is sound and mostly self-contained. The authors present detailed experimental results with several instantiations of the proposed framework, which clearly show that the coupled ODE framework has an advantage over the original neural ODE. I believe the authors do a good job of evaluating both their own work and that of others. [Clarity] The paper is clearly written, with visualizations that help readers understand the proposed framework. Key equations for the variants of the coupled ODEs are provided. [Significance] The paper presents an efficient way to improve neural ODEs by allowing a separate dynamic weight-evolution ODE and coupling it with the original ODE. The performance gain suggests that this approach is effective, and the use of dynamic parameters in neural ODEs may inspire further work in applications that involve both dynamical systems and a need for adaptability (for example, robotics).
In Eqn 4 and Eqn 5, z_0 should carry an index i for the i-th training example; as written, the index appears only in the loss l_i but not in the dynamics. In the Neural ODE paper of , time dependence is explicitly included in the function f: dh/dt = f(h, theta, t); in the architecture, it enters through an extra channel for time. While the coupled ODE formulation of this paper is elegant, it is unclear whether this mechanism of introducing time-varying weights is better and, if so, precisely why. The paper grounds its motivation in the observations of : "other discretization schemes such as RK2 or RK4, or using more time steps does not affect the generalization performance of the model". However, it is not made clear precisely how the ANODEv2 approach resolves these problems. The reference to Turing's paper on The Chemical Basis of Morphogenesis has the wrong year. The diffusion-reaction-advection model for the convolutional weights is interesting and worth studying in greater detail. The baseline in the experiments is not described. Overall, the improvements over Neural ODE seem small but consistent. Which integrators are used in the implementation? Timings are not reported.
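The two mechanisms contrasted above can be sketched concretely. The following is a minimal forward-Euler toy (the dynamics f, g and all constants here are hypothetical illustrations, not the paper's architecture): a neural ODE where time enters f as an extra input, dh/dt = f(h, theta, t), versus a coupled system where the weights follow their own evolution ODE, dtheta/dt = g(theta).

```python
import numpy as np

def euler(step, state, t0, t1, n_steps):
    """Generic forward-Euler driver; `step` returns time derivatives of the state."""
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        state = tuple(s + dt * ds for s, ds in zip(state, step(state, t)))
        t += dt
    return state

# Variant 1: time as an extra input to f (original neural ODE style);
# the weights theta are fixed over the trajectory.
def f_time_input(state, t):
    (h,) = state
    theta = 1.0  # constant weight
    return (np.tanh(theta * h + t),)  # toy dynamics dh/dt = f(h, theta, t)

# Variant 2: coupled ODE; the weights evolve under their own dynamics g.
def f_coupled(state, t):
    h, theta = state
    dh = np.tanh(theta * h)   # dh/dt = f(h, theta(t))
    dtheta = -0.1 * theta     # dtheta/dt = g(theta), toy linear decay
    return (dh, dtheta)

h0 = np.array([0.5])
(h_T,) = euler(f_time_input, (h0,), 0.0, 1.0, 100)
h_T2, theta_T = euler(f_coupled, (h0, np.array([1.0])), 0.0, 1.0, 100)
```

In the first variant the vector field changes over time only because t is fed in as an input; in the second, the weights themselves are a state variable integrated alongside the activations, which is the structural difference at issue.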
Strengths: 1. The PDE-inspired formulation of the coupled ODE is very interesting and can enable deep learning applications to exploit decades of progress in efficiently solving particular classes of coupled equations. This is a very exciting connection discovered by the authors. 2. The general idea of allowing activations and weights to evolve (in particular, to evolve independently) is an interesting approach to enriching the neural ODE representation. Weaknesses: 1. The central contribution of modeling weight evolution with ODEs hinges on the claimed problem of neural ODEs being inaccurate when recomputing activations. It appears a previous paper first reported this issue. The reviewer is not convinced this problem is real, and the current paper provides neither a convincing analytical argument nor empirical evidence for it. 2. Leaving aside the claimed weakness of neural ODEs, the idea of modeling weight evolution as an ODE is itself very intellectually interesting and worthy of pursuit. But the empirical improvement reported in Table 1 over AlexNet, ResNet-4 and ResNet-10 is <= 1.75% for both configurations. The improvement from decoupling the weight evolution is in fact even smaller and not consistent: the improvement in ResNet for configuration 2 is smaller than that from keeping the evolution of parameters and activations aligned. The improvement in the ablation study over the neural ODE is also minimal. So the empirical case for the proposed approach is not convincing. 3. The derivation of optimality conditions for the coupled formulation is interesting because of its connection to a machine learning application (backpropagation), but it is a fairly standard textbook derivation from a dynamical systems / control point of view.
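For reference, the "standard textbook derivation" alluded to in point 3 is the adjoint (Pontryagin-style) condition for a terminal-time loss. For a generic coupled system (notation here is the reviewer's sketch, not necessarily the paper's), the first-order optimality conditions take roughly the form:

```latex
\begin{aligned}
\dot z &= f(z, \theta), \qquad \dot\theta = g(\theta),\\
\dot a_z &= -\left(\frac{\partial f}{\partial z}\right)^{\!\top} a_z,
  \qquad a_z(T) = \frac{\partial \ell}{\partial z(T)},\\
\dot a_\theta &= -\left(\frac{\partial f}{\partial \theta}\right)^{\!\top} a_z
  - \left(\frac{\partial g}{\partial \theta}\right)^{\!\top} a_\theta,
  \qquad a_\theta(T) = 0,
\end{aligned}
```

where a_z and a_\theta are the adjoint states propagated backward from t = T; integrating them backward recovers the gradients, i.e., continuous-time backpropagation.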