NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:2858
Title:Latent Ordinary Differential Equations for Irregularly-Sampled Time Series

Reviewer 1

Update after rebuttal: Thank you for your response. The inclusion of some more references, error bars, and hyperparameter details for experiments makes the paper stronger. I have raised my score to an 8.

Original review: This is a good paper. I have put a score of 7 but I'm happy to raise this to 8 if the authors address point 1 under "notes and questions" below and cite some earlier ODE-adjoint literature.

Originality - The combination of RNNs and neural ODEs is novel, as is the combination to form an encoder-decoder model with continuous-time latent state evolution. The ODE-RNN idea isn't *surprising* if you've read both the Neural ODE paper and the work on RNNs with exponentially decaying hidden states, but it's good to see it executed well and evaluated thoroughly. The encoder-decoder model and Poisson process likelihood ideas are less obvious (to me at least). The fact we can get a Poisson process likelihood with a single ODE solve is cool.

Quality: The paper seems technically sound. The experiments are thorough, and the authors compare to a variety of strong baselines. The proposed models are demonstrated to perform well. I haven't looked at the code, but it is encouraging that it is provided.

Significance: ODE dynamics for hidden and latent states, and the continuous-time Poisson process likelihood, both seem like they could become standard tools for combining deep learning with time series. There are a wide variety of situations where observations and predictions do not conform to a fixed grid, and this method makes more conceptual sense than previous approaches, and has demonstrated empirical advantages. I rate the significance as high.

Clarity - The paper was clear, easy to follow, and enjoyable to read. Thanks! I appreciated that the authors (i) provided clear detail of final hyperparameters and training in the appendix, (ii) provided code.

Notes and questions:

1.
The paper cites the neural ODE paper (Chen et al., 2018) stating "Chen et al showed how adjoint sensitivities can be used to compute exact, memory-efficient gradients w.r.t. theta, allowing ODE solutions to be used as a building block in larger models". It would be good to also assign credit to earlier adjoint method + ODEs work. As far as I understand Chen et al. did not develop or extend the adjoint state method itself, but used an NN as a parametric gradient function for an ODE, and showed this NN could be trained by following a long line of prior non-ML literature in using the adjoint state method to differentiate through the ODE solve. I suggest modifying the statement and also citing earlier work on the adjoint method for ODEs. Chen et al. themselves cite a recent paper from Stapor and Froehlich, but a cursory Google search shows papers which appear to use the adjoint state method to backpropagate through ODEs in essentially the same fashion since at least 2000; I wouldn't be surprised if essentially the same formula dates back to control literature somewhere in the 70s-90s. I think it's particularly important that the reference reflect this when the method is relatively new to the ML community.

This might seem like an overreaction to a reference but I think it's very important that the ML community cite / assign credit to the source of ideas outside our subfield. Firstly to be good citizens of the broader scientific community and to do right by the authors of prior work. Secondly for the sake of our readers and our own field; only citing papers within our field disconnects readers from related areas of science and might give junior researchers the impression they don’t need to read outside of NeurIPS, ICML, etc. Some papers in my current batch reinvent things which are well known in the engineering and control fields, which is an unfortunate waste of everyone's time.
I think this is a symptom of our community being at times too inwardly focussed, and perhaps avoidable if we more actively cite and give credit outside our (sub) field. We should also be clear about where useful techniques such as the adjoint state method come from, so that other researchers can best go searching for methods to introduce to our community.

2. For the encoder-decoder model, I wonder about encoding all the information from the encoded sequence into the initial latent state z0. This seems like it might cause optimization problems - it seems related to the "shooting method" used for initial value problems for ODEs (where one fixes dynamics and optimizes x(t0) which results in x(t1)). The shooting method has known issues when the dynamics are not extremely simple / stable / smooth, which seems almost certainly the case here. If this is indeed a potential issue, there is a long ODE literature on alternatives to overcome the drawbacks of the shooting method, which might be relevant for follow up work.

3. In the experiments section, there isn't much detail on data preparation / generation. I found this in the appendix - please at least put a note in the main paper telling the reader to look there.

4. There aren't comparisons to more traditional time series forecasting measures from stats (e.g. ARIMA). This criticism can unfortunately be applied to all of the deep learning papers I've read focussing on Physionet - it'd be really useful for the authors to include some traditional methods as baselines for the extrapolation tasks, so that we know for sure the deep learning methods win. Folklore has it that deep learning can still underperform traditional methods so it would be useful to have an empirical answer. Either some simple interpolation method could be used to project the observations to a grid, or perhaps there is some work from the time series community on continuous-time models (some extension of Kalman filtering? GARCH models?).

5.
Does your preparation of e.g. Physionet agree with or differ from that used by other papers, e.g. Che et al., and how/why?

6. Was hyperparameter search done manually, over a grid, ...? It would be good to record the (range of) hparam values tested for each model. (Useful for verifying hparam search is not biased towards exploring a 'good' range for the proposed model).

7. Conditional on some z0, can you easily use the Poisson process to sample different observation times as well as different observation values for those times?
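To make question 7 concrete: if the intensity λ(t) can be evaluated at any time and is bounded above on the window of interest, one standard way to draw event times from an inhomogeneous Poisson process is thinning (Lewis and Shedler). The sketch below is only an illustration of that general technique, not the authors' procedure. The names `sample_event_times`, `toy_intensity`, and `lam_max` are hypothetical, and the sinusoidal intensity is a stand-in for whatever intensity the model would produce conditional on z0.

```python
import math
import random


def sample_event_times(intensity, t_end, lam_max, rng):
    """Sample event times on [0, t_end] by thinning.

    Assumes 0 <= intensity(t) <= lam_max everywhere on the interval.
    """
    times = []
    t = 0.0
    while True:
        # Candidate arrivals come from a homogeneous process with rate lam_max.
        t += rng.expovariate(lam_max)
        if t > t_end:
            return times
        # Keep each candidate with probability intensity(t) / lam_max.
        if rng.random() * lam_max <= intensity(t):
            times.append(t)


def toy_intensity(t):
    # Hypothetical stand-in for an intensity read off a latent trajectory.
    return 2.0 + 1.5 * math.sin(t)


rng = random.Random(0)
event_times = sample_event_times(toy_intensity, t_end=10.0, lam_max=3.5, rng=rng)
```

Observation values could then be decoded at the sampled times. Whether a usable upper bound on the learned intensity is available in practice is a separate question for the authors.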

Reviewer 2

- In terms of originality, the main contribution of the paper is the introduction of the ODE-RNN, which is novel if not extremely original. This model is a natural application of the Neural ODE. It does extend in an elegant fashion previous approaches for irregularly-sampled time-series [Mozer 17]. The proposed Latent ODE model relying on ODE-RNN is an immediate extension of the work of [Chen 18], and is less original than the ODE-RNN.
- Clarity: overall, the paper is well written and easy to follow.
- Quality: poor. The main concerns I have for this paper are related to the experimental section. The toy experiments are poorly conducted, are not convincing and bring little insight into the qualitative properties of the model. Interpolation and extrapolation errors are mildly convincing - see detailed remarks. The only convincing experiment is the classification task on the PhysioNet dataset.
- Significance: the ideas proposed in this paper could be quite significant, since the problem addressed arises in many time-series applications.

** Update after the authors' response **

I thank the authors for their response. This response addresses several experimental problems listed in the review, especially by providing error bars, which help appreciate the significance of their results and raise the quality of the paper. Provided the toy experiments are better explained in the main paper, I vote in favor of accepting this paper.

Reviewer 3

I like the experimental approach in trying both auto-encoding (useful for interpolation and imputation) and extrapolation of time series, and encoding the initial condition of the Latent ODE accordingly (respectively, backwards from the last observation when doing auto-encoding, or forward to the last observation before prediction of the continuation of the time series). While the submission is original and clearly written, the following comments address a few remaining questions:

In the Latent ODE models, how does the RNN encoder handle irregularly-spaced inputs {x_i, t_i}_{i=0,... N_2} or {x_i, t_i}_{i=0,... N}? Does it work like a plain RNN with regularly spaced inputs, as lines 102-103 suggest, i.e., that irregularly spaced inputs are fed to the RNN as if they were regularly spaced? In particular, I was puzzled by the poor performance of the Latent ODE with plain RNN encoder on the toy dataset extrapolation task with 20 input points, and wonder whether this was an artefact of seemingly noisy input data. Fig. 2 in the supplement shows better extrapolation when the initial condition z_0 is conditioned on 80 input points.

I am wondering if the ODE would better extrapolate with a smaller state variable, to make the prediction of the initial condition easier, especially given that the modelled dynamics are simple oscillations.

What are g_mu and g_sigma in Algorithm 2, section 2.2 of the supplement, and do they differ from z'_0? My understanding is that the ODE-RNN encoder produces two values z'_0 for each latent variable: the mean and the variance, to sample from using a VAE, but what are these intermediary values?

The Physionet, MuJoCo and human activity datasets are all evaluated on auto-regressive RNNs.
It is not clear from the paper what the inputs and outputs of those RNNs are - I assume that, as in typical dynamical modeling, x_{t-1} is the input, and \hat x_{t} is the output at time t, with a recurrent state of various dimensions specified in section 5 of the supplement.

A naïve question would be: why not use a simple Neural ODE, with a state variable of the same dimension as the observations, as an additional baseline? Also, why not use an ODE-RNN as decoder in the Latent ODE? The inputs could for instance consist of the observed values x(t_i), and those inputs could correct the predictions of the decoder.

Minor comments: There are two references for Sutskever 2014. The bibliography style appears to be wrong (as NeurIPS uses numbers for references, not full names and years) - some space could be gained for more content.
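To illustrate the question above about irregularly spaced inputs: a recurrent encoder that sees only the values x_i produces the same state whatever the gaps between observations, whereas one whose hidden state changes between observations is sensitive to the spacing. The sketch below is a toy illustration under stated assumptions, not the paper's encoder. It substitutes a fixed exponential decay for the learned ODE dynamics, and `time_deltas`, `decay_hidden`, `rnn_cell`, `encode`, and the weights are all hypothetical.

```python
import math


def time_deltas(times):
    """Gap since the previous observation; the first gap is defined as 0."""
    return [0.0] + [t1 - t0 for t0, t1 in zip(times, times[1:])]


def decay_hidden(h, dt, tau=1.0):
    """Exponentially decay each hidden unit over a gap of length dt."""
    scale = math.exp(-dt / tau)
    return [scale * v for v in h]


def rnn_cell(h, x, w_h=0.5, w_x=1.0):
    """Toy elementwise recurrent update with hypothetical fixed weights."""
    return [math.tanh(w_h * hv + w_x * x) for hv in h]


def encode(times, values, hidden_size=2, use_decay=True):
    h = [0.0] * hidden_size
    for dt, x in zip(time_deltas(times), values):
        if use_decay:
            h = decay_hidden(h, dt)  # state changes between observations
        h = rnn_cell(h, x)           # state is updated at the observation
    return h


times = [0.0, 0.1, 0.9, 1.0, 3.0]
values = [1.0, 0.8, -0.2, -0.4, 0.5]
h_plain = encode(times, values, use_decay=False)  # ignores the spacing
h_decay = encode(times, values, use_decay=True)   # depends on the spacing
```

With `use_decay=False` the result is identical for any choice of `times`, which is the behaviour the question about lines 102-103 is probing. Another common option, not shown here, is to append the gap as an extra input feature to the cell.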