{"title": "Neural Ordinary Differential Equations", "book": "Advances in Neural Information Processing Systems", "page_first": 6571, "page_last": 6583, "abstract": "We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a blackbox differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.", "full_text": "Neural Ordinary Differential Equations\n\nRicky T. Q. Chen*, Yulia Rubanova*, Jesse Bettencourt*, David Duvenaud\n\nUniversity of Toronto, Vector Institute\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\n\nAbstract\n\nWe introduce a new family of deep neural network models. Instead of specifying a\ndiscrete sequence of hidden layers, we parameterize the derivative of the hidden\nstate using a neural network. The output of the network is computed using a black-\nbox differential equation solver. 
These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.

1 Introduction

Models such as residual networks, recurrent neural network decoders, and normalizing flows build complicated transformations by composing a sequence of transformations to a hidden state:

    h_{t+1} = h_t + f(h_t, θ_t)    (1)

where t ∈ {0, ..., T} and h_t ∈ R^D. These iterative updates can be seen as an Euler discretization of a continuous transformation (Lu et al., 2017; Haber and Ruthotto, 2017; Ruthotto and Haber, 2018).

What happens as we add more layers and take smaller steps? In the limit, we parameterize the continuous dynamics of hidden units using an ordinary differential equation (ODE) specified by a neural network:

    dh(t)/dt = f(h(t), t, θ)    (2)

Figure 1: Left: A residual network defines a discrete sequence of finite transformations. Right: An ODE network defines a vector field, which continuously transforms the state. Both: Circles represent evaluation locations.

Starting from the input layer h(0), we can define the output layer h(T) to be the solution to this ODE initial value problem at some time T. This value can be computed by a black-box differential equation solver, which evaluates the hidden unit dynamics f wherever necessary to determine the solution with the desired accuracy.
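To make the contrast concrete, here is a minimal sketch of the Euler update in Eq. (1) next to handing the same dynamics to a black-box adaptive solver as in Eq. (2). The linear vector field and SciPy's generic `solve_ivp` are illustrative stand-ins, not the paper's actual models or solver:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy hidden-state dynamics f(h): a hypothetical linear vector field.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

def f(t, h):
    return A @ h

h0 = np.array([1.0, 0.0])
T, dt = 100, 0.01          # 100 "layers" of step size 0.01

# Eq. (1): residual-style Euler discretization, one update per layer.
h = h0.copy()
for _ in range(T):
    h = h + dt * f(0.0, h)  # h_{t+1} = h_t + f(h_t)

# Eq. (2): the same dynamics handed to a black-box adaptive solver.
sol = solve_ivp(f, (0.0, T * dt), h0, rtol=1e-6, atol=1e-9)
h_ode = sol.y[:, -1]
```

For this rotation field the exact solution at t = 1 is (cos 1, −sin 1); the Euler iterate only approaches it as dt shrinks, while the solver meets the requested tolerance by choosing its own evaluation points.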
Figure 1 contrasts these two approaches.

Defining and evaluating models using ODE solvers has several benefits:

Memory efficiency In Section 2, we show how to compute gradients of a scalar-valued loss with respect to all inputs of any ODE solver, without backpropagating through the operations of the solver. Not storing any intermediate quantities of the forward pass allows us to train our models with constant memory cost as a function of depth, a major bottleneck of training deep models.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Adaptive computation Euler's method is perhaps the simplest method for solving ODEs. There have since been more than 120 years of development of efficient and accurate ODE solvers (Runge, 1895; Kutta, 1901; Hairer et al., 1987). Modern ODE solvers provide guarantees about the growth of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with problem complexity. After training, accuracy can be reduced for real-time or low-power applications.

Parameter efficiency When the hidden unit dynamics are parameterized as a continuous function of time, the parameters of nearby "layers" are automatically tied together.
In Section 3, we show that this reduces the number of parameters required on a supervised learning task.

Scalable and invertible normalizing flows An unexpected side-benefit of continuous transformations is that the change of variables formula becomes easier to compute. In Section 4, we derive this result and use it to construct a new class of invertible density models that avoids the single-unit bottleneck of normalizing flows, and can be trained directly by maximum likelihood.

Continuous time-series models Unlike recurrent neural networks, which require discretizing observation and emission intervals, continuously-defined dynamics can naturally incorporate data which arrives at arbitrary times. In Section 5, we construct and demonstrate such a model.

2 Reverse-mode automatic differentiation of ODE solutions

The main technical difficulty in training continuous-depth networks is performing reverse-mode differentiation (also known as backpropagation) through the ODE solver. Differentiating through the operations of the forward pass is straightforward, but incurs a high memory cost and introduces additional numerical error.

We treat the ODE solver as a black box, and compute gradients using the adjoint sensitivity method (Pontryagin et al., 1962). This approach computes gradients by solving a second, augmented ODE backwards in time, and is applicable to all ODE solvers. It scales linearly with problem size, has low memory cost, and explicitly controls numerical error.

Consider optimizing a scalar-valued loss function L(·), whose input is the result of an ODE solver:

    L(z(t1)) = L( z(t0) + ∫_{t0}^{t1} f(z(t), t, θ) dt ) = L(ODESolve(z(t0), f, t0, t1, θ))    (3)

To optimize L, we require gradients with respect to θ. The first step is to determine how the gradient of the loss depends on the hidden state z(t) at each instant.
This quantity is called the adjoint a(t) = ∂L/∂z(t). Its dynamics are given by another ODE, which can be thought of as the instantaneous analog of the chain rule:

    da(t)/dt = −a(t)^T ∂f(z(t), t, θ)/∂z    (4)

We can compute ∂L/∂z(t0) by another call to an ODE solver. This solver must run backwards, starting from the initial value of ∂L/∂z(t1). One complication is that solving this ODE requires knowing the value of z(t) along its entire trajectory. However, we can simply recompute z(t) backwards in time together with the adjoint, starting from its final value z(t1).

Computing the gradients with respect to the parameters θ requires evaluating a third integral, which depends on both z(t) and a(t):

    dL/dθ = −∫_{t1}^{t0} a(t)^T ∂f(z(t), t, θ)/∂θ dt    (5)

Figure 2: Reverse-mode differentiation of an ODE solution. The adjoint sensitivity method solves an augmented ODE backwards in time. The augmented system contains both the original state and the sensitivity of the loss with respect to the state. If the loss depends directly on the state at multiple observation times, the adjoint state must be updated in the direction of the partial derivative of the loss with respect to each observation.

The vector-Jacobian products a(t)^T ∂f/∂z and a(t)^T ∂f/∂θ in (4) and (5) can be efficiently evaluated by automatic differentiation, at a time cost similar to that of evaluating f. All integrals for solving z, a and ∂L/∂θ can be computed in a single call to an ODE solver, which concatenates the original state, the adjoint, and the other partial derivatives into a single vector.
Algorithm 1 shows how to construct the necessary dynamics, and call an ODE solver to compute all gradients at once.

Algorithm 1 Reverse-mode derivative of an ODE initial value problem
Input: dynamics parameters θ, start time t0, stop time t1, final state z(t1), loss gradient ∂L/∂z(t1)

    s0 = [z(t1), ∂L/∂z(t1), 0_{|θ|}]                                   ▷ Define initial augmented state
    def aug_dynamics([z(t), a(t), ·], t, θ):                           ▷ Define dynamics on augmented state
        return [f(z(t), t, θ), −a(t)^T ∂f/∂z, −a(t)^T ∂f/∂θ]           ▷ Compute vector-Jacobian products
    [z(t0), ∂L/∂z(t0), ∂L/∂θ] = ODESolve(s0, aug_dynamics, t1, t0, θ)  ▷ Solve reverse-time ODE
    return ∂L/∂z(t0), ∂L/∂θ                                            ▷ Return gradients

Most ODE solvers have the option to output the state z(t) at multiple times. When the loss depends on these intermediate states, the reverse-mode derivative must be broken into a sequence of separate solves, one between each consecutive pair of output times (Figure 2). At each observation, the adjoint must be adjusted in the direction of the corresponding partial derivative ∂L/∂z(ti).

The results above extend those of Stapor et al. (2018, section 2.4.2). An extended version of Algorithm 1 including derivatives w.r.t. t0 and t1 can be found in Appendix C. Detailed derivations are provided in Appendix B. Appendix D provides Python code which computes all derivatives for scipy.integrate.odeint by extending the autograd automatic differentiation package. This code also supports all higher-order derivatives.
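As a sanity check, Algorithm 1 can be run by hand on a toy scalar ODE where the vector-Jacobian products are available in closed form. The dynamics dz/dt = θz and the loss L = z(t1) are illustrative choices, and SciPy's `solve_ivp` stands in for ODESolve:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy dynamics dz/dt = f(z, θ) = θ z, so ∂f/∂z = θ and ∂f/∂θ = z.
theta = 0.7
z0, t0, t1 = 1.5, 0.0, 2.0

def f(z):
    return theta * z

# Forward solve for the final state z(t1).
z1 = solve_ivp(lambda t, z: [f(z[0])], (t0, t1), [z0],
               rtol=1e-10, atol=1e-12).y[0, -1]

# Loss L = z(t1), so a(t1) = ∂L/∂z(t1) = 1. Augmented state s = [z, a, dL/dθ];
# z is recomputed backwards alongside the adjoint, nothing is stored.
def aug_dynamics(t, s):
    z, a, _ = s
    return [f(z),        # dz/dt
            -a * theta,  # da/dt       = -a ∂f/∂z
            -a * z]      # d(dL/dθ)/dt = -a ∂f/∂θ

# Solve the augmented ODE backwards from t1 to t0, as in Algorithm 1.
s = solve_ivp(aug_dynamics, (t1, t0), [z1, 1.0, 0.0],
              rtol=1e-10, atol=1e-12).y[:, -1]
dLdz0, dLdtheta = s[1], s[2]
```

Here z(t1) = z0·e^{θ(t1−t0)}, so the analytic gradients are ∂L/∂z(t0) = e^{θ(t1−t0)} and dL/dθ = (t1−t0)·z(t1), which the backward solve reproduces without storing the forward trajectory.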
We have since released a PyTorch (Paszke et al., 2017) implementation, including GPU-based implementations of several standard ODE solvers, at github.com/rtqichen/torchdiffeq.

3 Replacing residual networks with ODEs for supervised learning

In this section, we experimentally investigate the training of neural ODEs for supervised learning.

Software To solve ODE initial value problems numerically, we use the implicit Adams method implemented in LSODE and VODE and interfaced through the scipy.integrate package. Being an implicit method, it has better guarantees than explicit methods such as Runge-Kutta, but requires solving a nonlinear optimization problem at every step. This setup makes direct backpropagation through the integrator difficult. We implement the adjoint sensitivity method in Python's autograd framework (Maclaurin et al., 2015). For the experiments in this section, we evaluated the hidden state dynamics and their derivatives on the GPU using TensorFlow, which were then called from the Fortran ODE solvers, which were called from Python autograd code.

Model Architectures We experiment with a small residual network which downsamples the input twice then applies 6 standard residual blocks (He et al., 2016b), which are replaced by an ODESolve module in the ODE-Net variant. We also test a network with the same architecture but where gradients are backpropagated directly through a Runge-Kutta integrator, referred to as RK-Net. Table 1 shows test error, number of parameters, and memory cost. L denotes the number of layers in the ResNet, and L̃ is the number of function evaluations that the ODE solver requests in a single forward pass, which can be interpreted as an implicit number of layers.

Table 1: Performance on MNIST. †From LeCun et al. (1998).

    Model         Test Error   # Params   Memory   Time
    1-Layer MLP†  1.60%        0.24 M     -        -
    ResNet        0.41%        0.60 M     O(L)     O(L)
    RK-Net        0.47%        0.22 M     O(L̃)    O(L̃)
    ODE-Net       0.42%        0.22 M     O(1)     O(L̃)

We find that ODE-Nets and RK-Nets can achieve around the same performance as the ResNet, while using fewer parameters. For reference, a neural net with a single hidden layer of 300 units has around the same number of parameters as the ODE-Net and RK-Net architectures that we tested.

Error Control in ODE-Nets ODE solvers can approximately ensure that the output is within a given tolerance of the true solution. Changing this tolerance changes the behavior of the network. We first verify that error can indeed be controlled in Figure 3a. The time spent by the forward call is proportional to the number of function evaluations (Figure 3b), so tuning the tolerance gives us a trade-off between accuracy and computational cost. One could train with high accuracy, but switch to a lower accuracy at test time.

Figure 3: Statistics of a trained ODE-Net. (NFE = number of function evaluations.)

Figure 3c shows a surprising result: the number of evaluations in the backward pass is roughly half of the forward pass. This suggests that the adjoint sensitivity method is not only more memory efficient, but also more computationally efficient than directly backpropagating through the integrator, because the latter approach will need to backprop through each function evaluation in the forward pass.

Network Depth It's not clear how to define the 'depth' of an ODE solution.
A related quantity is the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver and dependent on the initial state or input. Figure 3d shows that the number of function evaluations increases throughout training, presumably adapting to increasing complexity of the model.

4 Continuous Normalizing Flows

The discretized equation (1) also appears in normalizing flows (Rezende and Mohamed, 2015) and the NICE framework (Dinh et al., 2014). These methods use the change of variables theorem to compute exact changes in probability if samples are transformed through a bijective function f:

    z1 = f(z0)  ⟹  log p(z1) = log p(z0) − log |det ∂f/∂z0|    (6)

An example is the planar normalizing flow (Rezende and Mohamed, 2015):

    z(t + 1) = z(t) + u h(w^T z(t) + b)    (7)

    log p(z(t + 1)) = log p(z(t)) − log |1 + u^T ∂h/∂z|    (8)

Generally, the main bottleneck to using the change of variables formula is computing the determinant of the Jacobian ∂f/∂z, which has a cubic cost in either the dimension of z, or the number of hidden units. Recent work explores the tradeoff between the expressiveness of normalizing flow layers and computational cost (Kingma et al., 2016; Tomczak and Welling, 2016; Berg et al., 2018).

Surprisingly, moving from a discrete set of layers to a continuous transformation simplifies the computation of the change in normalizing constant:

Theorem 1 (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability p(z(t)) dependent on time. Let dz/dt = f(z(t), t) be a differential equation describing a continuous-in-time transformation of z(t).
Assuming that f is uniformly Lipschitz continuous in z and continuous in t, the change in log probability also follows a differential equation:

    ∂log p(z(t))/∂t = −tr( df/dz(t) )

Proof in Appendix A. Instead of the log determinant in (6), we now only require a trace operation. Also unlike standard finite flows, the differential equation f does not need to be bijective, since if uniqueness is satisfied, then the entire transformation is automatically bijective.

As an example application of the instantaneous change of variables, we can examine the continuous analog of the planar flow, and its change in normalization constant:

    dz(t)/dt = u h(w^T z(t) + b),    ∂log p(z(t))/∂t = −u^T ∂h/∂z(t)    (9)

Given an initial distribution p(z(0)), we can sample from p(z(t)) and evaluate its density by solving this combined ODE.

Using multiple hidden units with linear cost While det is not a linear function, the trace function is, which implies tr(Σ_n J_n) = Σ_n tr(J_n). Thus if our dynamics are given by a sum of functions, then the differential equation for the log density is also a sum:

    dz(t)/dt = Σ_{n=1}^{M} f_n(z(t)),    d log p(z(t))/dt = Σ_{n=1}^{M} tr( ∂f_n/∂z )    (10)

This means we can cheaply evaluate flow models having many hidden units, with a cost only linear in the number of hidden units M. Evaluating such 'wide' flow layers using standard normalizing flows costs O(M^3), meaning that standard NF architectures use many layers of only a single hidden unit.

Time-dependent dynamics We can specify the parameters of a flow as a function of t, making the differential equation f(z(t), t) change with t. This parameterization is a kind of hypernetwork (Ha et al., 2016).
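The combined ODE of Eq. (9) can be sketched by augmenting the state with the log-density and integrating both together. The parameters u, w, b and the starting point below are arbitrary illustrative values, with h = tanh so that tr(∂f/∂z) = (u·w) h′(s):

```python
import numpy as np
from scipy.integrate import solve_ivp

# A single continuous planar-flow unit (Eq. 9) in 2-D, with h = tanh.
u = np.array([0.6, -0.4])
w = np.array([1.0, 1.0])
b = 0.0

def aug(t, state):
    z = state[:2]
    s = w @ z + b
    dz = u * np.tanh(s)
    # ∂log p/∂t = -tr(∂f/∂z) = -(u·w) h'(s), with h'(s) = 1 - tanh²(s)
    dlogp = -(u @ w) * (1.0 - np.tanh(s) ** 2)
    return np.concatenate([dz, [dlogp]])

# Start from a standard-normal sample and its log-density, integrate to t = 1.
z0 = np.array([0.3, -0.2])
logp0 = -0.5 * (z0 @ z0) - np.log(2 * np.pi)
out = solve_ivp(aug, (0.0, 1.0), np.concatenate([z0, [logp0]]),
                rtol=1e-10, atol=1e-12).y[:, -1]
z1, logp1 = out[:2], out[2]
```

The resulting logp1 agrees with the discrete change of variables applied to the flow map z0 → z1, which can be verified by finite-differencing the map's Jacobian: log p(z1) = log p(z0) − log |det ∂z1/∂z0|.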
We also introduce a gating mechanism for each hidden unit, dz/dt = Σ_n σ_n(t) f_n(z), where σ_n(t) ∈ (0, 1) is a neural network that learns when the dynamic f_n(z) should be applied. We call these models continuous normalizing flows (CNF).

4.1 Experiments with Continuous Normalizing Flows

We first compare continuous and discrete planar flows at learning to sample from a known distribution. We show that a planar CNF with M hidden units can be at least as expressive as a planar NF with K = M layers, and sometimes much more expressive.

Density matching We configure the CNF as described above, and train for 10,000 iterations using Adam (Kingma and Ba, 2014). In contrast, the NF is trained for 500,000 iterations using RMSprop (Hinton et al., 2012), as suggested by Rezende and Mohamed (2015). For this task, we minimize KL(q(x) || p(x)) as the loss function, where q is the flow model and the target density p(·) can be evaluated. Figure 4 shows that CNF generally achieves lower loss.

Figure 4: Comparison of normalizing flows versus continuous normalizing flows. The model capacity of normalizing flows is determined by their depth (K), while continuous normalizing flows can also increase capacity by increasing width (M), making them easier to train. Panels: (a) Target, (b) NF with K = 2, 8, 32, (c) CNF with M = 2, 8, 32, (d) Loss vs. K/M.

Maximum Likelihood Training A useful property of continuous-time normalizing flows is that we can compute the reverse transformation for about the same cost as the forward pass, which cannot be said for normalizing flows. This lets us train the flow on a density estimation task by performing maximum likelihood estimation, which maximizes E_{p(x)}[log q(x)], where q(·) is computed using the appropriate change of variables theorem, then afterwards reverse the CNF to generate random samples from q(x).

For this task, we use 64 hidden units for CNF, and 64 stacked one-hidden-unit layers for NF. Figure 5 shows the learned dynamics. Instead of showing the initial Gaussian distribution, we display the transformed distribution after a small amount of time, which shows the locations of the initial planar flows. Interestingly, to fit the Two Circles distribution, the CNF rotates the planar flows so that the particles can be evenly spread into circles. While the CNF transformations are smooth and interpretable, we find that NF transformations are very unintuitive, and this model has difficulty fitting the two moons dataset in Figure 5b.

Figure 5: Visualizing the transformation from noise to data. Continuous-time normalizing flows are reversible, so we can train on a density estimation task and still be able to sample from the learned density efficiently. Panels (a) Two Circles and (b) Two Moons show the NF and CNF density and samples at 5%, 20%, 40%, 60%, 80%, and 100% of the transformation, alongside the target.

5 A generative latent function time-series model

Applying neural networks to irregularly-sampled data such as medical records, network traffic, or neural spiking data is difficult.
Typically, observations are put into bins of fixed duration, and the latent dynamics are discretized in the same way. This leads to difficulties with missing data and ill-defined latent variables. Missing data can be addressed using generative time-series models (Álvarez and Lawrence, 2011; Futoma et al., 2017; Mei and Eisner, 2017; Soleimani et al., 2017a) or data imputation (Che et al., 2018). Another approach concatenates time-stamp information to the input of an RNN (Choi et al., 2016; Lipton et al., 2016; Du et al., 2016; Li, 2017).

We present a continuous-time, generative approach to modeling time series. Our model represents each time series by a latent trajectory. Each trajectory is determined from a local initial state, z_{t0}, and a global set of latent dynamics shared across all time series. Given observation times t0, t1, ..., tN and an initial state z_{t0}, an ODE solver produces z_{t1}, ..., z_{tN}, which describe the latent state at each observation. We define this generative model formally through a sampling procedure:

    z_{t0} ~ p(z_{t0})    (11)
    z_{t1}, z_{t2}, ..., z_{tN} = ODESolve(z_{t0}, f, θ_f, t0, ..., tN)    (12)
    each x_{ti} ~ p(x | z_{ti}, θ_x)    (13)

Function f is a time-invariant function that takes the value z at the current time step and outputs the gradient: ∂z(t)/∂t = f(z(t), θ_f). We parametrize this function using a neural net. Because f is time-invariant, given any latent state z(t), the entire latent trajectory is uniquely defined. Extrapolating this latent trajectory lets us make predictions arbitrarily far forwards or backwards in time.

Training and Prediction We can train this latent-variable model as a variational autoencoder (Kingma and Welling, 2014; Rezende et al., 2014), with sequence-valued observations. Our recognition net is an RNN, which consumes the data sequentially backwards in time, and outputs q_φ(z0 | x1, x2, ..., xN).
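The sampling procedure of Eqs. (11)–(13) can be sketched with hypothetical stand-ins for the learned pieces; the linear latent dynamics matrix and linear-Gaussian decoder below are illustrative placeholders for the paper's neural networks:

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)

# Stand-ins for learned components (illustrative, not the paper's networks).
A = np.array([[-0.1, 1.0],
              [-1.0, -0.1]])      # latent dynamics: f(z; θ_f) = A z
C = np.array([[1.0, 0.0]])        # decoder mean: x = C z
sigma_x = 0.1                     # decoder noise scale

t_obs = np.array([0.0, 0.3, 0.9, 1.4, 2.0])   # irregular observation times

# Eq. (11): sample the local initial state from the prior.
z_t0 = rng.standard_normal(2)

# Eq. (12): one ODE solve yields the latent state at every observation time.
sol = solve_ivp(lambda t, z: A @ z, (t_obs[0], t_obs[-1]), z_t0,
                t_eval=t_obs, rtol=1e-8)
z_path = sol.y.T                  # shape (N, 2): latent state per time point

# Eq. (13): decode each latent state into a noisy observation.
x = z_path @ C.T + sigma_x * rng.standard_normal((len(t_obs), 1))
```

Because the observation times enter only through `t_eval`, arbitrary (and arbitrarily irregular) time points are handled by the same single solver call.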
A detailed algorithm can be found in Appendix E. Using ODEs as a generative model allows us to make predictions for arbitrary time points t1, ..., tM on a continuous timeline.

Figure 6: Computation graph of the latent ODE model. An RNN encoder consumes the observations and outputs q(z_{t0} | x_{t0}, ..., x_{tN}); a sample z_{t0} is then evolved by ODESolve(z_{t0}, f, θ_f, t0, ..., tM) to produce z_{t1}, ..., z_{tM}, which are decoded into predictions x̂(t), including extrapolations beyond tN.

Poisson Process likelihoods The fact that an observation occurred often tells us something about the latent state. For example, a patient may be more likely to take a medical test if they are sick. The rate of events can be parameterized by a function of the latent state: p(event at time t | z(t)) = λ(z(t)). Given this rate function, the likelihood of a set of independent observation times in the interval [t_start, t_end] is given by an inhomogeneous Poisson process (Palm, 1943):

    log p(t1, ..., tN | t_start, t_end) = Σ_{i=1}^{N} log λ(z(ti)) − ∫_{t_start}^{t_end} λ(z(t)) dt

Figure 7: Fitting a latent ODE dynamics model with a Poisson process likelihood. Dots show event times. The line is the learned intensity λ(t) of the Poisson process.

We can parameterize λ(·) using another neural network.
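The Poisson term can be sketched by augmenting the latent state with the running integral Λ(t) = ∫ λ(z(s)) ds, so one solver call produces both the trajectory and the integral in the log-likelihood above. The rotation dynamics, softplus rate, and event times below are toy, hand-picked stand-ins:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy latent dynamics and a hand-picked rate function λ(z) = softplus(v·z).
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
v = np.array([1.0, 0.5])

def rate(z):
    return np.log1p(np.exp(v @ z))   # softplus keeps the intensity positive

# Augmented state s = [z, Λ], with dΛ/dt = λ(z(t)).
def aug(t, s):
    z = s[:2]
    return np.concatenate([A @ z, [rate(z)]])

z0 = np.array([1.0, 0.0])
t_start, t_end = 0.0, 4.0
events = np.array([0.5, 1.2, 2.7])   # hypothetical observed event times

sol = solve_ivp(aug, (t_start, t_end), np.concatenate([z0, [0.0]]),
                t_eval=np.concatenate([events, [t_end]]), rtol=1e-8)
log_intensities = np.log([rate(sol.y[:2, i]) for i in range(len(events))])
log_likelihood = log_intensities.sum() - sol.y[2, -1]  # Σ log λ(z(t_i)) − Λ(t_end)
```

The solver evaluates λ only where its step-size control requires, so the integral term costs no extra solver calls beyond the trajectory itself.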
Conveniently, we can evaluate both the latent trajectory and the Poisson process likelihood together in a single call to an ODE solver. Figure 7 shows the event rate learned by such a model on a toy dataset.

A Poisson process likelihood on observation times can be combined with a data likelihood to jointly model all observations and the times at which they were made.

5.1 Time-series Latent ODE Experiments

We investigate the ability of the latent ODE model to fit and extrapolate time series. The recognition network is an RNN with 25 hidden units. We use a 4-dimensional latent space. We parameterize the dynamics function f with a one-hidden-layer network with 20 hidden units. The decoder computing p(x_{ti} | z_{ti}) is another neural network with one hidden layer of 20 hidden units. Our baseline was a recurrent neural net with 25 hidden units trained to minimize negative Gaussian log-likelihood. We trained a second version of this RNN whose inputs were concatenated with the time difference to the next observation, to aid the RNN with irregular observations.

Bi-directional spiral dataset We generated a dataset of 1000 2-dimensional spirals, each starting at a different point, sampled at 100 equally-spaced timesteps. The dataset contains two types of spirals: half are clockwise while the other half are counter-clockwise. To make the task more realistic, we add Gaussian noise to the observations.

Table 2: Predictive RMSE on test set

    # Observations   30/100   50/100   100/100
    RNN              0.3937   0.3202   0.1813
    Latent ODE       0.1642   0.1502   0.1346

Time series with irregular time points To generate irregular timestamps, we randomly sample points from each trajectory without replacement (n = {30, 50, 100}).
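The data generation described above can be sketched as follows; the exact parametric form of the spirals is not specified in the text, so the slowly-growing-radius curve here is an illustrative guess:

```python
import numpy as np

rng = np.random.default_rng(0)

n_traj, n_t, noise = 1000, 100, 0.1
t = np.linspace(0.0, 4 * np.pi, n_t)          # 100 equally-spaced timesteps

trajectories = []
for i in range(n_traj):
    r0 = 0.5 + rng.uniform(0.0, 1.0)          # each spiral starts at a different point
    r = r0 + 0.05 * t                         # slowly growing radius (assumed form)
    sign = 1.0 if i < n_traj // 2 else -1.0   # half clockwise, half counter-clockwise
    xy = np.stack([r * np.cos(sign * t), r * np.sin(sign * t)], axis=1)
    trajectories.append(xy + noise * rng.standard_normal((n_t, 2)))
data = np.stack(trajectories)                 # shape (1000, 100, 2)

# Irregular time points: subsample n = 30 indices without replacement.
idx = np.sort(rng.choice(n_t, size=30, replace=False))
obs_t, obs_x = t[idx], data[:, idx, :]
```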
We report predictive root-mean-squared error (RMSE) on 100 time points extending beyond those that were used for training. Table 2 shows that the latent ODE has substantially lower predictive RMSE.

Figure 8 shows examples of spiral reconstructions with 30 sub-sampled points. Reconstructions from the latent ODE were obtained by sampling from the posterior over latent trajectories and decoding it to data-space. Examples with varying number of time points are shown in Appendix F. We observed that reconstructions and extrapolations are consistent with the ground truth regardless of the number of observed points and despite the noise.

Figure 9: Data-space trajectories decoded from varying one dimension of z_{t0}. Color indicates progression through time, starting at purple and ending at red. Note that the trajectories on the left are counter-clockwise, while the trajectories on the right are clockwise.

Latent space interpolation Figure 8c shows latent trajectories projected onto the first two dimensions of the latent space. The trajectories form two separate clusters, one decoding to clockwise spirals, the other to counter-clockwise. Figure 9 shows that the latent trajectories change smoothly as a function of the initial point z(t0), switching from a clockwise to a counter-clockwise spiral.

6 Scope and Limitations

Minibatching The use of mini-batches is less straightforward than for standard neural networks. One can still batch together evaluations through the ODE solver by concatenating the states of each batch element together, creating a combined ODE with dimension D × K. In some cases, controlling error on all batch elements together might require evaluating the combined system K times more often than if each system were solved individually.
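Concatenating batch elements into one combined system can be sketched as follows (the rotation dynamics are an arbitrary toy choice):

```python
import numpy as np
from scipy.integrate import solve_ivp

D, K = 2, 3                        # state dimension and batch size
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])        # toy per-example dynamics dz/dt = A z

def batched_f(t, s):
    # View the flat (D*K,) state as K independent D-dimensional systems.
    z = s.reshape(K, D)
    return (z @ A.T).reshape(-1)

# Three batch elements with different initial states, solved in one call.
z0 = np.stack([(k + 1) * np.array([1.0, 0.0]) for k in range(K)])
out = solve_ivp(batched_f, (0.0, 1.0), z0.reshape(-1), rtol=1e-8).y[:, -1]
z1 = out.reshape(K, D)
```

The solver's step-size control now applies to the concatenated D × K system as a whole, which is where the potential extra evaluations come from.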
However, in practice the number of evaluations did not increase substantially when using minibatches.

Uniqueness When do continuous dynamics have a unique solution? Picard's existence theorem (Coddington and Levinson, 1955) states that the solution to an initial value problem exists and is unique if the differential equation is uniformly Lipschitz continuous in z and continuous in t. This theorem holds for our model if the neural network has finite weights and uses Lipschitz nonlinearities, such as tanh or relu.

Figure 8: (a): Reconstruction and extrapolation of spirals with irregular time points by a recurrent neural network. (b): Reconstructions and extrapolations by a latent neural ODE. Blue curve shows model prediction. Red shows extrapolation. (c): A projection of inferred 4-dimensional latent ODE trajectories onto their first two dimensions. Color indicates the direction of the corresponding trajectory. The model has learned latent dynamics which distinguish the two directions.

Setting tolerances Our framework allows the user to trade off speed for precision, but requires the user to choose an error tolerance on both the forward and reverse passes during training. For sequence modeling, the default value of 1.5e-8 was used. In the classification and density estimation experiments, we were able to reduce the tolerance to 1e-3 and 1e-5, respectively, without degrading performance.

Reconstructing forward trajectories Reconstructing the state trajectory by running the dynamics backwards can introduce extra numerical error if the reconstructed trajectory diverges from the original.
This problem can be addressed by checkpointing: storing intermediate values of z on the forward pass, and reconstructing the exact forward trajectory by re-integrating from those points. We did not find this to be a practical problem, and we informally checked that reversing many layers of continuous normalizing flows with default tolerances recovered the initial states.

7 Related Work

The use of the adjoint method for training continuous-time neural networks was previously proposed (LeCun et al., 1988; Pearlmutter, 1995), though was not demonstrated practically. The interpretation of residual networks (He et al., 2016a) as approximate ODE solvers spurred research into exploiting reversibility and approximate computation in ResNets (Chang et al., 2017; Lu et al., 2017). We demonstrate these same properties in more generality by directly using an ODE solver.

Adaptive computation One can adapt computation time by training secondary neural networks to choose the number of evaluations of recurrent or residual networks (Graves, 2016; Jernite et al., 2016; Figurnov et al., 2017; Chang et al., 2018). However, this introduces overhead both at training and test time, and extra parameters that need to be fit. In contrast, ODE solvers offer well-studied, computationally cheap, and generalizable rules for adapting the amount of computation.

Constant memory backprop through reversibility Recent work developed reversible versions of residual networks (Gomez et al., 2017; Haber and Ruthotto, 2017; Chang et al., 2017), which gives the same constant memory advantage as our approach.
However, these methods require restricted architectures, which partition the hidden units. Our approach does not have these restrictions.

Learning differential equations   Much recent work has proposed learning differential equations from data. One can train feed-forward or recurrent neural networks to approximate a differential equation (Raissi and Karniadakis, 2018; Raissi et al., 2018a; Long et al., 2017), with applications such as fluid simulation (Wiewel et al., 2018). There is also significant work on connecting Gaussian processes (GPs) and ODE solvers (Schober et al., 2014). GPs have been adapted to fit differential equations (Raissi et al., 2018b) and can naturally model continuous-time effects and interventions (Soleimani et al., 2017b; Schulam and Saria, 2017). Ryder et al. (2018) use stochastic variational inference to recover the solution of a given stochastic differential equation.

Differentiating through ODE solvers   The dolfin library (Farrell et al., 2013) implements adjoint computation for general ODE and PDE solutions, but only by backpropagating through the individual operations of the forward solver. The Stan library (Carpenter et al., 2015) implements gradient estimation through ODE solutions using forward sensitivity analysis. However, forward sensitivity analysis is quadratic-time in the number of variables, whereas adjoint sensitivity analysis is linear (Carpenter et al., 2015; Zhang and Sandu, 2014). Melicher et al. (2017) used the adjoint method to train bespoke latent dynamic models.

In contrast, by providing a generic vector-Jacobian product, we allow an ODE solver to be trained end-to-end with any other differentiable model components.
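To make the role of the vector-Jacobian product concrete, here is a toy scalar sketch of adjoint-based gradient computation (an illustrative simplification, not the paper's implementation): for dz/dt = θz with loss L = z(T), the adjoint a(t) = dL/dz(t) obeys da/dt = -a ∂f/∂z and is integrated backwards alongside z while dL/dθ is accumulated. The closed-form partials below stand in for what an autodiff framework would supply via a single vjp call.

```python
import math

def adjoint_grad(theta, z0, T, n_steps):
    """dL/dtheta for dz/dt = theta * z with loss L = z(T), via the adjoint method."""
    f      = lambda z: theta * z   # dynamics
    df_dz  = lambda z: theta       # in general: one vjp with respect to z
    df_dth = lambda z: z           # in general: one vjp with respect to parameters

    h = T / n_steps
    z = z0
    for _ in range(n_steps):       # forward pass (plain Euler for brevity)
        z = z + h * f(z)

    a, grad = 1.0, 0.0             # dL/dz(T) = 1 since L = z(T)
    for _ in range(n_steps):       # backward pass: da/dt = -a * df/dz
        grad += h * a * df_dth(z)
        a    += h * a * df_dz(z)
        z    -= h * f(z)           # run the state backwards alongside the adjoint
    return grad

# Analytic check: z(T) = z0 * exp(theta*T), so dL/dtheta = z0 * T * exp(theta*T).
g = adjoint_grad(0.5, 1.0, 1.0, 2000)
# g is close to math.exp(0.5)
```

The backward loop touches the dynamics only through products of the adjoint with partial derivatives of f, which is why a black-box solver plus a vjp suffices for end-to-end training.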
While use of vector-Jacobian products for solving the adjoint method has been explored in optimal control (Andersson, 2013; Andersson et al., 2018), we highlight the potential of a general integration of black-box ODE solvers into automatic differentiation (Baydin et al., 2018) for deep learning and generative modeling.

8 Conclusion

We investigated the use of black-box ODE solvers as a model component, developing new models for time-series modeling, supervised learning, and density estimation. These models are evaluated adaptively, and allow explicit control of the tradeoff between computation speed and accuracy. Finally, we derived an instantaneous version of the change of variables formula, and developed continuous-time normalizing flows, which can scale to large layer sizes.

9 Acknowledgements

We thank Wenyi Wang and Geoff Roeder for help with proofs, and Daniel Duckworth, Ethan Fetaya, Hossein Soleimani, Eldad Haber, Ken Caluwaerts, and Daniel Flam-Shepherd for feedback. We thank Chris Rackauckas, Dougal Maclaurin, and Matthew James Johnson for helpful discussions.

References

Mauricio A Álvarez and Neil D Lawrence. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12(May):1459–1500, 2011.

Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pages 136–145, 2017.

Joel Andersson. A general-purpose software framework for dynamic optimization. PhD thesis, 2013.

Joel A E Andersson, Joris Gillis, Greg Horn, James B Rawlings, and Moritz Diehl. CasADi – A software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, In Press, 2018.

Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey.
Journal of Machine Learning Research, 18(153):1–153, 2018.

Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

Bob Carpenter, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. The Stan math library: Reverse-mode automatic differentiation in C++. arXiv preprint arXiv:1509.07164, 2015.

Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. arXiv preprint arXiv:1709.03698, 2017.

Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks from dynamical systems view. In International Conference on Learning Representations, 2018.

Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018.

Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. Doctor AI: Predicting clinical events via recurrent neural networks. In Proceedings of the 1st Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, pages 301–318. PMLR, 18–19 Aug 2016.
Earl A Coddington and Norman Levinson. Theory of ordinary differential equations. Tata McGraw-Hill Education, 1955.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In International Conference on Knowledge Discovery and Data Mining, pages 1555–1564. ACM, 2016.

Patrick Farrell, David Ham, Simon Funke, and Marie Rognes. Automated derivation of the adjoint of high-level transient finite element programs. SIAM Journal on Scientific Computing, 2013.

Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. arXiv preprint, 2017.

J. Futoma, S. Hariharan, and K. Heller. Learning to Detect Sepsis with a Multitask Gaussian Process RNN Classifier. ArXiv e-prints, 2017.

Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pages 2211–2221, 2017.

Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.

David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.

Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.

E. Hairer, S.P. Nørsett, and G. Wanner.
Solving Ordinary Differential Equations I – Nonstiff Problems. Springer, 1987.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016b.

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent, 2012.

Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. arXiv preprint arXiv:1611.06188, 2016.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.

Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

W. Kutta. Beitrag zur näherungsweisen Integration totaler Differentialgleichungen. Zeitschrift für Mathematik und Physik, 46:435–453, 1901.

Yann LeCun, D Touresky, G Hinton, and T Sejnowski. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, volume 1, pages 21–28. Morgan Kaufmann, 1988.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Yang Li. Time-dependent representation for neural event sequence prediction.
arXiv preprint arXiv:1708.00065, 2017.

Zachary C Lipton, David Kale, and Randall Wetzel. Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series. In Proceedings of the 1st Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, pages 253–270. PMLR, 18–19 Aug 2016.

Z. Long, Y. Lu, X. Ma, and B. Dong. PDE-Net: Learning PDEs from Data. ArXiv e-prints, 2017.

Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017.

Dougal Maclaurin, David Duvenaud, and Ryan P Adams. Autograd: Reverse-mode differentiation of native Python. In ICML Workshop on Automatic Machine Learning, 2015.

Hongyuan Mei and Jason M Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pages 6757–6767, 2017.

Valdemar Melicher, Tom Haber, and Wim Vanroose. Fast derivatives of likelihood functionals for ODE based models using adjoint-state method. Computational Statistics, 32(4):1621–1643, 2017.

Conny Palm. Intensitätsschwankungen im Fernsprechverkehr. Ericsson Technics, 1943.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

Barak A Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey.
IEEE Transactions on Neural Networks, 6(5):1212–1228, 1995.

Lev Semenovich Pontryagin, E.F. Mishchenko, V.G. Boltyanskii, and R.V. Gamkrelidze. The mathematical theory of optimal processes. 1962.

M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, pages 125–141, 2018.

Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Multistep neural networks for data-driven discovery of nonlinear dynamical systems. arXiv preprint arXiv:1801.01236, 2018a.

Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Numerical Gaussian processes for time-dependent and nonlinear partial differential equations. SIAM Journal on Scientific Computing, 40(1):A172–A198, 2018b.

Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

C. Runge. Über die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46:167–178, 1895.

Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. arXiv preprint arXiv:1804.04272, 2018.

T. Ryder, A. Golightly, A. S. McGough, and D. Prangle. Black-box Variational Inference for Stochastic Differential Equations. ArXiv e-prints, 2018.

Michael Schober, David Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with Runge-Kutta means. In Advances in Neural Information Processing Systems 25, 2014.

Peter Schulam and Suchi Saria. What-if reasoning with counterfactual Gaussian processes. arXiv preprint arXiv:1703.10651, 2017.

Hossein Soleimani, James Hensman, and Suchi Saria.
Scalable joint models for reliable uncertainty-aware event prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017a.

Hossein Soleimani, Adarsh Subbaswamy, and Suchi Saria. Treatment-response models for counterfactual reasoning with continuous-time, continuous-valued interventions. arXiv preprint arXiv:1704.02038, 2017b.

Jos Stam. Stable fluids. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 121–128. ACM Press/Addison-Wesley Publishing Co., 1999.

Paul Stapor, Fabian Froehlich, and Jan Hasenauer. Optimization and uncertainty analysis of ODE models using second order adjoint sensitivity analysis. bioRxiv, page 272005, 2018.

Jakub M Tomczak and Max Welling. Improving variational auto-encoders using Householder flow. arXiv preprint arXiv:1611.09630, 2016.

Steffen Wiewel, Moritz Becher, and Nils Thuerey. Latent-space physics: Towards learning the temporal evolution of fluid flow. arXiv preprint arXiv:1802.10123, 2018.

Hong Zhang and Adrian Sandu. FATODE: A library for forward, adjoint, and tangent linear integration of ODEs. SIAM Journal on Scientific Computing, 36(5):C504–C523, 2014.