{"title": "Augmented Neural ODEs", "book": "Advances in Neural Information Processing Systems", "page_first": 3140, "page_last": 3150, "abstract": "We show that Neural Ordinary Differential Equations (ODEs) learn representations that preserve the topology of the input space and prove that this implies the existence of functions Neural ODEs cannot represent. To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.", "full_text": "Augmented Neural ODEs\n\nEmilien Dupont\n\nUniversity of Oxford\n\ndupont@stats.ox.ac.uk\n\nArnaud Doucet\n\nUniversity of Oxford\n\ndoucet@stats.ox.ac.uk\n\nYee Whye Teh\n\nUniversity of Oxford\n\ny.w.teh@stats.ox.ac.uk\n\nAbstract\n\nWe show that Neural Ordinary Differential Equations (ODEs) learn representa-\ntions that preserve the topology of the input space and prove that this implies the\nexistence of functions Neural ODEs cannot represent. To address these limita-\ntions, we introduce Augmented Neural ODEs which, in addition to being more\nexpressive models, are empirically more stable, generalize better and have a lower\ncomputational cost than Neural ODEs.\n\n1\n\nIntroduction\n\nThe relationship between neural networks and differential equations has been studied in several recent\nworks (Weinan, 2017; Lu et al., 2017; Haber & Ruthotto, 2017; Ruthotto & Haber, 2018; Chen et al.,\n2018). In particular, it has been shown that Residual Networks (He et al., 2016) can be interpreted as\ndiscretized ODEs. Taking the discretization step to zero gives rise to a family of models called Neural\nODEs (Chen et al., 2018). 
These models can be efficiently trained with backpropagation and have shown great promise on a number of tasks including modeling continuous time data and building normalizing flows with low computational cost (Chen et al., 2018; Grathwohl et al., 2018).\nIn this work, we explore some of the consequences of taking this continuous limit and the restrictions this might create compared with regular neural nets. In particular, we show that there are simple classes of functions Neural ODEs (NODEs) cannot represent. While it is often possible for NODEs to approximate these functions in practice, the resulting flows are complex and lead to ODE problems that are computationally expensive to solve. To overcome these limitations, we introduce Augmented Neural ODEs (ANODEs) which are a simple extension of NODEs. ANODEs augment the space on which the ODE is solved, allowing the model to use the additional dimensions to learn more complex functions using simpler flows (see Fig. 1). In addition to being more expressive models, ANODEs significantly reduce the computational cost of both forward and backward passes of the model compared with NODEs. Our experiments also show that ANODEs generalize better, achieve lower losses with fewer parameters and are more stable to train.\n\nFigure 1: Learned flows for a Neural ODE (left) and an Augmented Neural ODE (right). The flows (shown as lines with arrows) map input points to linearly separable features for binary classification. Augmented Neural ODEs learn simpler flows that are easier for the ODE solver to compute.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Neural ODEs\n\nNODEs are a family of deep neural network models that can be interpreted as a continuous equivalent of Residual Networks (ResNets). 
To see this, consider the transformation of a hidden state from a layer t to layer t + 1 in ResNets,\n\nh_{t+1} = h_t + f_t(h_t),\n\nwhere h_t \u2208 R^d is the hidden state at layer t and f_t : R^d \u2192 R^d is some differentiable function which preserves the dimension of h_t (typically a CNN). The difference h_{t+1} \u2212 h_t can be interpreted as a discretization of the derivative h'(t) with timestep \u2206t = 1. Letting \u2206t \u2192 0, we see that\n\nlim_{\u2206t \u2192 0} (h_{t+\u2206t} \u2212 h_t) / \u2206t = dh(t)/dt = f(h(t), t),\n\nso the hidden state can be parameterized by an ODE. We can then map a data point x into a set of features \u03c6(x) by solving the Initial Value Problem (IVP)\n\ndh(t)/dt = f(h(t), t), h(0) = x\n\nto some time T. The hidden state at time T, i.e. h(T), corresponds to the features learned by the model. The analogy with ResNets can then be made more explicit. In ResNets, we map an input x to some output y by a forward pass of the neural network. We then adjust the weights of the network to match y with some y_true. In NODEs, we map an input x to an output y by solving an ODE starting from x. We then adjust the dynamics of the system (encoded by f) such that the ODE transforms x to a y which is close to y_true.\nODE flows. We also define the flow associated to the vector field f(h(t), t) of the ODE. The flow \u03c6_t : R^d \u2192 R^d is defined as the hidden state at time t, i.e. \u03c6_t(x) = h(t), when solving the ODE from the initial condition h(0) = x. The flow measures how the states of the ODE at a given time t depend on the initial conditions x. We define the features of the ODE as \u03c6(x) := \u03c6_T(x), i.e. the flow at the final time T to which we solve the ODE.\nNODEs for regression and classification. 
We can use ODEs to map\ninput data x \u2208 Rd to a set of features or representations \u03c6(x) \u2208 Rd.\nHowever, we are often interested in learning functions from Rd to R,\ne.g. for regression or classi\ufb01cation. To de\ufb01ne a model from Rd to R, we follow the example given\nin Lin & Jegelka (2018) for ResNets. We de\ufb01ne the NODE g : Rd \u2192 R as g(x) = L(\u03c6(x)) where\nL : Rd \u2192 R is a linear map and \u03c6 : Rd \u2192 Rd is the mapping from data to features. As shown in Fig.\n2, this is a simple model architecture: an ODE layer, followed by a linear layer.\n\nFigure 2: Diagram of Neural\nODE architecture.\n\n3 A simple example in 1d\n\nIn this section, we introduce a simple function that ODE \ufb02ows cannot represent, motivating many of\nthe examples discussed later. Let g1d : R \u2192 R be a function such that g1d(\u22121) = 1 and g1d(1) = \u22121.\nProposition 1. The \ufb02ow of an ODE cannot represent g1d(x).\n\nA detailed proof is given in the appendix. The intuition behind the proof is simple; the trajectories\nmapping \u22121 to 1 and 1 to \u22121 must intersect each other (see Fig. 3). However, ODE trajectories\ncannot cross each other, so the \ufb02ow of an ODE cannot represent g1d(x). This simple observation is at\nthe core of all the examples provided in this paper and forms the basis for many of the limitations of\nNODEs.\nExperiments. We verify this behavior experimentally by training an ODE \ufb02ow on the identity\nmapping and on g1d(x). The resulting \ufb02ows are shown in Fig. 3. As can be seen, the model easily\nlearns the identity mapping but cannot represent g1d(x). Indeed, since the trajectories cannot cross,\nthe model maps all input points to zero to minimize the mean squared error.\nResNets vs NODEs. NODEs can be interpreted as continuous equivalents of ResNets, so it is\ninteresting to consider why ResNets can represent g1d(x) but NODEs cannot. 
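The correspondence between ResNets and ODEs from Section 2 can be checked numerically. In the sketch below, a fixed toy dynamics function stands in for the learned network (an assumption for illustration): a residual update is exactly one explicit-Euler step with \u2206t = 1, while many small Euler steps approach the exact ODE solution.

```python
import math

# A single ResNet residual update h_{t+1} = h_t + f_t(h_t) is exactly one
# explicit-Euler step of dh/dt = f(h, t) with step size Delta_t = 1.
# f below is a fixed toy dynamics function (an assumption for illustration;
# in a Neural ODE, f is a learned network).
def f(h, t):
    return -0.5 * h            # toy linear dynamics dh/dt = -h/2

def resnet_update(h, t):
    return h + f(h, t)         # residual block = Euler step with dt = 1

def euler_step(h, t, dt):
    return h + dt * f(h, t)    # explicit Euler with step size dt

h0 = 2.0
assert resnet_update(h0, 0.0) == euler_step(h0, 0.0, dt=1.0)

# Many small Euler steps approach the exact solution h(T) = h0 * exp(-T/2);
# the coarse one-step "ResNet" update incurs a large discretization error.
h, dt = h0, 0.001
for k in range(1000):          # integrate to T = 1
    h = euler_step(h, k * dt, dt)
assert abs(h - h0 * math.exp(-0.5)) < 1e-3
```

The large per-step error of the coarse update is exactly what lets ResNet trajectories make the discrete jumps discussed next.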
The reason for this is exactly because ResNets are a discretization of the ODE, allowing the trajectories to make discrete jumps to cross each other (see Fig. 3). Indeed, the error arising when taking discrete steps allows the ResNet trajectories to cross. In this sense, ResNets can be interpreted as ODE solutions with large errors, with these errors allowing them to represent more functions.\n\nFigure 3: (Top left) Continuous trajectories mapping \u22121 to 1 (red) and 1 to \u22121 (blue) must intersect each other, which is not possible for an ODE. (Top right) Solutions of the ODE are shown in solid lines and solutions using the Euler method (which corresponds to ResNets) are shown in dashed lines. As can be seen, the discretization error allows the trajectories to cross. (Bottom) Resulting vector fields and trajectories from training on the identity function (left) and g1d(x) (right).\n\n4 Functions Neural ODEs cannot represent\n\nWe now introduce classes of functions in arbitrary dimension d which NODEs cannot represent. Let 0 < r1 < r2 < r3 and let g : R^d \u2192 R be a function such that\n\ng(x) = \u22121 if ||x|| \u2264 r1, and g(x) = 1 if r2 \u2264 ||x|| \u2264 r3,\n\nwhere || \u00b7 || is the Euclidean norm. An illustration of this function for d = 2 is shown in Fig. 4. The function maps all points inside the blue sphere to \u22121 and all points in the red annulus to 1.\nProposition 2. Neural ODEs cannot represent g(x).\n\nFigure 4: Diagram of g(x) for d = 2.\n\nA proof is given in the appendix. While the proof requires tools from ODE theory and topology, the intuition behind it is simple. In order for the linear layer to map the blue and red points to \u22121 and 1 respectively, the features \u03c6(x) for the blue and red points must be linearly separable. 
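The function g defined above can be written out directly. The concrete radii below are illustrative choices of ours; the paper only requires 0 < r1 < r2 < r3 and leaves g unspecified elsewhere.

```python
import math

# The function g from Section 4, for radii 0 < r1 < r2 < r3.
# The specific values are illustrative, not from the paper.
r1, r2, r3 = 1.0, 2.0, 3.0

def g(x):
    """Map points in the inner ball to -1 and points in the annulus to +1.

    Returns None where the paper leaves g unspecified
    (the gap r1 < ||x|| < r2 and the region ||x|| > r3).
    """
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= r1:
        return -1
    if r2 <= norm <= r3:
        return 1
    return None

assert g((0.5, 0.0)) == -1    # inside the blue ball
assert g((0.0, 2.5)) == 1     # inside the red annulus
assert g((1.5, 0.0)) is None  # unspecified gap between r1 and r2
```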
Since the blue region is enclosed by the red region, points in the blue region must cross over the red region to become linearly separable, requiring the trajectories to intersect, which is not possible. In fact, we can make more general statements about which features Neural ODEs can learn.\nProposition 3. The feature mapping \u03c6(x) is a homeomorphism, so the features of Neural ODEs preserve the topology of the input space.\n\nA proof is given in the appendix. This statement is a consequence of the flow of an ODE being a homeomorphism, i.e. a continuous bijection whose inverse is also continuous; see, e.g., (Younes, 2010). This implies that NODEs can only continuously deform the input space and cannot, for example, tear a connected region apart.\nDiscrete points and continuous regions. It is worthwhile to consider what these results mean in practice. Indeed, when optimizing NODEs we train on inputs which are sampled from the continuous regions of the annulus and the sphere (see Fig. 4). The flow could then squeeze through the gaps between sampled points, making it possible for the NODE to learn a good approximation of the function. However, flows that need to stretch and squeeze the input space in such a way are likely to lead to ill-posed ODE problems that are numerically expensive to solve. In order to explore this, we run a number of experiments (the code to reproduce all experiments in this paper is available at https://github.com/EmilienDupont/augmented-neural-odes).\n\nFigure 5: Comparison of training losses of NODEs and ResNets on (a) g(x) in d = 1, (b) g(x) in d = 2, and (c) a separable function in d = 2. Compared to ResNets, NODEs struggle to fit g(x) both in d = 1 and d = 2. The difference between ResNets and NODEs is less pronounced for the separable function.\n\n4.1 Experiments\n\nWe first compare the performance of ResNets and NODEs on simple regression tasks. 
To provide a\nbaseline, we not only train on g(x) but also on data which can be made linearly separable without\naltering the topology of the space (implying that Neural ODEs should be able to easily learn this\nfunction). To ensure a fair comparison, we run large hyperparameter searches for each model and\nrepeat each experiment 20 times to ensure results are meaningful across initializations (see appendix\nfor details). We show results for experiments with d = 1 and d = 2 in Fig. 5. For d = 1, the ResNet\neasily \ufb01ts the function, while the NODE cannot approximate g(x). For d = 2, the NODE eventually\nlearns to approximate g(x), but struggles compared to ResNets. This problem is less severe for the\nseparable function, presumably because the \ufb02ow does not need to break apart any regions to linearly\nseparate them.\n\n4.2 Computational Cost and Number of Function Evaluations\n\nOne of the known limitations of NODEs is that, as training progresses and the \ufb02ow gets increasingly\ncomplex, the number of steps required to solve the ODE increases (Chen et al., 2018; Grathwohl\net al., 2018). As the ODE solver evaluates the function f at each step, this problem is often referred\nto as the increasing number of function evaluations (NFE). In Fig. 6, we visualize the evolution of\nthe feature space during training and the corresponding NFEs. The NODE initially tries to move the\ninner sphere out of the annulus by pushing against and stretching the barrier. Eventually, since we are\nmapping discrete points and not a continuous region, the \ufb02ow is able to break apart the annulus to let\nthe \ufb02ow through. However, this results in a large increase in NFEs, implying that the ODE stretching\nthe space to separate the two regions becomes more dif\ufb01cult to solve, making the computation slower.\n\nFigure 6: Evolution of the feature space as training progresses and the corresponding number of\nfunction evaluations required to solve the ODE. 
As the ODE needs to break apart the annulus, the number of function evaluations increases.\n\n5 Augmented Neural ODEs\n\nMotivated by our theory and experiments, we introduce Augmented Neural ODEs (ANODEs) which provide a simple solution to the problems we have discussed. We augment the space on which we learn and solve the ODE from R^d to R^{d+p}, allowing the ODE flow to lift points into the additional dimensions to avoid trajectories intersecting each other. Letting a(t) \u2208 R^p denote a point in the augmented part of the space, we can formulate the augmented ODE problem as\n\nd/dt [h(t); a(t)] = f([h(t); a(t)], t), [h(0); a(0)] = [x; 0],\n\ni.e. we concatenate every data point x with a vector of zeros and solve the ODE on this augmented space. We hypothesize that this will also make the learned (augmented) f smoother, giving rise to simpler flows that the ODE solver can compute in fewer steps. In the following sections, we verify this behavior experimentally and show both on toy and image datasets that ANODEs achieve lower losses, better generalization and lower computational cost than regular NODEs.\n\n5.1 Experiments\n\nWe first compare the performance of NODEs and ANODEs on toy datasets. As in previous experiments, we run large hyperparameter searches to ensure a fair comparison. As can be seen in Fig. 7, when trained on g(x) in different dimensions, ANODEs are able to fit the functions NODEs cannot and learn much faster than NODEs despite the increased dimension of the input. The corresponding flows learned by the model are shown in Fig. 7. As can be seen, in d = 1, the ANODE moves into a higher dimension to linearly separate the points, resulting in a simple, nearly linear flow. Similarly, in d = 2, the NODE learns a complicated flow whereas ANODEs simply lift out the inner circle to separate the data. 
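With one extra dimension, the flow needed to represent g1d from Section 3 becomes almost trivial: lift the points and rotate them around each other. A minimal sketch under assumptions of ours (a hand-picked rotation dynamics and a fixed-step Euler solver, in place of the learned f and adaptive solver used in the paper):

```python
import math

# Sketch of the augmented IVP: concatenate the input with p zeros and solve
# d/dt [h(t); a(t)] = f([h(t); a(t)], t) on the augmented space R^{d+p}.
# Both the dynamics f (a fixed rotation) and the solver (plain Euler) are
# illustrative stand-ins for the learned f and adaptive solver in the paper.
def f(state, t):
    h, a = state               # d = 1 data dimension, p = 1 augmented dimension
    return [-a, h]             # rotate (h, a): trajectories move around each other

def solve_augmented(x, T, steps=1000, p=1):
    state = [x] + [0.0] * p    # h(0) = x, a(0) = 0
    dt = T / steps
    for k in range(steps):
        dstate = f(state, k * dt)
        state = [s + dt * ds for s, ds in zip(state, dstate)]
    return state

# By time T = pi the rotation maps 1 -> -1 and -1 -> 1: the map g1d from
# Section 3, which no unaugmented 1d ODE flow can represent.
out_pos = solve_augmented(1.0, T=math.pi)
out_neg = solve_augmented(-1.0, T=math.pi)
assert abs(out_pos[0] + 1.0) < 1e-2
assert abs(out_neg[0] - 1.0) < 1e-2
```

This mirrors the simple, nearly linear flows ANODEs learn in Fig. 7: in the augmented space the two trajectories never intersect.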
This effect can also be visualized as the features evolve during training (see Fig. 8).\nComputational cost and number of function evaluations. As ANODEs learn simpler flows, they would presumably require fewer iterations to compute. To test this, we measure the NFEs for NODEs and ANODEs when training on g(x). As can be seen in Fig. 8, the NFEs required by ANODEs hardly increase during training while they nearly double for NODEs. We obtain similar results when training NODEs and ANODEs on image datasets (see Section 5.2).\n\nFigure 7: (Left) Loss plots for NODEs and ANODEs trained on g(x) in d = 1 (top) and d = 2 (bottom). ANODEs easily approximate the functions and are consistently faster than NODEs. (Right) Flows learned by NODEs and ANODEs. ANODEs learn simple, nearly linear flows while NODEs learn complex flows that are difficult for the ODE solver to compute.\n\nFigure 8: (Left) Evolution of features during training for ANODEs. The top left tile shows the feature space for a randomly initialized ANODE and the bottom right tile shows the features after training. (Right) Evolution of the NFEs during training for NODEs and ANODEs trained on g(x) in d = 1.\n\nGeneralization. As ANODEs learn simpler flows, we also hypothesize that they generalize better to unseen data than NODEs. To test this, we first visualize to which value each point in the input space gets mapped by a NODE and an ANODE that have been optimized to approximately zero training loss. As can be seen in Fig. 9, since NODEs can only continuously deform the input space, the learned flow must squeeze the points in the inner circle through the annulus, leading to poor generalization. ANODEs, in contrast, map all points in the input space to reasonable values. As a further test, we can also create a validation set by removing random slices of the input space (e.g. 
removing all points whose angle is in [0, \u03c0/5]) from the training set. We train both NODEs and ANODEs on the training set and plot the evolution of the validation loss during training in Fig. 9. While there is a large generalization gap for NODEs, presumably because the flow moves through the gaps in the training set, ANODEs generalize much better and achieve near zero validation loss.\nAs we have shown experimentally, we obtain lower losses, simpler flows, better generalization and ODEs requiring fewer NFEs to solve when using ANODEs. We now test this behavior on image data by training models on MNIST, CIFAR10, SVHN and 200 classes of 64 \u00d7 64 ImageNet.\n\n5.2 Image Experiments\n\nWe perform experiments on image datasets using convolutional architectures for f(h(t), t). As the input x is an image, the hidden state h(t) is now in R^{c\u00d7h\u00d7w} where c is the number of channels and h and w are the height and width respectively. In the case where h(t) \u2208 R^d we augmented the space to h(t) \u2208 R^{d+p}. For images we augment the space as R^{c\u00d7h\u00d7w} \u2192 R^{(c+p)\u00d7h\u00d7w}, i.e. we add p channels of zeros to the input image. While there are other ways to augment the space, we found that increasing the number of channels works well in practice and use this method for all experiments. Full training and architecture details can be found in the appendix.\nResults for models trained with and without augmentation are shown in Fig. 10. As can be seen, ANODEs train faster and obtain lower losses at a smaller computational cost than NODEs. On MNIST, for example, ANODEs with 10 augmented channels achieve the same loss in roughly 10 times fewer iterations (for CIFAR10, ANODEs are roughly 5 times faster). Perhaps most interestingly, we can plot the NFEs against the loss to understand roughly how complex a flow (i.e. how many NFEs) is required to model a function that achieves a certain loss. 
For example, to compute a function which obtains a loss of 0.8 on CIFAR10, a NODE requires approximately 100 function evaluations whereas ANODEs only require 50. Similar observations can be made for other datasets, implying that ANODEs can model equally rich functions at half the computational cost of NODEs.\n\nFigure 9: (Left) Plots of how NODEs and ANODEs map points in the input space to different outputs (both models achieve approximately zero training loss). As can be seen, the ANODE generalizes better. (Middle) Training and validation losses for NODE. (Right) Training and validation losses for ANODE.\n\nFigure 10: Training losses, NFEs and NFEs vs Loss for various augmented models on MNIST (top row) and CIFAR10 (bottom row). Note that p indicates the size of the augmented dimension, so p = 0 corresponds to a regular NODE model. Further results on SVHN and 64 \u00d7 64 ImageNet can be found in the appendix.\n\nParameter efficiency. As we augment the dimension of the ODEs, we also increase the number of parameters of the models, so it may be that the improved performance of ANODEs is due to the higher number of parameters. To test this, we train a NODE and an ANODE with the same number of parameters on MNIST (84k weights), CIFAR10 (172k weights), SVHN (172k weights) and 64 \u00d7 64 ImageNet (366k weights). We find that the augmented model achieves significantly lower losses with fewer NFEs than the NODE, suggesting that ANODEs use the parameters more efficiently than NODEs (see appendix for details and results). For all subsequent experiments, we use NODEs and ANODEs with the same number of parameters.\nNFEs and weight decay. The increased computational cost during training is a known issue with NODEs and has previously been tackled by adding weight decay (Grathwohl et al., 2018). 
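The channel augmentation used for images above, R^{c\u00d7h\u00d7w} \u2192 R^{(c+p)\u00d7h\u00d7w}, amounts to appending p all-zero channels to the input. A minimal pure-Python sketch, using nested lists in place of tensors (the function name is ours, not from the paper's code):

```python
# Image augmentation: append p channels of zeros, taking an image in
# R^{c x h x w} to R^{(c+p) x h x w}. Nested lists stand in for tensors.
def augment_channels(image, p):
    """image: list of c channels, each an h x w list of rows."""
    if not image:
        raise ValueError("image must have at least one channel")
    h, w = len(image[0]), len(image[0][0])
    zeros = [[[0.0] * w for _ in range(h)] for _ in range(p)]
    return image + zeros

img = [[[1.0, 2.0], [3.0, 4.0]]]            # c = 1, h = 2, w = 2
aug = augment_channels(img, p=2)
assert len(aug) == 3                        # c + p channels
assert aug[0] == [[1.0, 2.0], [3.0, 4.0]]   # data channels unchanged
assert aug[1] == [[0.0, 0.0], [0.0, 0.0]]   # appended channels are zero
```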
As ANODEs also achieve lower computational cost, we test models with various combinations of weight decay and augmentation (see appendix for detailed results). We find that ANODEs without weight decay significantly outperform NODEs with weight decay. However, using both weight decay and augmentation achieves the lowest NFEs at the cost of a slightly higher loss. Combining augmentation with weight decay may therefore be a fruitful avenue for further scaling NODE models.\nAccuracy. Fig. 11 shows training accuracy against NFEs for ANODEs and NODEs on MNIST, CIFAR10 and SVHN. As expected, ANODEs achieve higher accuracy at a lower computational cost than NODEs (similar results hold for ImageNet as shown in the appendix).\nGeneralization for images. As noted in Section 5.1, ANODEs generalize better than NODEs on simple datasets, presumably because they learn simpler and smoother flows. We also test this behavior on image datasets by training models with and without augmentation on the training set and calculating the loss and accuracy on the test set. As can be seen in Fig. 12 and Table 1, for MNIST, CIFAR10, and SVHN, ANODEs achieve lower test losses and higher test accuracies than NODEs, suggesting that ANODEs also generalize better on image datasets.\n\nFigure 11: Accuracy vs NFEs for MNIST (left), CIFAR10 (middle) and SVHN (right).\n\nFigure 12: Test accuracy (left) and loss (right) for MNIST (top) and CIFAR10 (bottom).\n\nDataset    NODE           ANODE\nMNIST      96.4% \u00b1 0.5   98.2% \u00b1 0.1\nCIFAR10    53.7% \u00b1 0.2   60.6% \u00b1 0.4\nSVHN       81.0% \u00b1 0.6   83.5% \u00b1 0.5\n\nTable 1: Test accuracies and their standard deviation over 5 runs on various image datasets.\n\nStability. While experimenting with NODEs we found that the NFEs could often become prohibitively large (in excess of 1000, which roughly corresponds to a 1000-layer ResNet). 
For example, when overfitting a NODE on MNIST, the learned flow can become so ill-posed that the ODE solver requires timesteps smaller than machine precision, resulting in underflow. Further, this complex flow often leads to unstable training and exploding losses. As shown in Fig. 13, augmentation consistently leads to stable training and fewer NFEs, even when overfitting.\nScaling. To measure how well the models scale to larger datasets, we train NODEs and ANODEs on 200 classes of 64 \u00d7 64 ImageNet. As can be seen in Fig. 14, ANODEs scale better, achieve lower losses and train almost 10 times faster than NODEs.\n\n5.3 Relation to other models\n\nIn this section, we discuss how ANODEs compare to other relevant models.\nNeural ODEs. Different architectures exist for training NODEs on images. Chen et al. (2018), for example, downsample MNIST twice with regular convolutions before applying a sequence of repeated ODE flows. These initial convolutions can be understood as implicitly augmenting the space (since they increase the number of channels). While combining NODEs with convolutions alleviates the representational limitations of NODEs, it also results in most of the attractive properties of NODEs being lost (including invertibility, the ability to query the state at any timestep, cheap Jacobian computations in normalizing flows and a reduced number of parameters). In contrast, ANODEs overcome the representational weaknesses of NODEs while maintaining all their attractive properties.\n\nFigure 13: Instabilities in the loss (left) and NFEs (right) when fitting NODEs to MNIST. In the latter stages of training, NODEs can become unstable and the loss and NFEs become erratic.\n\nFigure 14: Accuracy (left) and loss (right) on 64 \u00d7 64 ImageNet for NODEs and ANODEs.\n\nResNets. 
Since ResNets can be interpreted as discretized equivalents of NODEs, it is interesting\nto consider how augmenting the space could affect the training of ResNets. Indeed, most ResNet\narchitectures (He et al., 2016; Xie et al., 2017; Zagoruyko & Komodakis, 2016) already employ a\nform of augmentation by performing convolutions with a large number of \ufb01lters before applying\nresidual blocks. This effectively corresponds to augmenting the space by the number of \ufb01lters minus\nthe number of channels in the original image. Further, Behrmann et al. (2018) and Ardizzone et al.\n(2018) also augment the input with zeros to build invertible ResNets and transformations. Our\n\ufb01ndings in the continuous case are consistent with theirs: augmenting the input with zeros improves\nperformance. However, an important consequence of using augmentation for NODEs is the reduced\ncomputational cost, which does not have an analogy in ResNets.\nNormalizing Flows. Similarly to NODEs, several models used for normalizing \ufb02ows, such as\nRealNVP (Dinh et al., 2016), MAF (Papamakarios et al., 2017) and Glow (Kingma & Dhariwal,\n2018) are homeomorphisms. The results presented in this paper may therefore also be relevant in this\ncontext. In particular, using augmentation for discrete normalizing \ufb02ows may improve performance\nand is an interesting avenue for future research.\n\n6 Scope and Future Work\n\nIn this section, we describe some limitations of ANODEs, outline potential ways they may be\novercome and list ideas for future work. First, while ANODEs are faster than NODEs, they are\nstill slower than ResNets. Second, augmentation changes the dimension of the input space which,\ndepending on the application, may not be desirable. Finally, the augmented dimension can be seen as\nan extra hyperparameter to tune. While the model is robust for a range of augmented dimensions, we\nobserved that for excessively large augmented dimensions (e.g. 
adding 100 channels to MNIST), the model tends to perform worse, with higher losses and NFEs. We believe the ideas presented in this paper could create interesting avenues for future research, including:\nOvercoming the limitations of NODEs. In order to allow trajectories to travel across each other, we augmented the space on which the ODE is solved. However, there may be other ways to achieve this, such as learning an augmentation (as in ResNets) or adding noise (similarly to Wang et al. (2018)).\nAugmentation for Normalizing Flows. The NFEs typically become prohibitively large when training continuous normalizing flow (CNF) models (Grathwohl et al., 2018). Adding augmentation to CNFs could likely mitigate this effect and we plan to explore this in future work.\nImproved understanding of augmentation. It would be useful to provide more theoretical analysis for how and why augmentation improves the training of NODEs and to explore how this could guide our choice of architectures and optimizers for NODEs.\n\n7 Conclusion\n\nIn this paper, we highlighted and analysed some of the limitations of Neural ODEs. We proved that there are classes of functions NODEs cannot represent and, in particular, that NODEs only learn features that are homeomorphic to the input space. We showed through experiments that this leads to slower learning and complex flows which are expensive to compute. To mitigate these issues, we proposed Augmented Neural ODEs which learn the flow from input to features in an augmented space. Our experiments show that ANODEs can model more complex functions using simpler flows while achieving lower losses, reducing computational cost, and improving stability and generalization.\n\nAcknowledgements\n\nWe would like to thank Anthony Caterini, Daniel Paulin, Abraham Ng, Joost Van Amersfoort and Hyunjik Kim for helpful discussions and feedback. Emilien gratefully acknowledges his PhD funding from Google DeepMind. 
Arnaud Doucet acknowledges support of the UK Defence Science and Technology Laboratory (Dstl) and Engineering and Physical Sciences Research Council (EPSRC) under grant EP/R013616/1. This is part of the collaboration between US DOD, UK MOD and UK EPSRC under the Multidisciplinary University Research Initiative. Yee Whye Teh\u2019s research leading to these results has received funding from the European Research Council under the European Union\u2019s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071.\n\nReferences\n\nLynton Ardizzone, Jakob Kruse, Sebastian Wirkert, Daniel Rahner, Eric W Pellegrini, Ralf S Klessen, Lena Maier-Hein, Carsten Rother, and Ullrich K\u00f6the. Analyzing inverse problems with invertible neural networks. arXiv preprint arXiv:1808.04730, 2018.\n\nJens Behrmann, David Duvenaud, and J\u00f6rn-Henrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.\n\nTian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In 32nd Conference on Neural Information Processing Systems, 2018.\n\nLaurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.\n\nWill Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.\n\nEldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.\n\nKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770\u2013778, 2016.\n\nDurk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 
10215\u201310224, 2018.\n\nHongzhou Lin and Stefanie Jegelka. ResNet with one-neuron hidden layers is a universal approximator. In 32nd Conference on Neural Information Processing Systems, 2018.\n\nYiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017.\n\nGeorge Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338\u20132347, 2017.\n\nLars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. arXiv preprint arXiv:1804.04272, 2018.\n\nBao Wang, Binjie Yuan, Zuoqiang Shi, and Stanley J Osher. EnResNet: ResNet ensemble via the Feynman-Kac formalism. arXiv preprint arXiv:1811.10745, 2018.\n\nE Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1\u201311, 2017.\n\nSaining Xie, Ross Girshick, Piotr Doll\u00e1r, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492\u20131500, 2017.\n\nLaurent Younes. Shapes and Diffeomorphisms, volume 171. Springer Science & Business Media, 2010.\n\nSergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.\n", "award": [], "sourceid": 1771, "authors": [{"given_name": "Emilien", "family_name": "Dupont", "institution": "Oxford University"}, {"given_name": "Arnaud", "family_name": "Doucet", "institution": "Oxford"}, {"given_name": "Yee Whye", "family_name": "Teh", "institution": "University of Oxford, DeepMind"}]}