NeurIPS 2020

Rational neural networks

Meta Review

The paper studies rational DNNs: deep neural networks in which low-degree rational functions are used as nonlinearities. It provides many interesting theoretical results on the approximation properties of rational DNNs, specifically in comparison to ReLU DNNs. It also provides two experiments (learning the solution of a 2-dimensional PDE, and an application in generative adversarial networks), which are meant to demonstrate that rational activations have advantages over other popular activations (ReLU, sine, tanh, polynomial, etc.) when used in actual DNN training.

The theory presented in the paper establishes the following:

(1) Consider two problems: (i) approximating, in the uniform norm, a function implemented by a rational DNN using ReLU DNNs; and (ii) approximating a function implemented by a ReLU DNN using rational DNNs. Theorem 3 shows that (ii) is much easier than (i): (ii) can be solved to eps-precision with log(log(1/eps)) parameters, whereas (i) requires at least log(1/eps) parameters (exponentially more).

(2) Theorem 4 shows that any function in a specific Sobolev space can be eps-approximated by a rational DNN using log(log(1/eps)) eps^{-d/n} parameters, where the inputs are d-dimensional and n controls the smoothness of the Sobolev space. By comparison, it was previously known that ReLU networks can solve the same problem with log(1/eps) eps^{-d/n} parameters.

These results are interesting for the approximation community. Their proofs are novel and rely on an interesting observation: the composition of rational functions of small degree can yield a rational function of very large degree (exponential in the number of layers) while using a relatively small number of parameters (linear in the number of layers).

My major concern (shared by most of the reviewers) is that these approximation results are not very relevant to actual DNN training.
The fact that there exists a parameter configuration with which a rational DNN approximates a given function does not mean that this configuration can be efficiently learned using SGD (or any other practical method). The only results supporting the relevance of the presented approximation theory are two experiments. The first is essentially an MSE regression with two-dimensional inputs (Section 4.1, Figure 2) and 10k points in the training set. In the second experiment, the authors train a DCGAN-style generator-discriminator pair on MNIST, with the ReLU/LeakyReLU activations (used in the original DCGAN paper) replaced by the proposed rational activations.

The results of the first experiment are convincing: the test MSE of the rational DNN indeed decreases considerably faster than for any other activation function considered. Unfortunately, I am not sure whether success in a 2-dimensional MSE regression with 10k training points (which is *a lot*) will transfer to more practical settings. The authors could at least train rational DNNs on MNIST and compare the results to vanilla ReLU DNN training. I wonder whether the authors tried this natural experiment.

The results of the second (GAN) experiment are difficult to interpret: the authors do not provide any quantitative metrics (e.g., Fréchet Inception Distance) and instead base their evaluation on visual inspection. In the field of unsupervised generative modeling, it has long been established (at least since [1]) that any results on GANs need to be supplemented with at least some quantitative evaluation metrics. Otherwise, visual evaluation can be used to support any desired conclusion.

In summary, if results on the pure approximation theory of rational DNNs are interesting enough for the NeurIPS community, I think the paper provides novel/strong/useful contributions. Otherwise, the relevance of the results to DNN training is not clear at this point.
[1] M. Lucic, K. Kurach, M. Michalski, S. Gelly, O. Bousquet. Are GANs Created Equal? A Large-Scale Study, 2017.
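As a small sanity check of the composition observation mentioned in the theory summary above, the following sketch tracks how the degree of a composed rational function grows with depth. It rests on several assumptions that are not from the paper: the scalar activation r(x) = x^3 / (x^2 + 1) is a hypothetical example chosen so that integer arithmetic stays exact, the layers are composed directly with no affine maps in between, only activation coefficients are counted as parameters, and all function names are illustrative.

```python
# Illustrative sketch: degree growth when a low-degree rational function is
# composed with itself. The activation r(x) = x^3 / (x^2 + 1) is a hypothetical
# example, not the activation used in the paper.
# Polynomials are lists of integer coefficients in ascending order of power.

def poly_mul(a, b):
    """Multiply two polynomials."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_add(a, b):
    """Add two polynomials, padding the shorter one with zeros."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]

def poly_pow(a, k):
    """Raise a polynomial to a non-negative integer power."""
    out = [1]
    for _ in range(k):
        out = poly_mul(out, a)
    return out

def degree(a):
    """Degree of a polynomial, ignoring trailing zero coefficients."""
    d = len(a) - 1
    while d > 0 and a[d] == 0:
        d -= 1
    return d

def compose(outer, inner):
    """Return outer(inner(x)) as an unreduced (numerator, denominator) pair."""
    (P, Q), (p, q) = outer, inner
    m = max(degree(P), degree(Q))

    def substitute(coeffs):
        # sum_i c_i * p^i * q^(m - i): clears the denominators of c_i * (p/q)^i
        total = [0]
        for i, c in enumerate(coeffs):
            if c != 0:
                term = poly_mul(poly_pow(p, i), poly_pow(q, m - i))
                total = poly_add(total, [c * t for t in term])
        return total

    return substitute(P), substitute(Q)

# Hypothetical type-(3, 2) activation: numerator x^3, denominator x^2 + 1.
R = ([0, 0, 0, 1], [1, 0, 1])
PARAMS_PER_LAYER = len(R[0]) + len(R[1])  # 4 + 3 = 7 coefficients

def degree_growth(layers):
    """Return (depth, activation coefficient count, degree) for each depth."""
    rows = []
    current = R
    for k in range(1, layers + 1):
        if k > 1:
            current = compose(R, current)
        num, den = current
        rows.append((k, k * PARAMS_PER_LAYER, max(degree(num), degree(den))))
    return rows

if __name__ == "__main__":
    for k, params, deg in degree_growth(5):
        print(f"layers={k}  parameters={params}  degree={deg}")
```

For this particular r, the numerator of every composition is a pure power of x and the denominator keeps constant term 1, so no common factor cancels and the reported degree is exact: it triples with each added layer, while the coefficient count grows by only 7 per layer. For general coefficients the code reports an upper bound on the degree, since it does not reduce the fraction.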