NeurIPS 2020

Rational neural networks

Review 1

Summary and Contributions: This article considers approximation and initialization of neural networks with rational activation functions. It also provides some numerical evidence that such networks can give reasonable performance on some tasks.

Strengths: (1) I think approximation theory using neural networks is an interesting subject and the quantitative results for approximation with rational networks in Section 3 are certainly of interest to the approximation theory community. (2) The suggestion to initialize near a ReLU-approximant and then allow the parameters of the rational function to be learned seems reasonable. This may find some practical applications, though I think it is somewhat unlikely.

Weaknesses: (1) In general, I think articles that study the pure approximation power of neural networks, without regard for what can actually be learned from a random initialization, run the risk of being irrelevant to the neural network community as a whole. That being said, as I mentioned above, I think there is enough here that, although this is indeed a weakness, it is not a fatal one. (2) The improvement from log(1/\eps) to log(log(1/\eps)) in Theorem 4 (for ReLU compared with rational) doesn't strike me as particularly interesting given that the \eps^{-d/n} in front is unchanged. (3) I am not entirely convinced by the numerical experiments. For example, in Figure 2, on the right (if I am understanding correctly) the training loss is plotted. Lower training loss doesn't necessarily mean better test accuracy/loss, so it's hard to say that the rational networks are better from such a plot.
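To give a sense of scale for point (2), the following sketch compares the two logarithmic factors with the shared \eps^{-d/n} factor. It assumes natural logarithms and ignores the constants hidden in the big-O notation; the function name and the values of d, n, and \eps are illustrative choices, not taken from the paper.

```python
import math

def size_factors(eps, d, n):
    """Return (shared, relu_log, rational_loglog) for accuracy eps.

    shared          = eps**(-d/n), common to both bounds
    relu_log        = log(1/eps), extra factor in the ReLU bound
    rational_loglog = log(log(1/eps)), extra factor in the rational bound
    Constants hidden by the big-O notation are ignored (assumption).
    """
    shared = eps ** (-d / n)
    relu_log = math.log(1.0 / eps)
    rational_loglog = math.log(relu_log)
    return shared, relu_log, rational_loglog

# Illustrative values only: d = input dimension, n = smoothness order.
for eps in (1e-2, 1e-4, 1e-8):
    shared, relu_log, rational_loglog = size_factors(eps, d=2, n=2)
    print(f"eps={eps:.0e}  eps^(-d/n)={shared:.1e}  "
          f"log={relu_log:.2f}  loglog={rational_loglog:.2f}")
```

In this range the shared factor grows much faster than either logarithmic term as \eps shrinks, which is the basis for the concern raised above.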

Correctness: Although I didn't check every detail, the claims appear to be correct.

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: POST REBUTTAL UPDATE: I take the authors' point about the near-optimality of \eps^{-d/n} from the DeVore et al. paper of 1989. Note, however, that in that work there is a continuity assumption on the parameter selection and function reconstruction maps. Is your construction obviously continuous in the function? In any case, SGD-based parameter selection need not be continuous. Indeed, some work by Yarotsky shows that if we allow for discontinuous parameter selection then NNs can do way better than the \eps^{-d/n} error rate. I also understand the authors' point about the depth decreasing to \log(\log(1/\eps)). That is indeed a good point, although I would not say it is obviously of practical interest, since nothing guarantees that this shallow representation can be learned in a numerically stable way. Finally, I am glad that the authors will put in a plot of validation accuracy. That is certainly helpful. My overall assessment remains positive (6/10).

Review 2

Summary and Contributions: This paper presents a neural network with a newly proposed activation function, aimed at avoiding the vanishing gradient problem or improving the performance of deep learning models. This is achieved by using a rational function, which has good approximation capacity.

Strengths: 1. The solution to the raised problem is novel, i.e., a rational function as the activation. 2. Strong theoretical study of the rational neural network. 3. Promising experimental results.

Weaknesses: 1. The solution configuration is not well justified, e.g., the order of (3, 2). 2. The evaluation section only uses a GAN as a general deep learning model; more general models could be tried, and the PDE is a specific problem. 3. Lack of comparison with the popular polynomial approximation.

Correctness: yes; yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: This work proposes a new activation function to serve deep learning architectures and provides a theoretical study of its complexity. The paper is well written and offers a high level of readability to most readers of the data mining community. However, the article would be significantly enhanced if the issues related to its motivation, technical analysis, and experiments were addressed. Detailed comments follow:
1) Motivation: This paper proposes a rational activation function as an alternative to ReLU, potentially avoiding the vanishing gradient problem.
* The problem raised in this paper, i.e., that some existing activation functions (e.g., sigmoid, logistic) can only handle smooth signals, is a significant problem in deep neural network optimization since their derivatives are zero for large values.
* The most popular activation, ReLU, has zero gradient for negative real values, and its performance lacks theoretical support.
2) Technical analysis: The paper proposes rational neural networks with a theoretical study of their approximation capacity and complexity.
* The order of the rational function is (3,2), but no justification is provided. A low degree can save time, but is there a better configuration, and why choose this type?
* There is another issue with the (3,2) type, since it can be reduced to a (2,2) plus a constant. That is why rational functions only have a larger denominator order; the reason for choosing a larger numerator order is not given.
* The result in Figure 1 shows the advantage of the rational NN in approximating ReLU with little oscillation. However, polynomial approximation is a more popular technique with less complexity and lower accuracy. The authors may need to compare a rational neural network with polynomials.
3) Experiment: The proposed framework is evaluated on a PDE problem and a GAN.
* The results on the PDE are promising, with training significantly faster than the baselines. The GAN test also shows its superiority over ReLU.
* However, rational neural networks are still a black box, and the authors may need to perform an ablation or sensitivity test regarding the order.
* Since the paper claims that it can help to solve the vanishing gradient issue, an additional experiment may be needed.

Review 3

Summary and Contributions: The paper studies neural networks equipped with rational activation functions. The authors begin with a good motivation on the importance of activation functions in networks and list drawbacks of widely used activation functions, e.g., ReLU, tanh, sigmoid. Rational activation functions of type (3, 2) are then considered throughout the paper. The approximation theory shows that rational functions can efficiently approximate ReLU functions, as well as nonparametric functions, e.g., Sobolev functions. In addition, empirical results on using rational networks for solving a PDE and generating fake images of the MNIST dataset demonstrate the improved performance of rational functions.

Strengths: Activation functions tie closely to the training and testing performance of neural networks. As the authors pointed out, smooth activation functions, e.g., sigmoid, tanh, can lead to the vanishing gradient issue, while the ReLU activation is only active for nonnegative inputs. Rational activation functions can potentially be a good alternative in some applications. From the theoretical results, the authors show that rational networks can approximate smooth functions more efficiently than ReLU networks. Even more importantly, the rational activation function used has a degree of at most 3. The empirical results also showcase the practicability of rational networks.

Weaknesses: The approximation theory of rational networks relies heavily on two works, Telgarsky and Yarotsky. In particular, the equivalence relation between ReLU activation and rational activation is based on the framework of Telgarsky, with Zolotarev functions introduced in place of the original Newman polynomials. This allows a tighter bound compared to Telgarsky. The universal approximation result is based on Yarotsky. The difference is that Yarotsky uses ReLU networks to approximate Taylor polynomials, while in this paper rational networks are used. The technical contributions should be elaborated. For example, what are the core new steps in replacing Newman polynomials with Zolotarev functions? In addition, the improvement in network size from using rational networks to approximate Sobolev functions is marginal. In particular, the number of parameters in the network is reduced from \epsilon^{-d/n} \log 1/\epsilon to \epsilon^{-d/n} \log \log 1/\epsilon. The improvement in the dependence on \epsilon most likely amounts to only a constant factor. More importantly, the authors do not compare the constants hidden in the big O notation. The experiments also have some flaws in implementation, and are incomplete to demonstrate the strength of rational networks.
1. In the part on solving the PDE, Line 244 reports the MSE of three different neural networks. This is presumably measured on the training set, rather than on a test set. Although this illustrates that rational networks can potentially have better fitting ability (to confirm this, one should train the three networks until stable, i.e., until the MSE does not visibly decrease), it is still a hasty conclusion that rational networks can perform better than ReLU and sinusoid networks (overfitting).
2. From the supplementary material, the networks tested have the same architecture, and the only difference is the activation function. However, the total number of trainable parameters in these networks is not the same: rational activation functions bring more trainable weight parameters. One would suspect that the improved performance may be due to the increased number of trainable parameters.
3. The experiment on GANs does not provide any quantitative results; therefore, the results do not add clarity to the performance of rational networks.
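The parameter-count concern can be checked with a short calculation. The sketch below is a hedged illustration: the layer widths are hypothetical, and it assumes one rational activation of type (3, 2) shared across each hidden layer, which contributes 4 numerator and 3 denominator coefficients; the paper's actual architecture and sharing scheme may differ.

```python
def mlp_param_count(widths):
    """Weights plus biases of a fully connected network with these layer widths."""
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))

def rational_extra_params(widths, num_degree=3, den_degree=2):
    """Extra trainable coefficients when each hidden layer shares one rational
    activation of type (num_degree, den_degree).
    Sharing one activation per layer is an assumption made for this sketch."""
    hidden_layers = len(widths) - 2
    per_activation = (num_degree + 1) + (den_degree + 1)
    return hidden_layers * per_activation

# Hypothetical architecture: 2 inputs, two hidden layers of width 50, 1 output.
widths = [2, 50, 50, 1]
base = mlp_param_count(widths)
extra = rational_extra_params(widths)
print(f"base parameters: {base}, extra rational coefficients: {extra}, "
      f"relative increase: {extra / base:.4%}")
```

Reporting both counts alongside the MSE would let readers judge whether the extra coefficients plausibly account for the gain.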

Correctness: The claims and method in the paper seem sound and correct, albeit not all the details were checked.

Clarity: The overall structure and flow of the paper are easy to follow. However, some mathematical claims are not rigorous. 1. Line 88: the big O notation hides a constant depending on the size and depth of the ReLU network. The size of a network is a vague term, and it is used across the theoretical claims. It would be better to phrase it as the total number of trainable parameters or neurons in the network. 2. The Zolotarev function is key to proving the claims in the paper; however, it is never defined or compared with Newman polynomials.

Relation to Prior Work: The contributions seem to be marginal, see the weakness section.

Reproducibility: Yes

Additional Feedback: -------------- Post Rebuttal --------------------- My main concerns about the paper are addressed in the response: 1) The authors provide experimental results on testing (validation), and demonstrate the good performance of the rational activation compared with other common activation functions; 2) The relation with Telgarsky's work is highlighted in the response. I agree that the technique is different (this paper considers the compositional structure). Accordingly, I raise my rating to marginally above the threshold. One weakness of the paper still stands in my opinion. The improvement from \log 1/\epsilon to \log \log 1/\epsilon is rather marginal. Such an improvement does indicate the advantage of rational networks, and shows that rational networks can have smaller depth, though.

Review 4

Summary and Contributions: The paper investigates neural networks whose activation functions are (trainable) rational functions of their input. The paper shows theoretical results about the approximation power of these rational networks. In particular, it shows that they improve on standard ReLU networks. The authors also perform experiments with rational activation functions of type (3,2), i.e., a polynomial of degree 3 in the numerator and a polynomial of degree 2 in the denominator. The authors are interested in the future applicability of rational neural networks to partial differential equations (PDEs). Hence they study the KdV equation in their experiments.
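For readers unfamiliar with the notation, a rational function of type (3,2) can be evaluated as in the minimal sketch below. The coefficient values are hypothetical placeholders chosen so that the denominator stays positive; they are not the initialization used in the paper, and in a rational network these coefficients would be trainable.

```python
def horner(coeffs, x):
    """Evaluate a polynomial whose coefficients are in increasing degree order."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def rational_activation(x, p, q):
    """Return P(x) / Q(x) for numerator coefficients p and denominator coefficients q."""
    return horner(p, x) / horner(q, x)

# Hypothetical coefficients for illustration only.
P = [0.1, 0.5, 1.0, 0.2]   # degree 3: 0.1 + 0.5 x + 1.0 x^2 + 0.2 x^3
Q = [1.0, 0.0, 1.0]        # degree 2: 1 + x^2, strictly positive on the reals

for x in (-2.0, 0.0, 1.0):
    print(x, rational_activation(x, P, Q))
```

Choosing a denominator with no real roots, as here, keeps the activation free of poles on the real line.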

Strengths: In the literature, most of the works that introduce “exotic” activation functions are motivated by empirical results, not supported by theoretical statements. In contrast with these works, the present paper proves mathematical results supporting the potentially improved approximation power of their fractional activation functions over ReLU.

Weaknesses: The authors note that the digits 1 generated by their rational network are all identical, while standard ReLU networks do not suffer from this problem. They point out that GANs are known to be hard to train and suggest that the rational GAN likely suffers from mode collapse. I appreciate the transparency of the authors, but I would have expected to see more compelling results on other tasks as well. I find the theoretical result of this work interesting, but from the experiments it is not clear yet that this result could have practical implications.

Correctness: The claims of this work are well grounded. The authors show in particular that: 1/ all ReLU networks can be epsilon-approximated by a rational network of size O(log(log(1/epsilon))) 2/ there exists a rational network that cannot be epsilon-approximated by a ReLU network of size less than O(log(1/epsilon)). These results suggest that rational networks could be more versatile than ReLU networks.

Clarity: The paper is well written overall.

Relation to Prior Work: The relation to prior work is adequately addressed. The authors explain that theoretical work by Telgarsky motivates the use of rational functions in deep learning. Molina et al. have experimented with rational activation functions (they defined the Padé Activation Unit), while Chen et al. propose to use high-degree rational activation functions in neural networks. In this work, the authors use low-degree rational functions as activation functions. Their composition in a deep network thus builds high-degree rational functions.

Reproducibility: Yes

Additional Feedback: The authors mention in several places of the paper “size” and “depth”. What is the definition of "size" here? (I would expect that size depends on depth) Is the architecture of the network assumed to be a regular MLP kind of neural net? (Can there be skip-layer connections for example?) ==== Post Rebuttal ==== Thanks to the authors for the nice rebuttal.