Review for NeurIPS paper: Stationary Activations for Uncertainty Calibration in Deep Learning

NeurIPS 2020

Review 1

Summary and Contributions: This paper introduces a method to derive activation functions from stationary Matern family kernels, allowing Bayesian deep nets to approximate uncertainties captured by a GP.

Strengths: The research topic looks interesting. To my knowledge, most recent efforts to connect NNs and GPs have gone into building links between kernels and NN-extracted features (e.g. deep kernels, kernel feature expansions), so it is nice to have another view based on the activation functions.

Weaknesses: 1. While this paper describes a method to derive activation functions from kernels, it does not seem to cover much of the work on kernel feature expansions (e.g. random Fourier features). Given that the proposed approach is also based on Fourier duality, it would be good to clarify the differences between applying it at the feature level and applying it at the activation function level. 2. Although I appreciate that this paper includes various tasks to show the quality of the approximation, some of the experiments could still be adjusted to give better illustrations. For instance: a) include methods based on kernel feature expansion, as discussed above. b) Both Figure 1 and Figure 2 show the approximated distributions, so it should be possible to perform quantitative comparisons on such toy datasets (e.g. compute a divergence or discrepancy). It seems the approximation depends on the number of hidden units as well as the number of Monte Carlo samples. c) For the later out-of-distribution detection task, it would also be better if some results from the original GPs were included (perhaps not on a huge image dataset, considering computational costs).
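To make the feature-level alternative mentioned in point 1 concrete, here is a minimal sketch of random Fourier features for a Matern-3/2 kernel. It is not taken from the paper under review. It assumes the standard result that the Matern spectral density is a Student-t distribution with 2*nu degrees of freedom and scale 1/lengthscale; the function names, the number of features, and the test inputs are all hypothetical choices made for illustration.

```python
import numpy as np

def matern32(r, lengthscale=1.0):
    """Exact Matern-3/2 kernel (unit variance) as a function of distance r."""
    s = np.sqrt(3.0) * np.abs(r) / lengthscale
    return (1.0 + s) * np.exp(-s)

def sample_matern_rff(n_features, dim, nu=1.5, lengthscale=1.0, seed=0):
    """Draw frequencies and phases for random Fourier features of a Matern-nu kernel.

    Assumption: frequencies follow a multivariate Student-t distribution with
    2*nu degrees of freedom and scale 1/lengthscale (the Matern spectral density).
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_features, dim))
    u = rng.chisquare(2.0 * nu, size=(n_features, 1))
    # Gaussian divided by sqrt(chi-square / dof) gives a Student-t sample.
    omega = z * np.sqrt(2.0 * nu / u) / lengthscale
    phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return omega, phase

def rff_features(x, omega, phase):
    """Map inputs of shape (n, dim) to features of shape (n, n_features)."""
    n_features = omega.shape[0]
    return np.sqrt(2.0 / n_features) * np.cos(x @ omega.T + phase)

# Compare the feature-space inner product with the exact kernel on a small grid.
x = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
omega, phase = sample_matern_rff(n_features=20000, dim=1)
phi = rff_features(x, omega, phase)
k_approx = phi @ phi.T
k_exact = matern32(x - x.T)
max_abs_error = np.max(np.abs(k_approx - k_exact))
```

In this construction the randomness sits in the feature map and the nonlinearity is a fixed cosine, which is the contrast with an activation-level construction that the reviewer asks the authors to spell out.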

Correctness: From a top-level check, the proposed method seems to work, although the experiments can be improved as discussed above. On the claim side, one issue is the "uncertainty calibration" claimed in the title and throughout the paper. In recent work this term has acquired a particular meaning in both classification and regression tasks, namely probability calibration. Verifying the level of calibration requires a set of definitions and evaluation metrics, which I do not see in this paper. I would be more comfortable if the authors renamed it to "uncertainty quantification", as the calibration part is not really addressed in this paper.
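As an illustration of the kind of evaluation metric meant here, one widely used measure of probability calibration in classification is the expected calibration error (ECE). The sketch below is a generic equal-width-bin version and is not taken from the paper or the review; the function name, the bin count, and the example inputs are assumptions for illustration only.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen class, values in [0, 1].
    correct:     1 if the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index in 0..n_bins-1; bins are (lo, hi], with 0.0 placed in the first bin.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Hypothetical predictions: two confident and right, two less confident and wrong.
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 0, 0])
```

A model is well calibrated under this metric when predictions made with confidence p are correct about a fraction p of the time, so that every bin contributes a gap close to zero.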

Clarity: Yes, readers with a related background should follow most of the content without issues.

Relation to Prior Work: As discussed above, my only question is about the differences between approximating GPs using kernel features and using the proposed method.

Reproducibility: Yes

Additional Feedback: As shown by my score, I generally like the idea and method. It would be better if the authors could address my questions above regarding kernel-extracted features versus kernel-extracted activation functions for GP approximation.
===========AFTER AUTHOR FEEDBACK=====================
The author feedback provides some initial answers to my question. I still think it might require some further work to explain the differences experimentally before I can vote for a clear accept. I would encourage the authors to include a detailed discussion and experiments on related work regarding kernel approximations so we can see the pros and cons.

Review 2

Summary and Contributions: This paper presents a novel neural network activation function that resembles some properties of the Matern kernel popular in Gaussian processes (GPs).

Strengths: The paper is well written and easy to follow. The best thing about the paper is that it discusses the various connections of the proposed method to other techniques and disciplines, which, unfortunately, is rare in many machine learning papers nowadays.

Weaknesses: As highlighted in the additional feedback section below,
* The motivation and impact are not clear.
* The paper requires more experimental evaluations.

Correctness: Technically correct though empirical validations require further work.

Clarity: Well-written.

Relation to Prior Work: The provided relations are clear. I have pointed out many others in the additional feedback section below.

Reproducibility: Yes

Additional Feedback: Thanks for the supplementary materials! The overall motivation is not clear. It is true that Matern kernels are good at capturing sharp transitions, as shown in Fig. 1. However, there are many other methods that achieve similar, if not better, results. For instance, we can learn kernels [1,2], use deep kernels [2], use spectral mixture kernels [3], use "neural-network kernels" [4], etc. Comparisons with [1]-[5] would provide further insights. Please also report MSE and AUC in addition to the accuracy. It is not clear why the paper discusses stationary kernels. Stationarity is indeed a limitation of classical kernel methods. Recent easy-to-use methods that capture nonstationarity provide better estimates (mean and variance), even in real-world examples [5]. It is not clear why stationarity is highlighted as an advantageous property. It is not clear why MC dropout is used instead of, say, the more stable SVI. It is true that MC dropout typically provides incorrect variance estimates for both in-distribution and out-of-distribution data, and that the accuracy is sensitive to the type of activation function. To the best of my knowledge, there is no clear explanation for why this happens. It is not clear why using the proposed activation function leads to better performance, especially for OOD. OOD performance has also not been adequately benchmarked as in, say, [6]. The metrics and the techniques provided in [6] can be used. Can recently proposed activation functions such as SIREN [7] achieve similar results? Other than the robustness against OOD samples, how would the paper be more impactful and beneficial to the NeurIPS readership?
[1] Black-box Quantiles for Kernel Learning, AISTATS’19
[2] Deep Kernel Learning, AISTATS’16
[3] Spectral Mixture Kernels for Multi-Output Gaussian Processes, NeurIPS’17
[4] Gaussian Processes for Machine Learning
[5] Automorphing Kernels for Nonstationarity in Mapping Unstructured Environments, CoRL’18
[6] Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles, NeurIPS’17
[7] Implicit Neural Representations with Periodic Activation Functions
POST-REBUTTAL COMMENTS
I do appreciate the idea of introducing Matern kernels as a new activation function for NNs. However, in my opinion, the contributions of the paper, in its current form, are not sufficient to accept the paper. Since the majority of the text simply explains existing work, I believe that the experiments should be strong enough, benchmarking against different techniques and using different datasets. This is especially important because the authors claim that the proposed activation is superior to existing NN activation functions. Note that this is a strong claim and would be revolutionary. This is the main reason why I hesitate to accept the paper straightaway and demand more comparisons on diverse datasets and against existing methods that can achieve similar results. I believe that giving the authors a chance to improve their paper and submit a more mature version to a different venue will result in a better validated and more impactful paper. The readership of a conference such as AISTATS would appreciate this paper more.
Answers to the rebuttal:
1. "Stationarity encodes conservative behaviour suitable for uncertainty quantification" - It is not clear what this means. As I highlighted in my original review, stationarity is actually a limitation rather than a feature.
2. "...tackling a different problem, where the kernel is not used for encoding specific prior information, but inferred from data..." - I do not fully agree with this statement. Since a kernel measures similarity, it always encodes prior information. This is especially obvious in geological mapping (e.g. kriging) and robotics applications (citations provided in the original review). The authors also agree in the rebuttal that "the choice of kernel/activation function is up to the modelling task and expert knowledge." Learning a generic kernel (see [1,3]) is always better than using a Matern kernel.
3. It is still not clear why MC dropout is used instead of, say, more stable variational inference. Also, the OOD performance has not been quantitatively evaluated as in [6].
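A quantitative OOD evaluation of the kind requested above is often summarised as an AUROC computed over an uncertainty score such as predictive entropy. The following is a minimal sketch of that computation, not the protocol of [6] or of the paper; the function names are hypothetical and the predictive distributions are made-up illustrative inputs, not results.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of categorical predictive distributions of shape (n, n_classes)."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -np.sum(probs * np.log(probs), axis=1)

def auroc(scores_in, scores_out):
    """Probability that an OOD input scores higher than an in-distribution input.

    Ties count as one half, which matches the rank-based definition of AUROC.
    """
    scores_in = np.asarray(scores_in, dtype=float)
    scores_out = np.asarray(scores_out, dtype=float)
    greater = (scores_out[:, None] > scores_in[None, :]).mean()
    ties = (scores_out[:, None] == scores_in[None, :]).mean()
    return greater + 0.5 * ties

# Hypothetical predictions: confident on in-distribution inputs, diffuse on OOD inputs.
probs_in = np.array([[0.97, 0.02, 0.01], [0.90, 0.05, 0.05]])
probs_out = np.array([[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]])
score = auroc(predictive_entropy(probs_in), predictive_entropy(probs_out))
```

An AUROC of 0.5 means the uncertainty score cannot separate in-distribution from OOD inputs, while 1.0 means every OOD input receives a higher score than every in-distribution input.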

Review 3

Summary and Contributions: In this paper, the authors propose a new activation method derived from the Matern kernel family. Besides the thorough analysis and explicit explanation of their method, a main contribution is the idea of leveraging the link between Gaussian process methods and neural networks, which is inspiring to the community.

Strengths: The paper has a solid theoretical standing and explicit explanation.

Weaknesses: The motivation for using the Matern family should be further clarified, though I agree that the Matern family is a reasonable choice in a neural network. Since there are many GP kernels, some of which have theoretical properties similar to those of the Matern family, a broader discussion could be added.

Correctness: Yes, the claims, method and empirical methodology are correct.

Clarity: Yes, the idea, method and explanation are well presented.

Relation to Prior Work: Related work focuses on OOD detection.

Reproducibility: Yes

Additional Feedback: (1) I've seen several recent works focusing on the Bayesian network, so that the learning process is calibrated with uncertainty estimation. Compared with work such as "Deep Neural Networks as Gaussian Processes", what are the differences, benefits and limitations of only using a Matern activation function? (2) I need more explanation of the motivation for choosing the Matern kernel. The Matern kernel is aimed at a better description of physical processes (Stein's kriging work, or Cressie and Christ.). We can view it as the solution of an SPDE. I'm not sure whether the authors have analyzed the effectiveness of a general SPDE form for the activation function. I personally feel that would be a higher-level theory. There are several kernel functions with a better capability of capturing intrinsic features from the data, and I would like to see comparisons with some of them. (3) What is the complexity of using the proposed activation function? Are there any efficiency comparison results? (4) A naive question: why not a Bayesian network, if uncertainty estimation is considered important? Or deep kernel learning (Wilson, Zhiting 2015)? I have always thought that a neural network calibrating uncertainty with kernel tricks is essentially a simplified Bayesian network.
########################################
I like the comparison with SIREN in the feedback, and I would suggest that the authors add it to the supplementary material if possible. The motivation seems to be a common concern among reviewers. On the whole, I think it is a very good submission.

Review 4

Summary and Contributions: This paper introduces a new set of activation functions for DNN that are based on the Matérn family of kernels in Gaussian process. The paper argues that these new activation functions allow one to impose stationarity, continuity, and various degrees of differentiability as priors on the resulting function. Such priors are beneficial for calibrating out-of-distribution uncertainty.
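For reference, the smoothness priors mentioned in this summary correspond to the parameter nu of the Matern family. The block below writes out the standard textbook form of the kernel and its common special cases; the notation (distance r, lengthscale \ell, variance \sigma^2, modified Bessel function K_\nu) is assumed here and may differ from the notation used in the paper.

```latex
% Matern kernel as a function of the distance r = \lVert x - x' \rVert
k_\nu(r) = \sigma^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)}
           \left( \frac{\sqrt{2\nu}\, r}{\ell} \right)^{\nu}
           K_\nu\!\left( \frac{\sqrt{2\nu}\, r}{\ell} \right)

% Common special cases
k_{1/2}(r) = \sigma^2 \exp\!\left( -\frac{r}{\ell} \right)
k_{3/2}(r) = \sigma^2 \left( 1 + \frac{\sqrt{3}\, r}{\ell} \right)
             \exp\!\left( -\frac{\sqrt{3}\, r}{\ell} \right)
k_{5/2}(r) = \sigma^2 \left( 1 + \frac{\sqrt{5}\, r}{\ell} + \frac{5 r^2}{3 \ell^2} \right)
             \exp\!\left( -\frac{\sqrt{5}\, r}{\ell} \right)

% Limit of infinite smoothness: the RBF (squared exponential) kernel
\lim_{\nu \to \infty} k_\nu(r) = \sigma^2 \exp\!\left( -\frac{r^2}{2 \ell^2} \right)
```

The kernel depends on the inputs only through r, which is what makes it stationary, and a GP with this kernel is k times mean-square differentiable if and only if nu > k, which is the sense in which nu controls the degree of differentiability.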

Strengths:
+ The motivation for the proposed activation functions is well formulated
+ The theoretical grounding appears sound
+ The empirical results illustrate how the proposed approach provides better OOD uncertainty on multiple datasets

Weaknesses:
- As noted by the authors, the inference procedure used in this work has not been given a substantial amount of attention (it offers a direction for future work)
- I found some of the empirical results a little hard to interpret

Correctness: The claims and methods appear correct to me.

Clarity: The paper is very well written.

Relation to Prior Work: The relations to previous work have been extensively discussed (though I'm not an expert in the field and cannot tell if the paper is missing some relevant references)

Reproducibility: Yes

Additional Feedback: I really enjoyed reading this paper. The motivation is well presented and the choice of methodology seems well justified. I like that, unlike some previous work, this work achieves better OOD uncertainty through changes to the model rather than, for example, to the optimisation procedure. I found Fig. 1 somewhat hard to interpret. As expected for GPs with a stationary kernel, the top row illustrates how the uncertainty increases quite fast away from the data. However, in the bottom row the uncertainty away from the data does not increase uniformly, and there are still large areas for which the uncertainty appears similar to that in the parts of the domain with the data. Is it fair to say that achieving the same error patterns in both rows in this figure (and I'm mostly referring to the Matern 5/2 case here) would be the ultimate goal in your case? Looking at Fig. 2, it appears that in all cases of NNs (in the second row) the noise that is estimated in parts of the domain with the data is higher than in the GP cases, i.e. the NNs seem to estimate higher observation noise (epsilon, where y = f(x) + epsilon) than the GPs. Consequently (for this and possibly other reasons), the uncertainty in the parts of the domain with no data is also higher in the NN cases than for the corresponding GPs. What is the reason for the higher observational noise in the case of NNs? Also, for the Matern kernels in the NN cases, it appears that the predictions converge to the mean of the data far from the data (in this case, on the sides of the domain). However, that doesn't seem to be the case for the RBF kernel. Do you know why this is? Unless I missed it, you don't seem to discuss the role of the lengthscale l of the Matern kernel in any of the experiments. Looking at Fig. 2, it appears that the lengthscale might be very short for the NNs. Is there a prior on this parameter? Is it being learnt or fixed to some value? Could you also comment further on the role of this parameter in the larger examples? Given that the activation functions are not monotonic, one might expect training to be a lot more prone to local minima and generally harder (which you allude to in the discussion). A more thorough discussion of this potential limitation would be interesting, though this could be phrased as a direction for future work. In general, I'd be happy to increase my score if you provided some intuition on the points I raised above. Thanks. Minor: Line 300 - the a