NeurIPS 2020

Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness


Review 1

Summary and Contributions: 1. This paper presents a way to adapt Gaussian processes to high-dimensional data by (i) extracting latent features with a distance-preserving network and (ii) introducing a distance-aware output layer via random Fourier features, an approximation to the GP. 2. Empirical results on benchmark datasets show superior performance in out-of-distribution detection.
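
To make the summary concrete, here is a minimal numpy sketch of a random-Fourier-feature output layer of the kind described above; the feature dimension, sampling distributions, and variable names are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def rff_gp_logits(h, W_rff, b_rff, beta):
    """Random-Fourier-feature approximation of an RBF-kernel GP output layer.

    h      : (n, d)  hidden representations from the feature extractor
    W_rff  : (d, D)  fixed weights sampled from N(0, 1)
    b_rff  : (D,)    fixed biases sampled from Uniform(0, 2*pi)
    beta   : (D, k)  trainable output weights (one column per class)
    """
    D = W_rff.shape[1]
    phi = np.sqrt(2.0 / D) * np.cos(h @ W_rff + b_rff)  # (n, D) random features
    return phi @ beta                                    # (n, k) class logits

# Example: 128-dim hidden features, 1024 random features, 10 classes.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 128))
W_rff = rng.normal(size=(128, 1024))
b_rff = rng.uniform(0.0, 2 * np.pi, size=1024)
beta = rng.normal(size=(1024, 10)) * 0.01
logits = rff_gp_logits(h, W_rff, b_rff, beta)
```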

Strengths: The method presented in the paper can improve out-of-distribution AUC without sacrificing much of the prediction performance.

Weaknesses: 1. The model consists of spectral-normalized hidden layers, which guarantee a bounded Lipschitz constant for the top NN layers, and a random-Fourier-feature-approximated Gaussian process as the last layer. The combination is new, but the overall method is not end-to-end, so it can be hard to balance these two components so that they work well with each other. In the supplementary material, I saw that the entire algorithm is an alternating minimization, so I am curious how you choose the initial parameters to make it work well. 2. The main strength is the empirical performance, but the paper does not release code.
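
For reference, a minimal sketch of the spectral-normalization step this weakness refers to, assuming the standard power-iteration estimate of the largest singular value and a soft bound c; this is not the authors' implementation.

```python
import numpy as np

def spectral_normalize(W, c=1.0, n_power_iter=10):
    """Rescale W so its spectral norm (largest singular value) is at most c.

    Estimates the leading singular value by power iteration, then applies
    W <- c * W / sigma only when sigma exceeds the bound, capping the layer's
    Lipschitz constant (w.r.t. the l2 norm) at c.
    """
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_power_iter):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v  # estimated largest singular value
    return W if sigma <= c else (c / sigma) * W
```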

Correctness: All techniques are properly used in this paper as far as I can see.

Clarity: 1. I cannot see how Eqn. (5) is used after Section 2. Also, Eqn. (5) itself seems confusing: why does the optimal solution p of the minimax problem inf_p sup_{p*} S(p, p*) depend on the inner solution p*? 2. Section 2 seems loosely presented. I suggest presenting the model first and postponing Section 2.

Relation to Prior Work: This paper includes related prior works as far as I can see.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: This paper proposes an input-distance-aware metric to measure the uncertainty of a deep learning model. The output of the DNN is the composition of g and h, in which g maps from the hidden representation to the output layer, and h maps from the input to the hidden representation. By using a Laplace approximation to a Gaussian process, which is inherently distance-preserving, g is made distance-aware w.r.t. h. By further using spectral normalization, h is made distance-aware w.r.t. x. Combining these, a distance-aware metric can be derived that measures the distance of a given sample from the training data, i.e., the extent of uncertainty. The method has been validated on several experiments and shows promising results on all of them.
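
As a rough illustration of the Laplace-approximated GP variance the summary describes, a hedged numpy sketch follows; the specific precision update and the ridge term tau are assumptions about a standard Laplace approximation over the output weights of a random-feature layer, not necessarily the paper's exact formula.

```python
import numpy as np

def laplace_posterior_variance(phi_train, p_train, phi_test, tau=1.0):
    """Laplace-approximated predictive variance for a GP output layer.

    phi_train : (n, D) random-feature embeddings of the training data
    p_train   : (n,)   predicted probabilities on the training data
    phi_test  : (m, D) embeddings of the test inputs
    tau       : prior precision / ridge term (assumption)

    The posterior precision over the output weights is accumulated from the
    training features; test points far from the training embeddings receive a
    larger predictive variance, which is the "distance-aware" behaviour.
    """
    D = phi_train.shape[1]
    precision = tau * np.eye(D)
    for phi_i, p_i in zip(phi_train, p_train):
        precision += p_i * (1.0 - p_i) * np.outer(phi_i, phi_i)
    cov = np.linalg.inv(precision)
    return np.einsum("md,de,me->m", phi_test, cov, phi_test)  # (m,) variances
```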

Strengths: (i) The authors present a solid and novel framework for a distance-aware metric that measures the uncertainty of a deep learning model. (ii) The idea of spectral normalization, which preserves the ordering of distances, is interesting, not only for uncertainty estimation but also as a potential benefit for deep learning models in general. (iii) The work conducts convincing experiments on various datasets (including a synthetic dataset), which show the advantage of the proposed method, especially the spectral normalization.

Weaknesses: Since spectral normalization bounds the l2 (spectral) norm of W_l, it seems that \mathcal{X} is implicitly equipped with the l2 norm as its distance metric, which may not be reasonable for image data that lie on a manifold.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: Obtaining accurate uncertainty estimates for deep learning models is crucial for the reliable application of NNs. This paper proposes a principled method to obtain uncertainty estimates without the need for an expensive sampling process. In essence, the method uses spectral normalization to regularize the weights of the neural network, together with a distance-aware output layer based on a Laplace-approximated neural Gaussian process, in order to promote input distance awareness, which the authors identify as a key property for reliable out-of-domain detection. Careful empirical experiments are also conducted to demonstrate the effectiveness of the proposed method under various settings.

Strengths: The paper is very well written; I really enjoyed going through it. All in all, this paper is a clear accept to me. The proposed method is sound and reasonable, and the authors identify a key condition for NNs to achieve high-quality uncertainty estimation, one that is absent in standard DNNs and was previously neglected by researchers in the field.

Weaknesses: A major complaint I have is the lack of an ablation study. The proposed method involves two components: the neural GP layer and the spectral normalization. To my understanding, these two components can be implemented independently? What are the relative contributions of the two components? It would be interesting to see such an analysis (e.g., the grid sketched below) to really understand the importance of each. Another limitation is the lack of comparison to other baseline algorithms for OOD detection. While the authors demonstrate the effectiveness of OOD detection compared to other uncertainty-estimation algorithms, there is, in my opinion, a lack of comparison against methods designed explicitly for OOD detection.
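
One way to read the requested ablation is as a 2x2 grid over the two components; the sketch below is purely illustrative, and build_model and the evaluation loop are hypothetical placeholders rather than the paper's (unreleased) code.

```python
from itertools import product

def build_model(use_spectral_norm: bool, use_gp_layer: bool):
    """Hypothetical constructor toggling the two components independently."""
    ...  # placeholder: real model construction is not public

for use_sn, use_gp in product([False, True], repeat=2):
    model = build_model(use_spectral_norm=use_sn, use_gp_layer=use_gp)
    # Train and report OOD AUC for each of the four variants to isolate
    # the contribution of spectral normalization vs. the GP output layer.
```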

Correctness: The claims and proposed method seem to be correct in my opinion.

Clarity: The paper is well written overall.

Relation to Prior Work: Prior works are discussed satisfactorily.

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions: The authors propose a method for measuring the uncertainty of deep network predictions. The final layer is replaced by a Gaussian process (GP) formulation, and the uncertainty information is obtained without multiple forward passes over the same data. The results show that the distance between data points is mainly used as the uncertainty measure, whereas previous methods use the distance to the decision boundary. It is unclear why the distance to the training data should be used as the uncertainty measure.

Strengths: The method needs only one forward pass, while the baseline methods in the paper need multiple passes to generate an output distribution.

Weaknesses: After reading the paper, it remains arguable whether distance awareness is decisive for determining uncertainty. Distance awareness is a property of conventional local methods such as those using kernels. The synthetic-data experiment could likely be reproduced with a conventional Gaussian process without neural networks (a sketch of such a baseline is given below). The effect of replacing the last layer seems obvious as far as distance awareness goes, but it is unclear whether this distance-awareness property is indeed advantageous. From this perspective, the theoretical property in Equation (6) is a property of conventional local methods, and it is hard to see why the specific setting in this paper is necessary. The deep learning literature has claimed that transforming the input space into a disentangled space is what makes these algorithms powerful, which is contrary to this paper's explanation that preserving distances in the original space is what matters.
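
To illustrate the point that distance awareness is already a property of kernel methods, here is a small scikit-learn sketch of the conventional-GP baseline suggested above; the two-cluster synthetic data are a stand-in assumption, not the paper's actual 2-D benchmark.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Two well-separated 2-D clusters as a stand-in for the synthetic experiment.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal([-2.0, 0.0], 0.5, size=(100, 2)),
                    rng.normal([ 2.0, 0.0], 0.5, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

gpc = GaussianProcessClassifier(kernel=RBF(length_scale=1.0)).fit(X, y)

# A point far from both clusters: the kernel method's predictive probability
# reverts toward 0.5 (maximum uncertainty) purely because of its distance
# from the training data, with no neural network involved.
far_point = np.array([[0.0, 8.0]])
print(gpc.predict_proba(far_point))
```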

Correctness: The experiments comparing against conventional methods look fine, but the claims about the advantages of the proposed method are unclear.

Clarity: The writing and structure of the paper are acceptable.

Relation to Prior Work: The modification from the previous method is clearly explained.

Reproducibility: Yes

Additional Feedback: