Review for NeurIPS paper: On the Expressiveness of Approximate Inference in Bayesian Neural Networks

NeurIPS 2020

Review 1

Summary and Contributions: This paper provides a new perspective on typical inference methods (mean-field VI and Monte Carlo dropout) for BNNs. It shows that those methods have a fundamental limitation in representing uncertainty between data points.

Strengths: * The paper discusses an important shortcoming of typical inference methods for BNNs and may inspire research that overcomes these limitations. * The contribution is novel and is of interest to the NeurIPS community.

Weaknesses: * The work seems limited to the regression case. Could you discuss an extension to classification?

Correctness: * All claims and derivations seem to be correct. * The empirical evaluation seems sound but could be a bit more thorough.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: The related work is clearly discussed.

Reproducibility: Yes

Additional Feedback: * Do you believe that your target posterior (using a Gaussian prior) actually represents uncertainty well? Recent work showed limitations of the Bayesian posterior in BNNs using Gaussian priors. * It would be great if you could discuss what properties the variational family needs to satisfy to represent in-between uncertainty correctly. UPDATE AFTER REBUTTAL: The authors addressed my questions and I have decided to keep my score of 6.

Review 2

Summary and Contributions: The authors consider two criteria for approximate posterior distributions: 1-The approximating family contains good approximations to the true posterior. 2-The approximate inference method used must be able to find them. They prove that: -For single-hidden-layer models there are simple settings where no fully-factorized Gaussian or MC dropout distribution can be a good fit. -For deeper models, Criterion 1 is satisfied, but the authors offer evidence that deep BNNs still have similar pathologies affecting Criterion 2.

Strengths: -Theorems 1 and 2 are well presented and interesting. The authors acknowledge that the specific situations that they cover might be limited, but provide some evidence in Appendix A that MFVI/MCDO might be bad in more general cases. -Theorem 3 is similarly good. -A number of the experiments are clever and well executed and serve to distinguish between different related hypotheses.

Weaknesses: -I would be inclined to move Figure 9 to the main body. It's actually very interesting that a deeper network is able to pass this test. It puts the emphasis of your argument in section 3.2 onto the claim that it is the inference method, not the model, which is problematic. -Relatedly, I think this interpretation could be clearer in the introduction. In the paragraph starting line 49 you could be clearer that Theorem 3 is a result about models, not inference, and that you have other evidence that the deeper models are fine, but that there are problems with inference. -Your title also seems to be affected by a confusion here. Part of your work is about the expressiveness of approximate *posteriors* and part of it is about the success of approximate *inference*. -I think you slightly overstate the "empirical evidence that in spite of this flexibility VI in deep BNNs still leads to distributions that suffer from similar pathologies to the shallow case". Your experiments are on fairly small data compared to typical deep learning practice, and you should probably acknowledge that a number of authors have gotten pretty good results with both of these inference methods. I think you can make this statement slightly weaker without detracting from the importance of your work but bringing it more in line with what your experiments actually show. Minor: -Note that Figures 6 and 7 are useful for existence claims but are being used here for "for all" claims. That is, if you can show an uncertainty plot that is good, you show that a method can be good. If you show one that is bad, you do not show that the method cannot be good. This makes Figure 7 less clear cut, especially if you do not describe the range of hyperparameters considered.

Correctness: The results in Table 1 look off to me. While it is possible that the deeper models with random acquisition just perform worse, it seems more likely that the hyperparameters need to be separately tuned for shallower and deeper models, and it seems you are still using the same hyperparameters for all models. I think this paper is impressive enough to be accepted anyhow, but if you were able to improve on this experiment I think it might help you feel more confident that your worries about inference in deeper models apply in practice.

Clarity: Yes. The paper is a bit dense at points, but it is well written and does a good job with proof sketches to make otherwise complicated proofs interpretable.

Relation to Prior Work: It would be sensible to acknowledge that a number of authors have had considerable success with deep VI/MCDO models, and that in relation to those your results in section 5 should really be seen as raising a question about the inference rather than offering an answer to Criterion 2.

Reproducibility: Yes

Additional Feedback: Section 2.2 does not seem like an important point for you to make. There is extensive prior work (as you acknowledge) on BNN priors and this is not core to your argument. It's probably enough to acknowledge somewhere in a sentence that you pick priors that are not perverse and create space for figures that are more important to your paper that have been moved to the appendix. ---------------------- Thanks for your author response, and for engaging so swiftly with my question about the baseline. I really like this paper. That said, I do think you imply that your results for deeper networks are stronger than they actually are. I think there's a good chance that the effects you identify are much more pronounced for small numbers of datapoints in low-dimensional data. This is reinforced by the observation that your Naval results are quite sensitive to using 55 datapoints instead of the full set. The deeper model performs worse for small amounts of data, but not for larger, suggesting it is overparameterized/underregularized. This makes it unclear that the effects at depth that you identify have the same source as the effects in shallow models that you identify. Naval is also a slightly odd dataset, which single-layer NNs can almost exactly fit, whereas other datasets do not have this problem. Did you try other UCI datasets? Also, I think sample-based MI estimators for regression can be quite bad. Basically, I remain quite sceptical about this experiment. I think you should either, then, do experiments in a setting with lots of high-dimensional data or which otherwise resolves these issues (not at all necessary to achieve publication) or make it clearer that there are limitations to the evidence that your single-layer results extend to deeper models.
I've heard some people who have read the paper summarizing it as having said things like "MFVI/MCDO has pathology X" and not differentiating the single-layer and deeper settings, which I think is partly encouraged by some of the current framing. I'm giving an accept score on the strength of the theoretical results and the creativity of the experiments, but I'd be disappointed if the mismatch between implied conclusions and the experiments provided were not addressed for the camera-ready version.

Review 3

Summary and Contributions: The paper analyzes the issues of mean-field variational inference and Monte Carlo dropout in uncertainty quantification for Bayesian neural networks. The paper points out that for single-hidden-layer NNs, MFVI and MCDO cannot increase the predictive uncertainty in between two separate regions of low uncertainty, and that for deep NNs the issue can be overcome in theory. But the paper empirically shows similar issues (less severe but still there) in learning deep NNs.

Strengths: 1. A very important topic. Uncertainty quantification should be a central task in Bayesian learning, but is largely ignored by the community. 2. Profound analysis and solid results. 3. Interesting conclusions.

Weaknesses: It would be good to discuss some ideas about how to address or alleviate the issues of MFVI and MCDO in uncertainty estimation for BNNs.

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: My takeaway from the paper: The paper reports that a factorized Gaussian posterior, a common choice in BNNs, underestimates the variance in regions where there are not many observations. Under a single-hidden-layer neural network model, this behaviour is severe, as shown in Figures 2 and 3. With more layers, the behaviour becomes less severe although it persists.

Strengths: The strengths I believe the paper has: 1. While it is well known that a factorized Gaussian posterior typically underestimates the posterior variance (i.e., is overconfident), this paper seems to formally state it in Theorem 1. 2. Observing this behaviour in deep networks seems to add a stepping stone in the study of BNNs, as illustrated in Figures 4 and 5 and Table 1.

Weaknesses: The paper's weaknesses I found: 1. While it's nice to see a formal statement of the behaviour of underestimating variance (i.e., being overconfident) under the factorized Gaussian posterior combined with variational inference, this is something known. In fact, many entry-level machine learning textbooks contain this, e.g., take a look at C. Bishop on approximate inference. Also, the Rényi divergence VI paper by Li and Turner states the pathology of the mean-field Gaussian posterior under VI (alpha goes to 1 in the Rényi divergence) being overconfident in Figure 1. I understand that these results are shown under shallow models, e.g., Bayesian linear regression, logistic regression, etc. The proposed result for the single-layer neural net could be viewed as a type of shallow model with a specific link function (nonlinearity). Hence, I don't really see much novelty in the findings reported in the submitted paper. 2. Another weakness I found is that it is not clear to me how this squashed variance in regions with few observations changes with a different objective function (not just a squared difference) and different nonlinearities in the shallow and deep networks, e.g., the shallow network examples shown in Figures 2 and 3. Is the flatness of the variance in the in-between regions a result of ReLU? Also, is the symmetry in the resulting variance estimate a result of the squared error in the function? If one chooses to use different nonlinearities and losses, would this behaviour also remain the same? If not, what would change? Are there any setups where this pathological behaviour is less severe? I have read the rebuttal, and would like to keep my score the same, as the novelty aspect seems weak.
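
For reference, the textbook effect mentioned in point 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's setting: the target is a correlated Gaussian rather than a BNN posterior, and the function name `mean_field_variances` and the value `rho = 0.9` are hypothetical, chosen only for illustration. For a Gaussian target, the factorized Gaussian minimizing KL(q || p) has variances equal to the reciprocals of the diagonal of the precision matrix (see Bishop's treatment of factorized approximations), and these never exceed the true marginal variances.

```python
import numpy as np


def mean_field_variances(cov):
    """Variances of the factorized Gaussian q that minimizes KL(q || p)
    when p is Gaussian with covariance `cov`: 1 / diag(precision)."""
    precision = np.linalg.inv(cov)
    return 1.0 / np.diag(precision)


# Hypothetical 2-D target with unit marginal variances and strong correlation.
rho = 0.9
cov = np.array([[1.0, rho],
                [rho, 1.0]])

true_marginal = np.diag(cov)    # marginal variances of the target
mf = mean_field_variances(cov)  # each entry equals 1 - rho**2, approximately 0.19

print("true marginal variances:", true_marginal)
print("mean-field variances:   ", mf)
```

The stronger the correlation, the larger the gap, which is the sense in which the factorized approximation is overconfident. Whether the same mechanism explains the in-between behaviour in Figures 2 and 3 is a separate question that this sketch does not address.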

Correctness: The claims seem to be correct, although I did not go through all the steps in the proofs.

Clarity: Yes, I think this paper is clearly written.

Relation to Prior Work: Yes, the relation to the previous work was clearly and correctly mentioned.

Reproducibility: Yes

Additional Feedback: